PyCharm + Scrapy
It's been well over half a year since I last used Scrapy, so it's time to pick it back up.
Straight to the point, here is the crawling target:
Start URL: http://quotes.toscrape.com/
Goal: parse the quote text, author, and tags from every entry on every page, and save them to a JSON file or a MongoDB database.
Open a terminal and run:
scrapy startproject quotetutorial  # generates a project named quotetutorial in the current directory
Then cd quotetutorial and run:
scrapy genspider quotes quotes.toscrape.com  # creates a spider for the target site
The project structure now looks like this:
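For reference, the layout produced by scrapy startproject (plus the spider file created by genspider) is roughly the following; exact contents may vary slightly by Scrapy version:

quotetutorial/
    scrapy.cfg                # deploy/configuration file
    quotetutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            quotes.py         # the spider generated by genspider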
A quick explanation of each file:
items.py: defines the Item classes that hold the scraped data
settings.py: project configuration values
pipelines.py: processes the Items extracted by the Spider; typical uses are cleaning HTML data, validating scraped data (checking that an Item contains certain fields), deduplicating and dropping repeats (see the sketch after this list), and saving results to a file or a database
middlewares.py: middleware components
spiders/quotes.py: the spider module
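To illustrate the "deduplicate and drop" use case mentioned above, here is a minimal pipeline sketch; the class name DuplicatesPipeline and the choice of the text field as the uniqueness key are my own assumptions, not something the project template generates:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    # Drop any item whose quote text has already been seen during this crawl.
    def __init__(self):
        self.seen_texts = set()

    def process_item(self, item, spider):
        if item['text'] in self.seen_texts:
            raise DropItem('Duplicate quote: %s' % item['text'])
        self.seen_texts.add(item['text'])
        return item

Like any pipeline, it only takes effect after being registered in ITEM_PIPELINES in settings.py.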
Next, edit quotes.py:
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
        next_page = response.css('.pager .next a::attr(href)').extract_first()  # relative URL of the next page
        if next_page is not None:
            url = response.urljoin(next_page)  # join it with the base URL
            yield scrapy.Request(url=url, callback=self.parse)  # call back into parse for the next page
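Before running the whole spider, the CSS expressions can be sanity-checked interactively with scrapy shell; the snippet below is just such a quick check, not part of the project code:

# In a terminal: scrapy shell http://quotes.toscrape.com/
# Inside the shell, try the same expressions used in parse():
response.css('.quote .text::text').extract_first()           # text of the first quote
response.css('.quote .author::text').extract()               # all authors on the page
response.css('.pager .next a::attr(href)').extract_first()   # relative URL of the next page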
Then the pipelines.py file:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from pymongo import MongoClient


class TextPipeline(object):
    # Post-process each item: cap the length of the text field.
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    # Save items to a MongoDB database.
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))  # insert_one(); the old insert() is gone in newer pymongo
        return item

    def close_spider(self, spider):
        self.client.close()
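After a crawl, a quick way to confirm that MongoPipeline actually wrote something is to query the database directly with pymongo. This sketch assumes MongoDB is running locally with the MONGO_URI/MONGO_DB values configured below, and pymongo 3.7+ for count_documents; the collection name equals the item class name because of name = item.__class__.__name__:

from pymongo import MongoClient

client = MongoClient('localhost')
db = client['quotestutorial']
print(db['QuotetutorialItem'].count_documents({}))  # how many quotes were stored
print(db['QuotetutorialItem'].find_one())           # inspect one sample document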
Then items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class QuotetutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
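A scrapy.Item behaves like a dict with a fixed set of declared fields, which is why the spider can assign item['text'] and so on; assigning a field that was never declared raises a KeyError, catching typos early. A tiny illustration (not part of the project files):

from quotetutorial.items import QuotetutorialItem

item = QuotetutorialItem()
item['text'] = 'Some quote'
item['author'] = 'Somebody'
item['tags'] = ['example']
print(dict(item))        # {'text': 'Some quote', 'author': 'Somebody', 'tags': ['example']}
# item['auther'] = 'x'   # would raise KeyError because 'auther' is not a declared field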
Then modify settings.py:
SPIDER_MODULES = ['quotetutorial.spiders']
NEWSPIDER_MODULE = 'quotetutorial.spiders'

MONGO_URI = 'localhost'
MONGO_DB = 'quotestutorial'

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'quotetutorial.pipelines.TextPipeline': 300,   # a lower number means higher priority: runs first
    'quotetutorial.pipelines.MongoPipeline': 400,
}
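With the pipelines registered, the spider can be run from the project directory; to get the JSON file mentioned in the goal, Scrapy's built-in feed export can be used instead of (or alongside) MongoDB:

scrapy crawl quotes                  # run the spider; items flow through the pipelines into MongoDB
scrapy crawl quotes -o quotes.json   # additionally dump the scraped items to a JSON file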
One thing worth noting here:
Scrapy has its own data extraction mechanism, called Selectors, which parse HTML with XPath or CSS expressions and are used much like ordinary selectors.
The same parse method with the CSS selectors swapped out for XPath looks like this:
    def parse(self, response):
        quotes = response.xpath(".//*[@class='quote']")
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.xpath(".//span[@class='text']/text()").extract()[0]
            author = quote.xpath(".//span/small[@class='author']/text()").extract()[0]
            tags = quote.xpath(".//div[@class='tags']/a/text()").extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
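Both forms return the same kind of Selector objects, so they are interchangeable and can even be chained on one selector. For example, the two expressions below pull out the same quote text:

quote.css('.text::text').extract_first()
quote.xpath(".//span[@class='text']/text()").extract_first()
# Mixing is also allowed, e.g. response.css('.quote').xpath('.//small[@class="author"]/text()')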