The scrapy crawler framework
An introduction to the basic usage of scrapy's Spider, Scheduler, Downloader, and Pipeline components, as well as the basics of scrapy-redis.
Contents
- scrapy
- Spider
Using XPath to extract data from a response:
from scrapy.selector import HtmlXPathSelector  # legacy selector class; deprecated in newer Scrapy

hxs = HtmlXPathSelector(response=response)
pages = hxs.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract()
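In modern Scrapy (1.0+) the selector hangs off the response object directly, so the same extraction can be written without HtmlXPathSelector:
pages = response.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract()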
Yield the next URL to crawl back to the scheduler:
import re
from bs4 import BeautifulSoup
from scrapy.http import Request

def parse(self, response):
    soup = BeautifulSoup(response.text, 'html.parser')
    # grab the "previous post" anchor tag
    a = soup.find(name='a', attrs={'class': 'p_n_p_prefix'})
    # pull every run of digits out of its href; the first one is the post id
    pattern = re.compile(r'\d+')
    post_id = pattern.findall(a.get('href'))[0]
    # build the next URL to crawl
    next_url = 'http://www.cnblogs.com/post/prevnext?postId={0}&blogId=133379&dateCreated=2018%2F5%2F23+20%3A28%3A00&postType=1'.format(post_id)
    yield Request(url=next_url, callback=self.parse)
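For context, a minimal sketch of the spider class these fragments would live in (the name and start URL are assumptions, not from the original):
import scrapy

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'                           # hypothetical spider name
    start_urls = ['http://www.cnblogs.com/']   # hypothetical entry point

    def parse(self, response):
        # extraction and yield logic as shown above
        ...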
- Scheduler
The scheduler's factory method and its configuration (excerpted from scrapy.core.scheduler; load_object and job_dir come from scrapy.utils.misc and scrapy.utils.job):
@classmethod
def from_crawler(cls, crawler):
    settings = crawler.settings                                  # project settings
    dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])   # dedupe filter class
    dupefilter = dupefilter_cls.from_settings(settings)
    pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])  # priority queue
    dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])      # on-disk queue
    mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])    # in-memory queue
    logunser = settings.getbool('LOG_UNSERIALIZABLE_REQUESTS', settings.getbool('SCHEDULER_DEBUG'))  # log unserializable requests?
    return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
               stats=crawler.stats, pqclass=pqclass, dqclass=dqclass, mqclass=mqclass)
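The dedupe filter built here is consulted when a request is enqueued; roughly (paraphrased from Scrapy's scheduler, details vary by version):
def enqueue_request(self, request):
    # drop the request if it is filterable and its fingerprint was seen before
    if not request.dont_filter and self.df.request_seen(request):
        return False
    # otherwise push it onto the in-memory queue, falling back to disk
    ...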
- Downloader
Downloader middleware
- Pipeline
Data persistence. A pipeline implements three hooks:
called per item: def process_item(self, item, spider):
called at spider start: def open_spider(self, spider):
called at spider close: def close_spider(self, spider):
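A minimal sketch wiring these hooks together, writing items to a local file (the file name and JSON-lines format are assumptions):
import json

class FilePipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.f = open('items.jsonl', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # append each item as one JSON line
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # hand the item on to the next pipeline, if any

    def close_spider(self, spider):
        self.f.close()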
- scrapy-redis
- Deduplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
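Under the hood, RFPDupeFilter keeps request fingerprints in a redis set; roughly (paraphrased from scrapy-redis, details vary by version):
def request_seen(self, request):
    fp = self.request_fingerprint(request)  # sha1 over method, url, body
    added = self.server.sadd(self.key, fp)  # redis SADD returns 0 if already present
    return added == 0                       # True means "seen before, skip it"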
- Setting start URLs
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
# push the start URL onto the list the spider reads from (key: chouti:start_urls)
conn.lpush('chouti:start_urls', 'https://dig.chouti.com')
# to wipe the current redis database:
conn.flushdb()
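On the spider side, the redis key must match; a minimal sketch using scrapy-redis's RedisSpider (the spider name is an assumption):
from scrapy_redis.spiders import RedisSpider

class ChoutiSpider(RedisSpider):
    name = 'chouti'                  # hypothetical spider name
    redis_key = 'chouti:start_urls'  # same key the start URL was pushed to

    def parse(self, response):
        ...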
- Data persistence
Enable the pipeline in settings:
ITEM_PIPELINES = { 'myproject.pipelines.PricePipeline': 300, }
Write your own item pipeline; each one must implement:
process_item(self, item, spider)
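A minimal sketch of what the PricePipeline registered above might look like (the VAT logic and field names are assumptions, modeled on the Scrapy docs):
from scrapy.exceptions import DropItem

class PricePipeline(object):
    vat_factor = 1.15  # hypothetical VAT multiplier

    def process_item(self, item, spider):
        if item.get('price'):
            item['price'] = item['price'] * self.vat_factor
            return item
        raise DropItem('missing price in %s' % item)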
- Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # default; options: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'  # redis key the scheduler stores pending requests under
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for data written to redis; pickle by default
SCHEDULER_PERSIST = True  # keep the scheduler queue and dedupe records on close (True = keep, False = flush)
SCHEDULER_FLUSH_ON_START = False  # flush the scheduler queue and dedupe records on start (True = flush)
SCHEDULER_IDLE_BEFORE_CLOSE = 10  # max seconds to wait when popping from an empty scheduler queue
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # redis key the dedupe fingerprints are stored under
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class implementing the dedupe rule
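Putting it together, a minimal settings.py sketch for a scrapy-redis project (the redis host/port are assumptions for a local instance; RedisPipeline is scrapy-redis's bundled pipeline that serializes items into redis):
# where to find redis
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

# route scheduling and dedupe through redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True

# optionally store serialized items in redis as well
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 300}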