Crawler Classroom (26) | Implementing Distributed Crawling with the scrapy-redis Framework

Author: 小怪聊职场 | Published 2018-04-12 23:16 | Read 244 times

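It is time to cover the scrapy-redis framework. Before diving in, here are four questions:

1) We want to go distributed, so what are the advantages of distributed crawling?
2) Why does Scrapy not support distributed crawling?
3) What problems must be solved to make Scrapy support distributed crawling?
4) How does scrapy-redis solve these problems?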

Let's answer them one by one:

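Distributed crawling has two main advantages:
1) It makes full use of the bandwidth of multiple machines to speed up crawling.
2) It makes full use of the IP addresses of multiple machines to speed up crawling.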
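In the chapter Crawler Classroom (16) | Scrapy Framework Structure and How It Works, we already covered Scrapy's execution flow, shown in Figure 26-1:
1) When a spider (Spider) wants to crawl the page at some URL, it initializes a Request object with that URL, sets a callback function, and submits the Request to the engine (Scrapy Engine).
2) The Request enters the scheduler (Scheduler), where it is queued according to some algorithm; later the scheduler dequeues it and sends it to the downloader.

In Scrapy, this whole flow runs on a single machine. Other servers cannot take requests from the current Scheduler's queue, and deduplication is also performed in the current server's memory. This is why Scrapy does not support distributed crawling out of the box.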

Figure 26-1: Scrapy architecture diagram
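The single-machine limitation can be illustrated with a minimal sketch. `LocalScheduler` below is a hypothetical stand-in, not Scrapy's actual Scheduler class; it only mimics the two pieces of state mentioned above, a request queue and a dedup set, both held in the memory of one process. The URL is a placeholder.

```python
from collections import deque


class LocalScheduler:
    """Hypothetical per-process scheduler (not Scrapy's real class)."""

    def __init__(self):
        self.queue = deque()  # pending URLs, visible only to this process
        self.seen = set()     # dedup state, also local to this process

    def enqueue(self, url):
        """Queue the URL unless this scheduler has already seen it."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


# Two schedulers model two crawler processes on different servers.
server_a = LocalScheduler()
server_b = LocalScheduler()

url = "http://example.com/page/1"
accepted_by_a = server_a.enqueue(url)
accepted_by_b = server_b.enqueue(url)   # B cannot know that A already has this URL
duplicate_on_a = server_a.enqueue(url)  # within one process, dedup works
```

Both schedulers accept the same URL, so the page would be fetched twice, and neither server can take work from the other's queue.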

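From the analysis above, making Scrapy support distributed crawling requires solving three problems:
1) The requests queue must be managed centrally.
2) The deduplication logic must also be managed centrally.
3) The data storage logic must also be managed centrally.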
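The three requirements above can be sketched roughly as follows. This is an illustration only, not how scrapy-redis is implemented: `InMemoryStore` is a hypothetical stand-in for one shared Redis server, so the example runs without Redis, and it borrows the names of the Redis commands LPUSH, RPOP, RPUSH and SADD. The `Worker` class and the key names `requests`, `seen` and `items` are likewise made up for this example.

```python
from collections import defaultdict, deque


class InMemoryStore:
    """In-memory stand-in for one shared Redis server (illustration only)."""

    def __init__(self):
        self.lists = defaultdict(deque)
        self.sets = defaultdict(set)

    def lpush(self, key, value):
        self.lists[key].appendleft(value)

    def rpush(self, key, value):
        self.lists[key].append(value)

    def rpop(self, key):
        return self.lists[key].pop() if self.lists[key] else None

    def sadd(self, key, value):
        """Return 1 if the value was newly added, 0 if it was already present."""
        if value in self.sets[key]:
            return 0
        self.sets[key].add(value)
        return 1


class Worker:
    """Hypothetical crawler process; every worker talks to the same store."""

    def __init__(self, store):
        self.store = store

    def schedule(self, url):
        # 2) Centralized dedup: only the first worker to add the URL gets 1 back.
        if self.store.sadd("seen", url):
            # 1) Centralized requests queue.
            self.store.lpush("requests", url)
            return True
        return False

    def crawl_one(self):
        url = self.store.rpop("requests")
        if url is not None:
            # 3) Centralized storage of results (the "scraping" is faked here).
            self.store.rpush("items", "scraped:" + url)
        return url


store = InMemoryStore()
worker_a, worker_b = Worker(store), Worker(store)

first = worker_a.schedule("http://example.com/page/1")
second = worker_b.schedule("http://example.com/page/1")  # rejected as a duplicate
fetched = worker_b.crawl_one()  # B picks up work that A queued
```

Because the queue, the dedup set and the results all live in the shared store, any worker can reject a URL another worker has already scheduled and can pick up work it did not queue itself.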
How does scrapy-redis solve these problems?
Start with the scrapy-redis GitHub page, https://github.com/rmax/scrapy-redis, whose Usage section spells out the settings to configure:

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS  = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'

These settings cover three main points: SCHEDULER handles the queue (task distribution), DUPEFILTER_CLASS handles deduplication (task dedup), and RedisPipeline handles saving (data storage).

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}
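Since RedisPipeline stores items in Redis "for post-processing", a separate consumer could read them back. The sketch below is one possible way to do that, under stated assumptions: items sit in a Redis list under the key pattern from REDIS_ITEMS_KEY, they are serialized as JSON, and `client` is any object with a redis-py style `lpop(key)` method that returns None once the list is empty. The helper names `items_key` and `drain_items` are hypothetical.

```python
import json


def items_key(spider_name, pattern="%(spider)s:items"):
    """Build the items key from the REDIS_ITEMS_KEY pattern shown in the settings."""
    return pattern % {"spider": spider_name}


def drain_items(client, spider_name):
    """Pop and decode every item currently stored for the given spider.

    Assumes `client.lpop(key)` returns None when the list is empty and that
    each stored value is a JSON document (str or bytes).
    """
    key = items_key(spider_name)
    items = []
    while True:
        raw = client.lpop(key)
        if raw is None:
            break
        items.append(json.loads(raw))
    return items
```

With redis-py installed and a Redis server running, the client would typically be something like `redis.Redis(host='localhost', port=6379)`, matching the REDIS_HOST and REDIS_PORT values above.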

Creating the spider also requires one change.
The original, non-distributed spider looks like this:

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'

    def parse(self, response):
        # do stuff
        pass

To go distributed, change Spider to RedisSpider.

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'

    def parse(self, response):
        # do stuff
        pass
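A RedisSpider takes its start URLs from Redis, under the key given by REDIS_START_URLS_KEY in the settings above, so something has to push URLs there. Below is a minimal sketch of such a seeding helper. The function names are hypothetical, and `client` is assumed to be a redis-py style object offering `lpush(key, *values)` and `sadd(key, *values)`; the `as_set` flag mirrors REDIS_START_URLS_AS_SET.

```python
def start_urls_key(spider_name, pattern="%(name)s:start_urls"):
    """Build the start-URLs key from the REDIS_START_URLS_KEY pattern."""
    return pattern % {"name": spider_name}


def seed_start_urls(client, spider_name, urls, as_set=False):
    """Push start URLs for a RedisSpider and return the key that was used.

    Uses SADD when `as_set` is True (for REDIS_START_URLS_AS_SET = True),
    otherwise LPUSH. Does nothing when `urls` is empty.
    """
    key = start_urls_key(spider_name)
    if not urls:
        return key
    if as_set:
        client.sadd(key, *urls)
    else:
        client.lpush(key, *urls)
    return key
```

For the spider above, the key is `myspider:start_urls`; every worker running the same spider reads from that one key.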

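Sorry, I am running out of time, so this chapter stops here; I worked overtime too late today. Readers, get some rest as well, and we will continue tomorrow.
In the next chapter, we will analyze the scrapy-redis source code to see how the framework solves the three problems of distributing tasks, deduplicating tasks, and gathering the data collected by all spiders in one place.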
