5. Introduction to scrapy-redis


Author: 零_WYF | Published 2018-01-18 19:49

    1. Installing scrapy-redis

    On Windows:
    pip install scrapy-redis
    or: python -m pip install scrapy-redis

    2. What scrapy-redis does

    Role: scrapy-redis provides Redis-backed components for Scrapy.
    Features: multiple spider instances can share a single redis request queue, which makes it well suited to broad, multi-domain crawls.
    Distributed post-processing: scraped items are pushed into a redis queue, so that queue can be shared and consumed by as many post-processing worker processes as you need, as in the sketch below.
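
    A minimal sketch of such a worker, assuming the default RedisPipeline behaviour (items JSON-encoded by ScrapyJSONEncoder and pushed to the key '<spider name>:items') and a spider named 'myspider':

    import json
    import redis

    # Connect to the same Redis instance the spiders write to.
    r = redis.StrictRedis(host='localhost', port=6379)

    while True:
        # BLPOP blocks until an item is available; with timeout=30 it returns
        # None after 30 seconds so the loop simply keeps waiting.
        popped = r.blpop('myspider:items', timeout=30)
        if popped is None:
            continue
        _key, raw = popped
        item = json.loads(raw.decode('utf-8'))
        # ... do whatever post-processing you need here ...
        print(item)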

    3. Requirements

    Python 2.7 or 3.4+
    Redis >= 2.8
    Scrapy >= 1.1
    redis-py >= 2.10
    Official project and usage examples:
    https://github.com/rmax/scrapy-redis

    4. Usage

    Add the following settings to your project:

    # Enables scheduling storing requests queue in redis.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    
    # Ensure all spiders share same duplicates filter through redis.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    
    # Default requests serializer is pickle, but it can be changed to any module
    # with loads and dumps functions. Note that pickle is not compatible between
    # python versions.
    # Caveat: In python 3.x, the serializer must return strings keys and support
    # bytes as values. Because of this reason the json or msgpack module will not
    # work by default. In python 2.x there is no such issue and you can use
    # 'json' or 'msgpack' as serializers.
    #SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"
    
    # Don't cleanup redis queues, allows to pause/resume crawls.
    #SCHEDULER_PERSIST = True
    
    # Schedule requests using a priority queue. (default)
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
    
    # Alternative queues.
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
    
    # Max idle time to prevent the spider from being closed when distributed crawling.
    # This only works if queue class is SpiderQueue or SpiderStack,
    # and may also block the same time when your spider start at the first time (because the queue is empty).
    #SCHEDULER_IDLE_BEFORE_CLOSE = 10
    
    # Store scraped item in redis for post-processing.
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300
    }
    
    # The item pipeline serializes and stores the items in this redis key.
    #REDIS_ITEMS_KEY = '%(spider)s:items'
    
    # The items serializer is by default ScrapyJSONEncoder. You can use any
    # importable path to a callable object.
    #REDIS_ITEMS_SERIALIZER = 'json.dumps'
    
    # Specify the host and port to use when connecting to Redis (optional).
    #REDIS_HOST = 'localhost'
    #REDIS_PORT = 6379
    
    # Specify the full Redis URL for connecting (optional).
    # If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
    #REDIS_URL = 'redis://user:pass@hostname:9001'
    
    # Custom redis client parameters (i.e.: socket timeout, etc.)
    #REDIS_PARAMS  = {}
    # Use custom redis client class.
    #REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'
    
    # If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
    # command to add URLs to the redis queue. This could be useful if you
    # want to avoid duplicates in your start urls list and the order of
    # processing does not matter.
    #REDIS_START_URLS_AS_SET = False
    
    # Default start urls key for RedisSpider and RedisCrawlSpider.
    #REDIS_START_URLS_KEY = '%(name)s:start_urls'
    
    # Use other encoding than utf-8 for redis.
    #REDIS_ENCODING = 'latin1'
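
    REDIS_PARAMS is passed as keyword arguments to the redis client constructor, so connection options such as a password or socket timeout can be supplied there. A minimal sketch (the values are placeholders for your own deployment):

    # Sketch only: custom connection parameters forwarded to redis-py.
    # 'password' and 'socket_timeout' are standard redis-py client arguments.
    REDIS_PARAMS = {
        'password': 'secret',     # placeholder password
        'socket_timeout': 30,     # seconds
    }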
    

    5. Feeding a spider from redis

    The class scrapy_redis.spiders.RedisSpider lets a spider read its URLs from redis. URLs in the redis queue are processed one by one; if the first request yields further requests, the spider handles those before fetching the next URL from redis.
    Create a file myspider.py:

    from scrapy_redis.spiders import RedisSpider
    
    class MySpider(RedisSpider):
        name = 'myspider'
        # By default start URLs are read from the redis key '<name>:start_urls',
        # i.e. 'myspider:start_urls' for this spider.
    
        def parse(self, response):
            # extract data here and yield items or follow-up requests
            pass
    
    Run the spider:
    scrapy runspider myspider.py
    
    Because the scheduler and dupefilter live in redis, the same command can be run in several processes or on several machines at once, and all instances will share one request queue.
    
    Push URLs into redis:
    redis-cli lpush myspider:start_urls http://baidu.com
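
    Equivalently, URLs can be seeded from Python with redis-py; a minimal sketch, assuming the default list-based start-urls key (with REDIS_START_URLS_AS_SET enabled you would use sadd instead of lpush):

    import redis

    # 'myspider:start_urls' matches the default REDIS_START_URLS_KEY pattern
    # '%(name)s:start_urls' for a spider named 'myspider'.
    r = redis.StrictRedis(host='localhost', port=6379)
    r.lpush('myspider:start_urls', 'http://baidu.com')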
    
