Scrapy-Redis Distributed Crawler

Author: zenos876 | Published 2019-08-16 14:09

    Out of the box, Scrapy is a single-machine crawler: its request queue and scheduler live locally. To crawl with multiple machines, both must be shared across the cluster, which is exactly what scrapy-redis provides by keeping the queue and the deduplication state in Redis.

    1. Install scrapy-redis

    pip install scrapy-redis
    

    2. Create a project

    scrapy startproject scrapy_redis_demo
    

    3. Generate a spider

    scrapy genspider cnblogs news.cnblogs.com
    

    4. Import RedisSpider from scrapy_redis, subclass it, and set redis_key

    from scrapy_redis.spiders import RedisSpider

    class CnblogsSpider(RedisSpider):
        name = 'cnblogs'
        allowed_domains = ['news.cnblogs.com']
        # Start URLs are read from this Redis list instead of a local
        # start_urls attribute, so any machine can enqueue work.
        redis_key = 'cnblogs:start_urls'
        # start_urls = ['http://news.cnblogs.com/abc']
        curr_page = 1
    
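    The generated spider has no parse method yet. Below is a minimal sketch of one; the CSS selector and the /n/page/<n>/ pagination pattern are assumptions about the cnblogs news markup, not verified selectors, so adapt them to the real page.

    from scrapy_redis.spiders import RedisSpider

    class CnblogsSpider(RedisSpider):
        name = 'cnblogs'
        allowed_domains = ['news.cnblogs.com']
        redis_key = 'cnblogs:start_urls'
        curr_page = 1

        def parse(self, response):
            # NOTE: illustrative selector; inspect the page and adjust.
            for href in response.css('h2.news_entry a::attr(href)').getall():
                yield response.follow(href, callback=self.parse_detail)

            # Hypothetical pagination: walk a fixed number of list pages.
            if self.curr_page < 10:
                self.curr_page += 1
                yield response.follow(f'/n/page/{self.curr_page}/',
                                      callback=self.parse)

        def parse_detail(self, response):
            yield {
                'url': response.url,
                'title': response.css('title::text').get(),
            }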

    5. Configure settings.py

    # scrapy-redis configuration
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    
    # redis://[user:pass]@host:port/db
    REDIS_URL = 'redis://@localhost:6379'
    
    # Schedule requests using a priority queue. (default)
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
    
    # Alternative queues.
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
    
    # Persist the queue and dupefilter in Redis; do not clear them when the spider closes
    SCHEDULER_PERSIST = True
    
    # Store scraped item in redis for post-processing.
    # ITEM_PIPELINES = {
    #     'scrapy_redis.pipelines.RedisPipeline': 300
    # }
    
    
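    If the commented-out RedisPipeline above is enabled, every scraped item is serialized (JSON by default) and pushed onto a Redis list, keyed <spider name>:items by default, i.e. cnblogs:items here. A separate process can then drain that list for post-processing; a minimal sketch, assuming the default key and encoding:

    import json
    import redis

    client = redis.from_url('redis://localhost:6379')

    while True:
        # BLPOP blocks until an item arrives; it returns (key, value).
        _, raw = client.blpop('cnblogs:items')
        item = json.loads(raw)
        print(item)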

    6. Start redis-server
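    Before going further, it can be worth checking that the address in REDIS_URL is actually reachable. A quick check with redis-py (already installed as a dependency of scrapy-redis):

    import redis

    # Use the same URL as REDIS_URL in settings.py.
    client = redis.from_url('redis://localhost:6379')
    print(client.ping())  # True if redis-server is up and reachable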

    7. Start the crawler. Run this on every machine in the cluster; since there is no start URL in Redis yet, each spider simply idles and waits.

    scrapy crawl cnblogs
    

    8. Open redis-cli and push the initial request

    lpush cnblogs:start_urls https://news.cnblogs.com
    
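    As soon as a URL lands on this list, one of the waiting spiders pops it and starts crawling; the requests it discovers then flow through the shared Redis queue to all workers. The same seeding can also be done from code, which is handy when another service triggers crawls; a sketch with redis-py:

    import redis

    client = redis.from_url('redis://localhost:6379')

    # Pushing to the list named by redis_key wakes up a waiting spider.
    client.lpush('cnblogs:start_urls', 'https://news.cnblogs.com')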


    9. Check the results

    After the crawl, Redis holds the URL fingerprints that the shared dupefilter used for deduplication, and, because SCHEDULER_PERSIST is enabled, the request queue survives between runs as well.
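    Assuming scrapy-redis's default key layout (<spider>:dupefilter for the fingerprint set, <spider>:requests for the queue), a quick look with redis-py:

    import redis

    client = redis.from_url('redis://localhost:6379')

    # Request fingerprints (SHA1 hashes) are stored in a Redis set.
    print(client.scard('cnblogs:dupefilter'), 'unique request fingerprints')

    # The default PriorityQueue keeps pending requests in a sorted set;
    # with SCHEDULER_PERSIST = True it survives spider restarts.
    print(client.zcard('cnblogs:requests'), 'requests still queued')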
