Building a Distributed Crawler with Scrapy-Redis

Author: 简单1典 | Published 2020-04-29 13:52

    I. The Scrapy-Redis framework
    GitHub repo: https://github.com/rmax/scrapy-redis
    git clone https://github.com/rmax/scrapy-redis.git

    1. Requirements
    Python 2.7, 3.4 or 3.5
    Redis >= 2.8
    Scrapy >= 1.1
    redis-py >= 2.10
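
    With those in place, the library itself installs from PyPI:

    pip install scrapy-redis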

    2. Settings file configuration

    # Enable scheduling and storing the requests queue in Redis.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Ensure all spiders share the same duplicates filter through Redis.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Store scraped items in Redis for post-processing (optional).
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300,
    }
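
    scrapy-redis connects to Redis on localhost:6379 by default. If the workers and Redis live on different machines, point the connection at the shared server in the same settings file; the address below is a placeholder, not something from the original example:

    # Shared Redis server (placeholder address, adjust to your setup).
    REDIS_URL = 'redis://192.168.1.100:6379'
    # Equivalent alternative:
    # REDIS_HOST = '192.168.1.100'
    # REDIS_PORT = 6379

    # Optional: keep the queue and dupefilter in Redis when a spider closes,
    # so an interrupted crawl can be resumed where it left off.
    SCHEDULER_PERSIST = True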

    3. Examples

    1) A pull-based spider, generated with scrapy genspider myspider www.abc.com:

    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = 'myspider'
        # Start URLs are read from the Redis list 'myspider:start_urls'
        # (the default key '<name>:start_urls'), not from a start_urls attribute.

        def parse(self, response):
            # do stuff
            pass
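
    The spider idles until a URL appears in the Redis list 'myspider:start_urls'. With the default local-Redis settings, a minimal redis-py sketch for seeding it (reusing the placeholder domain from the genspider command) looks like this:

    import redis

    # Assumes Redis is running on localhost:6379, the scrapy-redis default.
    r = redis.Redis()
    # Push a seed URL onto the list the idle spider is polling.
    r.lpush('myspider:start_urls', 'http://www.abc.com')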
    

    2) A crawl-rule spider, generated with scrapy genspider -t crawl myspider www.abc.com:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    from scrapy_redis.spiders import RedisCrawlSpider

    class MySpider(RedisCrawlSpider):
        name = 'myspider'

        rules = (
            # Follow every extracted link and hand the response to parse_item.
            Rule(LinkExtractor(), callback='parse_item'),
        )

        def parse_item(self, response):
            # do stuff
            pass
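
    Either spider can then be started on as many machines as needed, e.g. with scrapy runspider myspider.py; every process shares the same Redis queue and dupefilter, so each request is handed out only once across the whole cluster. Seeding more URLs into myspider:start_urls, via redis-cli or the redis-py sketch above, keeps all the workers fed.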
    
