Building a Distributed Crawler with Scrapy-Redis

By 简单1典 | Published 2020-04-29 13:52

I. The Scrapy-Redis Framework
GitHub repo: https://github.com/rmax/scrapy-redis
git clone https://github.com/rmax/scrapy-redis.git
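Alternatively, install the released package from PyPI:

pip install scrapy-redis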

1. Requirements
Python 2.7, 3.4 or 3.5
Redis >= 2.8
Scrapy >= 1.1
redis-py >= 2.10

2. Settings Configuration

# Enable scheduling and store the requests queue in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share the same duplicates filter through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Store scraped items in Redis for post-processing (optional).
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
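The Redis connection itself also has to be configured in settings.py. A minimal sketch, assuming a Redis server on localhost at the default port (REDIS_HOST, REDIS_PORT, REDIS_URL, and SCHEDULER_PERSIST are all standard scrapy-redis settings):

# Redis connection (assuming a local instance on the default port).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
# Or give a full connection URL instead, which takes precedence:
# REDIS_URL = 'redis://user:pass@hostname:6379'

# Optional: keep the queue and dupefilter in Redis when the spider closes,
# so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True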

3. An Example

1) scrapy genspider myspider www.abc.com
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    # Key this spider reads start URLs from (defaults to '<name>:start_urls').
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # do stuff
        pass
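Unlike a regular spider, a RedisSpider has no start_urls; it idles until a URL appears under its redis_key. Start the spider, then seed the queue from redis-cli (using the example domain from the genspider command above):

scrapy crawl myspider
redis-cli lpush myspider:start_urls http://www.abc.com/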

2) scrapy genspider -t crawl myspider www.abc.com
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

class MySpider(RedisCrawlSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'

    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        # do stuff
        pass
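Since the request queue and the duplicates filter both live in Redis, the same spider can be launched on any number of machines pointed at the same Redis server; each instance pops requests from the shared queue and pushes newly discovered links back, which is exactly what makes the crawl distributed:

scrapy crawl myspider   # run on each node, all configured with the same REDIS_HOST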
