Native Scrapy is a single-machine crawler: its request queue and scheduler live in local memory, so two Scrapy processes cannot see each other's pending requests. To crawl with multiple machines, the request queue and the scheduler (including the duplicate filter) must be shared, which is what scrapy-redis provides by moving them into Redis.
1. Install scrapy-redis
pip install scrapy-redis
2. Create a project
scrapy startproject scrapy_redis_demo
3. Create a spider
scrapy genspider cnblogs news.cnblogs.com
4. Import RedisSpider from scrapy_redis, subclass it, and set redis_key
from scrapy_redis.spiders import RedisSpider

class CnblogsSpider(RedisSpider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']
    redis_key = 'cnblogs:start_urls'
    # start_urls is no longer used; seed URLs are pushed to the redis_key list instead
    # start_urls = ['http://news.cnblogs.com/abc']
    curr_page = 1
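A RedisSpider pulls its start URLs from the redis_key list and then crawls normally through parse, which the generated skeleton does not define yet. Below is a minimal sketch of the finished spider; the CSS selectors, the listing-page URL pattern, and the max_pages cap are illustrative assumptions, not verified against the live news.cnblogs.com markup:

import scrapy
from scrapy_redis.spiders import RedisSpider

class CnblogsSpider(RedisSpider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']
    redis_key = 'cnblogs:start_urls'   # seed URLs are read from this Redis list
    curr_page = 1
    max_pages = 3  # hypothetical cap, just to keep the example bounded

    def parse(self, response):
        # Hypothetical selectors -- inspect the real page before relying on them.
        for entry in response.css('h2.news_entry a'):
            yield {
                'title': entry.css('::text').get(),
                'url': response.urljoin(entry.attrib.get('href', '')),
            }
        # Naive pagination via a guessed listing-page URL pattern.
        if self.curr_page < self.max_pages:
            self.curr_page += 1
            yield scrapy.Request(
                f'https://news.cnblogs.com/n/page/{self.curr_page}/',
                callback=self.parse,
            )

Note that curr_page is per-process state; in a distributed run it is the shared Redis dupefilter, not this counter, that prevents two machines from fetching the same page twice.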
5. Configure settings.py
# scrapy-redis configuration
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Connection URL format: redis://[user:pass]@host:port/db
REDIS_URL = 'redis://localhost:6379'
# Schedule requests using a priority queue (default).
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
# Persist the request queue and dupefilter in Redis instead of clearing them when the spider closes.
SCHEDULER_PERSIST = True
# Store scraped items in Redis for post-processing.
# ITEM_PIPELINES = {
#     'scrapy_redis.pipelines.RedisPipeline': 300,
# }
6. Start redis-server
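For a default local installation, starting the stock server with no extra configuration is enough:
redis-server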
7. Start the spider (run the same command on every crawler machine; each process idles until a seed URL appears in Redis)
scrapy crawl cnblogs
8. Start redis-cli and push the initial request
lpush cnblogs:start_urls https://news.cnblogs.com
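The same seed can also be pushed programmatically. A sketch using the redis-py client, with connection details assumed to match the REDIS_URL configured above:

import redis

# Connect to the same Redis instance the scheduler uses.
r = redis.Redis(host='localhost', port=6379)
# LPUSH the seed URL onto the list named by the spider's redis_key.
r.lpush('cnblogs:start_urls', 'https://news.cnblogs.com')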
9. Final results
[Screenshot: redis-cli output showing the resulting keys — the URL fingerprint set kept by the dupefilter — and the crawled news entries with their reader comments.]
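The outcome can also be checked from Python. This sketch assumes the RedisPipeline from step 5 was uncommented, so scraped items are JSON-serialized into the cnblogs:items list:

import json
import redis

r = redis.Redis(host='localhost', port=6379)
# Count of request fingerprints kept by the shared dupefilter (a Redis set).
print(r.scard('cnblogs:dupefilter'))
# Pop one scraped item; requires RedisPipeline to be enabled in settings.
raw = r.lpop('cnblogs:items')
if raw:
    print(json.loads(raw))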