
scrapy-redis distributed crawler

Author: zenos876 | Published 2019-08-16 14:09

Out of the box, Scrapy is a single-machine crawler: its request queue and scheduler both live on the local host. To crawl with multiple machines, the request queue and the scheduler must be shared, which is what scrapy-redis provides by backing both with a Redis server.

1. Install scrapy-redis

pip install scrapy-redis

2. Create the project

scrapy startproject scrapy_redis_demo

3. Create the spider

scrapy genspider cnblogs news.cnblogs.com

4. Import RedisSpider from the scrapy_redis module, inherit from it, and set redis_key

from scrapy_redis.spiders import RedisSpider


class CnblogsSpider(RedisSpider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']
    # Redis list the spider pops its start URLs from,
    # replacing the usual start_urls attribute.
    redis_key = 'cnblogs:start_urls'
    # start_urls = ['http://news.cnblogs.com/abc']
    curr_page = 1  # page counter for following pagination
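
The generated spider still needs a parse callback before it can do anything with the responses. A minimal sketch of two methods to add to CnblogsSpider is shown below; the CSS selectors are illustrative assumptions, not taken from this article, and must be adjusted to the real page markup.

    def parse(self, response):
        # Assumed selector for the news-list page; adjust to the real markup.
        for href in response.css('h2.news_entry a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Bare-bones item; a real project would define an Item class.
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }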

5. Configure settings.py

# scrapy-redis configuration
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# redis://[user:pass]@host:port/db
REDIS_URL = 'redis://@localhost:6379'

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Persist the request queue and dupefilter in Redis;
# do not clear them automatically when the spider closes.
SCHEDULER_PERSIST = True

# Store scraped item in redis for post-processing.
# ITEM_PIPELINES = {
#     'scrapy_redis.pipelines.RedisPipeline': 300
# }
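
If the commented-out RedisPipeline is enabled, every scraped item is serialized and pushed onto a Redis list, by default named <spider>:items, so cnblogs:items here. Another process can then consume them, for example from redis-cli:

lrange cnblogs:items 0 4    # peek at the first five stored items
llen cnblogs:items          # count items waiting for post-processing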

6. Start redis-server
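
On the machine that will host the shared queue, the default settings are enough for a local test; the config path below is just an assumption for a typical install:

redis-server
# or with an explicit config file:
# redis-server /etc/redis/redis.conf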

7. Start the crawler

scrapy crawl cnblogs
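
Because the start URL comes from Redis rather than start_urls, the spider starts up idle and waits for a URL to appear under its redis_key. The same command can be run on any number of worker machines; the only requirement is that each machine's REDIS_URL in settings.py points at the same shared Redis server:

scrapy crawl cnblogs    # idles until a URL is pushed to cnblogs:start_urls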

8. Start redis-cli and push the initial request

lpush cnblogs:start_urls https://news.cnblogs.com
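
Once the URL is pushed, the idle spiders pick it up and the crawl starts. Progress can be watched from redis-cli; the key names below are scrapy-redis defaults (<spider>:requests for the shared queue, <spider>:dupefilter for the fingerprint set):

keys cnblogs:*            # start_urls, requests, dupefilter (and items, if enabled)
zcard cnblogs:requests    # pending requests; the priority queue is a sorted set
scard cnblogs:dupefilter  # URL fingerprints seen so far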


9. Results

[Screenshot: crawl results]

[Screenshot: URL fingerprints stored in Redis]
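
The "URL fingerprints" shown above are what RFPDupeFilter stores in the cnblogs:dupefilter set: a SHA1 hash of each request's method, canonicalized URL, and body. One can be reproduced with Scrapy's own helper (request_fingerprint is the pre-2.7 API; newer Scrapy versions expose a RequestFingerprinter component instead):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# The same hex digest RFPDupeFilter adds to the dupefilter set.
fp = request_fingerprint(Request('https://news.cnblogs.com'))
print(fp)  # 40-character SHA1 hex string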
