Using Scrapy_redis and Scrapy_splash Together

Author: haoxuan_xia | Published 2021-11-10 11:31

    1. Configuration

    1.1 Scrapy_redis settings

    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # fingerprint generation and dedup class
    SCHEDULER = "scrapy_redis.scheduler.Scheduler" # scheduler class
    SCHEDULER_PERSIST = True # persist the request queue and fingerprint set
    ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400} # pipeline that stores items in redis
    REDIS_URL = "redis://host:port" # redis url
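REDIS_URL follows the standard redis URI scheme, which can also carry an optional password and database number. A minimal stdlib sketch of how such a URL decomposes (the host, port, and password values here are placeholders, not from the original article):

```python
from urllib.parse import urlparse

# Hypothetical example in the same redis:// form as REDIS_URL above;
# the ":secret@" password part and the "/0" database number are optional.
url = urlparse("redis://:secret@127.0.0.1:6379/0")

print(url.hostname)           # host the scheduler connects to
print(url.port)               # redis port
print(url.password)           # optional password
print(url.path.lstrip("/"))   # optional database number
```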
    

    1.2 Scrapy_splash settings

    SPLASH_URL = 'http://127.0.0.1:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' 
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    

    1.3 Analysis

    • scrapy-redis sets DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter", which conflicts with scrapy-splash's DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'!
    • Reading the source of scrapy_splash.SplashAwareDupeFilter shows that it inherits from scrapy.dupefilter.RFPDupeFilter and overrides the request_fingerprint() method.
    • Comparing the request_fingerprint() methods of scrapy.dupefilter.RFPDupeFilter and scrapy_redis.dupefilter.RFPDupeFilter shows that they are identical.
    • Therefore we write a new SplashAwareDupeFilter that inherits from scrapy_redis.dupefilter.RFPDupeFilter instead, leaving the rest of the code unchanged.
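The effect of the splash-aware fingerprint can be illustrated with a stdlib-only sketch (hashlib and json stand in here for scrapy's request_fingerprint and scrapy_splash's dict_hash; the helper names below are my own, not the libraries'): compute a plain fingerprint first, then fold any splash options into it, so two requests for the same URL no longer collide once one of them is splash-rendered.

```python
import hashlib
import json

def base_fingerprint(url, method="GET"):
    # stand-in for scrapy.utils.request.request_fingerprint
    return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

def splash_aware_fingerprint(url, splash_meta=None, method="GET"):
    # stand-in for splash_request_fingerprint: start from the plain
    # fingerprint and mix in the splash options when present
    fp = base_fingerprint(url, method)
    if not splash_meta:
        return fp
    # sorted-keys json gives a stable serialization, like dict_hash
    options = json.dumps(splash_meta, sort_keys=True)
    return hashlib.sha1((fp + options).encode()).hexdigest()

plain = splash_aware_fingerprint("https://www.baidu.com")
rendered = splash_aware_fingerprint(
    "https://www.baidu.com",
    {"endpoint": "render.html", "args": {"wait": 10}},
)
assert plain != rendered  # same URL, distinct fingerprints once splash is used
```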

    2. Implementation

    Rewrite the dupefilter dedup class and reference it in settings.py.

    1. The rewritten dedup class
    from __future__ import absolute_import
    from copy import deepcopy
    from scrapy.utils.request import request_fingerprint
    from scrapy.utils.url import canonicalize_url
    from scrapy_splash.utils import dict_hash
    from scrapy_redis.dupefilter import RFPDupeFilter
    
    
    def splash_request_fingerprint(request, include_headers=None):
        """ Request fingerprint which takes 'splash' meta key into account """
        # generate the plain scrapy fingerprint
        fp = request_fingerprint(request, include_headers=include_headers)
    
        # requests that do not use splash rendering keep the plain fingerprint
        if 'splash' not in request.meta:
            return fp
    
        splash_options = deepcopy(request.meta['splash'])
        args = splash_options.setdefault('args', {})
    
        if 'url' in args:
            args['url'] = canonicalize_url(args['url'], keep_fragments=True)
    
        return dict_hash(splash_options, fp)
    
    # Implements scrapy_redis fingerprint dedup
    # while staying aware of scrapy_splash rendering options
    class SplashAwareDupeFilter(RFPDupeFilter):
        """
        DupeFilter that takes 'splash' meta key in account.
        It should be used with SplashMiddleware.
        """
        def request_fingerprint(self, request):
            return splash_request_fingerprint(request)
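Inside splash_request_fingerprint, canonicalize_url(args['url'], keep_fragments=True) normalizes the URL before hashing so that trivially different spellings of the same request hash identically. A rough stdlib imitation of one part of that normalization, query-parameter sorting (sort_query is a hypothetical helper; the real scrapy.utils.url.canonicalize_url does more, e.g. percent-encoding normalization):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def sort_query(url):
    # toy canonicalization: sort the query parameters only
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

a = sort_query("https://example.com/s?wd=splash&page=2")
b = sort_query("https://example.com/s?page=2&wd=splash")
assert a == b  # both spellings canonicalize to the same string
```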
    
    2. The spider
    from scrapy_redis.spiders import RedisSpider
    from scrapy_splash import SplashRequest
    
    # Goal: scrapy_redis distributed (resumable) crawling + scrapy_splash rendering service
    class SplashAndRedisSpider(RedisSpider):
        name = 'splash_and_redis'
        allowed_domains = ['baidu.com']
    
        # start_urls = ['https://www.baidu.com/s?wd=13161933309']
        redis_key = 'splash_and_redis'
        # lpush splash_and_redis 'https://www.baidu.com'
    
        # The distributed start urls (pushed to redis) cannot use the splash service!
        # That is why the dupefilter dedup class must be rewritten!
    
        def parse(self, response):
            yield SplashRequest('https://www.baidu.com/s?wd=13161933309',
                                callback=self.parse_splash,
                                args={'wait': 10}, # wait time for rendering, in seconds
                                endpoint='render.html') # fixed endpoint name for the splash render service
    
        def parse_splash(self, response):
            with open('splash_and_redis.html', 'w') as f:
                f.write(response.body.decode())
    
    3. Settings
    # url of the splash render service
    SPLASH_URL = 'http://127.0.0.1:8050'
    # downloader middlewares
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    # splash-aware http cache storage
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    
    # dedup filter
    # DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    # DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # fingerprint generation and dedup class
    DUPEFILTER_CLASS = 'test_splash.spiders.splash_and_redis.SplashAwareDupeFilter' # import path of the combined dedup class
    
    # scrapy_redis resumable crawling
    SCHEDULER = "scrapy_redis.scheduler.Scheduler" # scheduler class
    SCHEDULER_PERSIST = True # persist the request queue and fingerprint set; when mixing scrapy_redis and scrapy_splash, use the splash-aware DupeFilter!
    ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400} # pipeline that stores items in redis
    REDIS_URL = "redis://127.0.0.1:6379" # redis url
    

    Note

    • The rewritten dupefilter class can live anywhere in the project, as long as DUPEFILTER_CLASS in settings.py points to its import path.

    Original link: https://www.haomeiwen.com/subject/bmvbzltx.html