Customizing HttpCacheMiddleware in Scrapy

Author: 佑岷 | Published 2019-02-21 15:13

Scrapy ships with built-in caching of request data. It provides two storage backends, DbmCacheStorage and FilesystemCacheStorage, and two cache policies, DummyPolicy and RFC2616Policy. The defaults are FilesystemCacheStorage (filesystem storage) with DummyPolicy, which caches every response that is fetched.
To enable the cache, add the following settings:

HTTPCACHE_ENABLED = True  # turn the cache on
HTTPCACHE_EXPIRATION_SECS = 0  # expiration time in seconds; 0 means cached responses never expire
HTTPCACHE_DIR = '/data/ajk/httpcache'  # cache directory
HTTPCACHE_IGNORE_HTTP_CODES = []  # HTTP response codes that should not be cached
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'  # storage backend
HTTPCACHE_GZIP = False  # gzip-compress cached data
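
The policy and storage backend mentioned above are also swappable via settings. If you want standards-based caching (honoring Cache-Control/Expires headers) instead of caching everything, or the DBM backend instead of the filesystem, these are the stock setting values:

HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'  # instead of the default DummyPolicy
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'  # DBM storage instead of the filesystem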

The cache itself is implemented in scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware. During scraping, when the target site redirects to a captcha page, the request is given up after 3 retries and the final response, i.e. the captcha redirect, gets cached. From then on, even after switching to a new IP, re-fetching the page returns the cached captcha redirect rather than real data. Because these bad entries are only a tiny fraction of the cache, deleting the whole cache and re-fetching all the data that was already scraped successfully is not an option; instead, HttpCacheMiddleware needs to be overridden.

After reading the source, the plan is: whenever a response is read from the cache, check whether it is a redirect, then check whether the redirect URL is the useless captcha URL; for such invalid entries, delete the cache files on disk and return nothing, so the request is treated as a cache miss. The code:

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import os
import shutil

from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.request import request_fingerprint


class AjkCacheMiddleware(HttpCacheMiddleware):

    def process_request(self, request, spider):
        if request.meta.get('dont_cache', False):
            return

        # Skip uncacheable requests
        if not self.policy.should_cache_request(request):
            request.meta['_dont_cache'] = True  # flag as uncacheable
            return

        # Look for a cached response and check whether it has expired
        cachedresponse = self.storage.retrieve_response(spider, request)

        if cachedresponse is None:
            self.stats.inc_value('httpcache/miss', spider=spider)
            if self.ignore_missing:
                self.stats.inc_value('httpcache/ignore', spider=spider)
                raise IgnoreRequest("Ignored request not in cache: %s" % request)
            return  # first-time request

        # The cached response is a 302 redirect to the captcha-verify page:
        # delete its cache directory on disk and return None, so the request
        # is treated as a cache miss and gets downloaded again.
        if cachedresponse.status == 302 and cachedresponse.url.find('captcha-verify/') > -1:
            cachepath = self._get_request_path(spider, request)
            shutil.rmtree(cachepath)
            return

        # Return the cached response only if not expired
        cachedresponse.flags.append('cached')
        if self.policy.is_cached_response_fresh(cachedresponse, request):
            self.stats.inc_value('httpcache/hit', spider=spider)
            return cachedresponse

        # Keep a reference to the cached response to avoid a second cache
        # lookup on the process_response hook
        request.meta['cached_response'] = cachedresponse

    def _get_request_path(self, spider, request):
        # Mirror FilesystemCacheStorage's on-disk layout:
        # <HTTPCACHE_DIR>/<spider.name>/<fingerprint[:2]>/<fingerprint>
        key = request_fingerprint(request)
        return os.path.join(spider.settings['HTTPCACHE_DIR'], spider.name, key[0:2], key)
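
To put the subclass into effect, disable the built-in cache middleware and register the custom one at the same priority (900, its position in DOWNLOADER_MIDDLEWARES_BASE). A minimal sketch, assuming the class above lives in a hypothetical module ajk.middlewares:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': None,  # turn off the stock middleware
    'ajk.middlewares.AjkCacheMiddleware': 900,  # plug in the subclass at the same priority
}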

To improve efficiency, responses served from the cache should not be parsed into Items and written to the database a second time. Thanks to the 'cached' flag appended above, the callback can check where the response came from before generating an item, e.g.:

if 'cached' in response.flags: return
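
In context, a spider callback would bail out before building the item; a sketch, with placeholder selectors not taken from the original post:

def parse_detail(self, response):
    # Responses replayed from the cache were parsed and stored on the
    # first pass, so skip them instead of re-inserting duplicates.
    if 'cached' in response.flags:
        return
    item = {
        'url': response.url,
        'title': response.css('title::text').get(),  # placeholder field
    }
    yield item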
