
crawlspider-zhihu Summary

Author: gogoforit | Published 2017-03-08 00:13

    1) Resolving 500, 423, and 403 errors
    Setting request headers in settings.py clears up the 500 errors.
    Throttling the download rate clears up the 423 (Locked) errors.
    A 403 error after switching to a proxy-IP middleware usually means that IP has already been banned by the site. A sketch of the relevant settings follows.
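
    A minimal settings.py sketch for the first two fixes; the header value and delay here are illustrative assumptions, not the original project's config:

        # settings.py -- illustrative values, adjust for your own crawl
        DEFAULT_REQUEST_HEADERS = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # placeholder UA
        }
        DOWNLOAD_DELAY = 2            # throttle requests to avoid 423 (Locked) responses
        AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay on top of the base delay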
    2) allowed_domains matters: it defines the range of URLs the spider is allowed to visit. Requests created with dont_filter=True are exempt from this restriction, as in the sketch below.
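
    A quick sketch (the URL and callback are placeholders): a request outside allowed_domains is normally dropped by the offsite middleware, but goes through with dont_filter=True:

        # inside a spider with allowed_domains = ['zhihu.com']
        yield scrapy.Request('https://example.com/page',
                             callback=self.parse,
                             dont_filter=True)  # skips both the dupefilter and the offsite check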
    3) Exception handling

        try:
            pass  # the request/parsing code that may raise goes here
        except Exception as e:
            print(e)
    
    

    4) response.status and response.url are handy for checking how a request actually turned out.
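
    For instance, logging both in a callback (a sketch, not code from the original spider):

        def parse(self, response):
            # log the HTTP status and the final URL (after any redirects)
            self.logger.info('%s %s', response.status, response.url)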
    5) Handling failures caused by bad IPs. (I didn't fully understand the mechanism at the time; in short, Scrapy calls each downloader middleware's process_exception when a download raises an exception, and the subclass below adds TunnelError, raised when a proxy tunnel cannot be established, to the exceptions that get retried.)

        from scrapy.core.downloader.handlers.http11 import TunnelError
        from scrapy.downloadermiddlewares.retry import RetryMiddleware as ScrapyRetryMiddleware

        class RetryMiddleware(ScrapyRetryMiddleware):
            def process_exception(self, request, exception, spider):
                # retry TunnelError (failed proxy tunnel) on top of the
                # exceptions the built-in middleware already retries
                if (isinstance(exception, self.EXCEPTIONS_TO_RETRY)
                        or isinstance(exception, TunnelError)) \
                        and 'dont_retry' not in request.meta:
                    return self._retry(request, exception, spider)
    
    Configure settings.py as follows:
        DOWNLOADER_MIDDLEWARES = {
            # 'zhihu_basic.middlewares.UAMiddleware': 543,
            'zhihu_basic.middlewares.RetryMiddleware': 200,
            # disable the built-in retry middleware so the custom one replaces it
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        }
    
    

    6) Headers and cookies set in settings.py can be attached to requests, which is enough to access pages behind the login:

        def make_requests_from_url(self, url):
            # attach the headers and cookies defined in settings.py
            # (ZHIHU_HEADER / ZHIHU_COOKIE) to every start-URL request
            return scrapy.Request(url, method='GET',
                                  headers=self.settings['ZHIHU_HEADER'],
                                  cookies=self.settings['ZHIHU_COOKIE'])
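
    The corresponding settings.py entries might look like this (placeholder values; the real cookie came from a logged-in Zhihu session):

        # settings.py -- placeholder values, not real credentials
        ZHIHU_HEADER = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            'Referer': 'https://www.zhihu.com/',
        }
        ZHIHU_COOKIE = {
            'z_c0': '...',  # session token copied from the browser
        }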
       
    
