The Odds and Ends of Using Proxies with Scrapy

By ddm2014 | Published 2018-09-03 14:21

In the earlier post on the Scrapy framework and its middlewares I went over how data flows through the middleware chain. Crawling through proxies touches most of those middleware details, so this is a good chance to pull them together.
My approach to writing a middleware is to first find the relevant built-in one and then override its process_request / process_response / process_exception methods as needed.
Scrapy's built-in downloader middlewares should already cover most needs. The documentation lists them all together with their order values (lower numbers sit closer to the engine: process_request runs from low priority to high, process_response and process_exception from high to low), and it also states which settings each middleware requires in order to be enabled. A bare-bones skeleton of the three hooks is sketched after the two lists below.

{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,  # robots.txt protocol
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,  # HTTP authentication
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,  # compression, e.g. Accept-Encoding: gzip, deflate
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,  # 301/302 redirects
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,  # proxies
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,  # low-level HTTP cache support
}

And the built-in spider middlewares:
{
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,  # filters out unsuccessful (non-2xx) responses
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,  # filters out requests to URLs outside allowed_domains
'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,  # fills in the Referer header of requests generated from a response
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,  # filters out requests with overly long URLs
'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,  # limits crawl depth
}
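
For orientation, a downloader middleware is just a class implementing some of these three hooks. A minimal skeleton (the class name and bodies here are illustrative only, not part of this project):

class SketchDownloaderMiddleware:
    def process_request(self, request, spider):
        # return None to keep processing the request,
        # or a Response/Request to short-circuit the rest of the chain
        return None

    def process_response(self, request, response, spider):
        # must return a Response, return a Request to reschedule,
        # or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # return None to let other middlewares handle the exception,
        # or a Response/Request to recover from it
        return None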

This time the goal is to crawl the Baidu homepage through proxies. The basic plan:
1. Load an IP list (IPOOL) in settings, and set DOWNLOAD_TIMEOUT = 3 (the default is 180 seconds; three minutes is far too long).
2. Subclass HttpProxyMiddleware so that every request uses the first proxy currently in the settings.
3. Subclass RetryMiddleware: on a timeout or similar error (override process_exception), or when the IP is banned and we get a 503 or the like (override process_response), drop that IP, write the trimmed list back into the settings, and close the spider once the list is empty.

middleware:

import time
import random
import logging

from scrapy.utils.project import get_project_settings
from scrapy.utils.response import response_status_message
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

logger = logging.getLogger(__name__)


class MyProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        # get_project_settings() re-reads the project settings; because the
        # settings module is only imported once, 'IPOOL' is the same list
        # object on every call, so deletions made in MyRetryMiddleware are
        # visible here too.
        settings = get_project_settings()
        proxies = settings.get('IPOOL')
        if not proxies:
            # pool exhausted; send the request without a proxy
            # (MyRetryMiddleware is about to close the spider anyway)
            return
        logger.debug('now ip is ' + proxies[0])
        request.meta['proxy'] = proxies[0]

class MyRetryMiddleware(RetryMiddleware):
    def delete_proxy(self, spider):
        # drop the proxy at the head of the pool; once the pool is empty,
        # stop the whole crawl
        settings = get_project_settings()
        proxies = settings.get('IPOOL')
        if proxies:
            proxies.pop(0)
            settings.set('IPOOL', proxies)
        else:
            spider.crawler.engine.close_spider(spider, 'response msg error , job done!')

    def process_exception(self, request, exception, spider):
        # timeouts, connection errors and the like: discard the current proxy and retry
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            self.delete_proxy(spider)
            time.sleep(random.randint(3, 5))
            return self._retry(request, exception, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status == 200:
            # success: still discard the proxy, so each IP is only used once
            self.delete_proxy(spider)
            return response
        if response.status in self.retry_http_codes:
            # 503 and friends usually mean the IP is banned: discard it and retry
            reason = response_status_message(response.status)
            self.delete_proxy(spider)
            time.sleep(random.randint(3, 5))
            return self._retry(request, reason, spider) or response
        return response

settings:

import pandas as pd

# IPOOL: the usable proxy addresses, read from a CSV of checked proxies
df = pd.read_csv('F:\\pycharm project\\pachong\\vpn.csv')
IPOOL = df['address'][df['status'] == 'yes'].tolist()

DOWNLOADER_MIDDLEWARES = {
    # 'mytset.middlewares.MytsetDownloaderMiddleware': 543,
    'mytset.middlewares.MyRetryMiddleware': 550,
    'mytset.middlewares.MyProxyMiddleware': 750,
    # disable the built-ins that the two subclasses above replace
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}
DOWNLOAD_TIMEOUT = 3  # default is 180 s; fail fast on dead proxies
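
For reference, vpn.csv above is assumed to look roughly like this (hypothetical contents: 'address' is taken to hold full proxy URLs so they can go straight into request.meta['proxy'], and 'status' marks the entries that passed a liveness check):

address,status
http://123.45.67.89:8080,yes
http://98.76.54.32:3128,no
http://11.22.33.44:8888,yes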

spider:

import scrapy
from pyquery import PyQuery as pq


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']

    def start_requests(self):
        # issue the same request 30 times; dont_filter=True keeps the
        # duplicate filter from collapsing them into one
        for _ in range(30):
            yield scrapy.Request(url='http://www.baidu.com/', callback=self.parse, dont_filter=True)

    def parse(self, response):
        res = pq(response.body)
        # which proxy actually served this response
        proxy = response.meta.get('proxy')
        print(proxy)
        print(res('title').text())
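
Assuming the Scrapy project is named mytset (which is what the middleware paths in settings imply), the crawl is launched as usual:

scrapy crawl baidu

Each request then goes out through the proxy at the head of IPOOL, and parse prints the proxy that served the response together with the page title.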
