Keeping the sites we crawl from banning our Scrapy program

Author: 阿汤8阿义 | Published 2016-09-23 17:10

The approach is to write our own browser impersonation; the site can hardly ban everyone's browser. Besides looking like a browser, a few supporting measures are needed: cookie handling, IP proxies, and a delay between requests.

At first the word "browser" sounds intimidating: a real browser is a huge piece of software, so how much code would that take? In fact nothing close to a full, well-behaved browser is needed. We only have to send the handful of things that make the target site believe a browser is visiting, and that handful lives in the request headers: as long as we pass the expected header values, the deception works.
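The idea is easy to see on a single request: attach a plausible User-Agent (and any other headers you need) and the response is served as if to a browser. A minimal sketch inside any spider; the URL is a placeholder and the User-Agent string is taken from the list later in this post:

    import scrapy


    class DemoSpider(scrapy.Spider):
        name = 'demo'

        def start_requests(self):
            # Attach browser-like headers to a single request by hand.
            yield scrapy.Request(
                'http://example.com/',
                headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                       'AppleWebKit/535.11 (KHTML, like Gecko) '
                                       'Chrome/17.0.963.56 Safari/535.11'},
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info('fetched %s', response.url)

Doing this by hand for every request is tedious, which is why the post moves the logic into downloader middlewares below.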


The concrete implementation looks like this:

First, create a .py file for the middlewares (the settings in step two register it as cnblogs/middlewares.py):

    # encoding: utf-8
    import random
    import base64

    from settings import PROXIES  # PROXIES is defined in settings.py (see step two)


    class RandomUserAgent(object):
        """Pick a random User-Agent from USER_AGENTS for every request."""

        def __init__(self, agents):
            self.agents = agents

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.getlist('USER_AGENTS'))

        def process_request(self, request, spider):
            agent = random.choice(self.agents)
            print "**************************" + agent
            request.headers.setdefault('User-Agent', agent)


    class ProxyMiddleware(object):
        """Route every request through a random proxy from PROXIES."""

        def process_request(self, request, spider):
            proxy = random.choice(PROXIES)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            if proxy['user_pass']:
                # Basic proxy auth: base64-encode "user:password";
                # .strip() removes the trailing newline encodestring adds.
                encoded_user_pass = base64.encodestring(proxy['user_pass']).strip()
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
                print "request headers:", request.headers
            else:
                # Empty or missing user_pass means an open proxy, no auth header.
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']

That is all the code for the middleware file.
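Note that the print statements and base64.encodestring above are Python 2 only. On Python 3, a roughly equivalent ProxyMiddleware could look like the sketch below (same assumed PROXIES format as above; the PROXIES import stays as in the original file):

    import base64
    import random

    from settings import PROXIES  # same import as in the original middleware file


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            proxy = random.choice(PROXIES)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            if proxy['user_pass']:
                # Basic proxy auth: base64-encode the bytes of "user:password",
                # then decode back to str for the header value.
                creds = base64.b64encode(proxy['user_pass'].encode('utf-8')).decode('ascii')
                request.headers['Proxy-Authorization'] = 'Basic ' + creds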

Step two: enable or add the necessary settings in settings.py.

Add USER_AGENTS:

    USER_AGENTS =[

    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",

    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",

    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",

    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",

    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",

    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",

    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",

    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",

    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",

    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",

    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",

    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",

    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",

    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",

    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",

    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",

    ]
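To confirm that RandomUserAgent really rotates through this list, a quick throwaway spider against httpbin.org (which echoes back the headers it receives) works well once the DOWNLOADER_MIDDLEWARES setting below is in place. This test spider is my own addition, not part of the original project:

    import scrapy


    class UACheckSpider(scrapy.Spider):
        name = 'ua_check'

        def start_requests(self):
            # dont_filter lets the same URL be fetched several times, so a few
            # different User-Agent values should show up in the output.
            for _ in range(5):
                yield scrapy.Request('https://httpbin.org/headers', dont_filter=True)

        def parse(self, response):
            # response.text needs Scrapy >= 1.1; older versions can use
            # response.body_as_unicode() instead.
            self.logger.info(response.text)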

Add the proxy IP setting PROXIES:

    PROXIES =[

    {'ip_port':'111.11.228.75:80','user_pass':''},

    {'ip_port':'120.198.243.22:80','user_pass':''},

    {'ip_port':'111.8.60.9:8123','user_pass':''},

    {'ip_port':'101.71.27.120:80','user_pass':''},

    {'ip_port':'122.96.59.104:80','user_pass':''},

    {'ip_port':'122.224.249.122:8088','user_pass':''},

    ]
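The user_pass field is what ProxyMiddleware turns into the Proxy-Authorization header, so for an authenticated proxy the expected format is "username:password". The authenticated entry below uses made-up placeholder values:

    PROXIES = [
        # open proxy, no credentials
        {'ip_port': '111.11.228.75:80', 'user_pass': ''},
        # authenticated proxy (placeholder values): ProxyMiddleware base64-encodes
        # "user:password" into the Proxy-Authorization header
        {'ip_port': '10.0.0.1:3128', 'user_pass': 'myuser:mypassword'},
    ]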

Disable cookies:

    COOKIES_ENABLED=False

Set a download delay (in seconds):

    DOWNLOAD_DELAY=3
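DOWNLOAD_DELAY = 3 waits roughly three seconds between requests to the same site; by default Scrapy also randomizes the actual wait to 0.5-1.5 times that value (RANDOMIZE_DOWNLOAD_DELAY). If you would rather let Scrapy adapt the delay to the server's response times, the AutoThrottle extension is an option; the values below are just examples:

    # Optional alternative: let AutoThrottle adjust the delay from observed latency.
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 3
    AUTOTHROTTLE_MAX_DELAY = 10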

Add DOWNLOADER_MIDDLEWARES to register the two middlewares:

    DOWNLOADER_MIDDLEWARES = {
        'cnblogs.middlewares.RandomUserAgent': 1,    # random User-Agent
        # 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
        'cnblogs.middlewares.ProxyMiddleware': 100,  # needed for the proxy
    }

Add DEFAULT_REQUEST_HEADERS:

    DEFAULT_REQUEST_HEADERS= {

    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

    'Accept-Language':'en',

    }

That's it; at this point the crawler can be run.

Free proxies: http://www.xicidaili.com/

Project code: https://github.com/tangyi1234/cnblogs
