
Python Web Scraping -- Day 08

Author: 陈small末 | Published 2019-01-10 08:57

    Custom middleware

    process_request(self, request, spider)
    Called for each request as it passes through the downloader middleware.

    process_response(self, request, response, spider)
    Called when the downloader has finished the HTTP request and is passing the response back to the engine.
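    For a quick illustration, here is a minimal downloader-middleware skeleton (the class name is our own, not part of the original project). A middleware only needs to define the hooks it cares about, and each hook's return value tells Scrapy what to do next:

    class ExampleDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # Returning None lets the request continue through the
            # remaining middlewares and on to the downloader
            return None

        def process_response(self, request, response, spider):
            # Must return a Response (or a new Request) for the engine
            return response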
    
    Edit settings.py to configure USER_AGENTS and PROXIES
    # Add USER_AGENTS:
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
    ]
    
    # Add the proxy list PROXIES:
    # Free proxy IPs can be found online (the free ones tend to be unstable), or you can buy a batch of reliable private proxies:
    PROXIES = [
        {'ip_port': '111.8.60.9:8123'},
        {'ip_port': '101.71.27.120:80'},
        {'ip_port': '122.96.59.104:80'},
        {'ip_port': '122.224.249.122:8088'},
    ]
    
    Create the middleware classes
    # -*- coding: utf-8 -*-
    import random
    # Import the lists added to the project's settings.py
    from mySpider.settings import USER_AGENTS, PROXIES

    # Pick a random User-Agent for each request
    class RandomUserAgent(object):
        def process_request(self, request, spider):
            useragent = random.choice(USER_AGENTS)
            request.headers.setdefault("User-Agent", useragent)

    # Pick a random proxy IP for each request
    class RandomProxy(object):
        def process_request(self, request, spider):
            proxy = random.choice(PROXIES)
            request.meta['proxy'] = "http://" + proxy['ip_port']
    
    Configure the middleware
    # Finally, register the custom downloader middleware classes in DOWNLOADER_MIDDLEWARES in settings.py
    DOWNLOADER_MIDDLEWARES = {
        #'mySpider.middlewares.MyCustomDownloaderMiddleware': 543,
        'mySpider.middlewares.RandomUserAgent': 81,
        'mySpider.middlewares.RandomProxy': 100
    }
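    The numbers are priorities: downloader middlewares run their process_request hooks in ascending order of these values, so RandomUserAgent (81) fires before RandomProxy (100). If Scrapy's built-in user-agent middleware ever interferes with the randomized header, it can also be switched off (an optional extra, not part of the original configuration):

    DOWNLOADER_MIDDLEWARES = {
        'mySpider.middlewares.RandomUserAgent': 81,
        'mySpider.middlewares.RandomProxy': 100,
        # Disable Scrapy's built-in UserAgentMiddleware
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }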
    

    Simulated login with Scrapy

    Note: when simulating a login, make sure that COOKIES_ENABLED (the cookies middleware) is enabled in settings.py

    COOKIES_ENABLED = True   # i.e. not the commented-out default "# COOKIES_ENABLED = False"

    Part 1: Log in by sending a POST request

    # POST request
    # Baidu Translate:
    url = "http://fanyi.baidu.com/sug"
    POST data: {'kw': 'wolf'}
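    As a concrete sketch of this request, a spider for the endpoint could look like the following (the spider name and the JSON handling are our own illustrative additions):

    import json
    import scrapy

    class FanyiSpider(scrapy.Spider):
        name = "fanyi"

        def start_requests(self):
            # FormRequest URL-encodes the data and sends it as a POST body
            yield scrapy.FormRequest(
                url="http://fanyi.baidu.com/sug",
                formdata={"kw": "wolf"},
                callback=self.parse_sug
            )

        def parse_sug(self, response):
            # The sug endpoint responds with JSON suggestions
            print(json.loads(response.text))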
    
    # This approach works for anything that requires POST data; in the example below, the POSTed data is an account and password.
    # Use yield scrapy.FormRequest(url, formdata, callback) to send the POST request.
    # To send the POST request as soon as the program starts, override the Spider class's start_requests(self) method and skip the URLs in start_urls.
    
    import scrapy

    class mySpider(scrapy.Spider):
        name = "renren_login"  # a spider must have a name (added so the example runs)
        # start_urls = ["http://www.example.com/"]

        def start_requests(self):
            url = 'http://www.renren.com/PLogin.do'
            # FormRequest is how Scrapy sends POST requests
            yield scrapy.FormRequest(
                url = url,
                formdata = {"email" : "18588403840", "password" : "Changeme_123"},
                callback = self.parse_page
            )

        def parse_page(self, response):
            print(response.text)
            # do something
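    Running this with scrapy crawl renren_login (the name added above) submits the form at startup and hands the logged-in response to parse_page; the account and password shown are only the original placeholders.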
    

    Part 2: Log in directly with a cookie that preserves the login session

    • If all else fails, this approach can simulate a login. It is a little more tedious, but it essentially always works as long as the cookie is still valid
    # -*- coding: utf-8 -*-
    import scrapy
    
    class RenrenSpider(scrapy.Spider):
        name = "renren"
        allowed_domains = ["renren.com"]
        start_urls = [
            'http://www.renren.com/111111',
            'http://www.renren.com/222222',
            'http://www.renren.com/333333',
        ]
    
        cookies = {
            "anonymid" : "ixrna3fysufnwv",
            "_r01_" : "1",
            "ap" : "327550029",
            "JSESSIONID" : "abciwg61A_RvtaRS3GjOv",
            "depovince" : "GW",
            "springskin" : "set",
            "jebe_key" : "f6fb270b-d06d-42e6-8b53-e67c3156aa7e%7Cc13c37f53bca9e1e7132d4b58ce00fa3%7C1484060607478%7C1%7C1486198628950",
            "t" : "691808127750a83d33704a565d8340ae9",
            "societyguester" : "691808127750a83d33704a565d8340ae9",
            "id" : "327550029",
            "xnsid" : "f42b25cf",
            "loginfrom" : "syshome"
        }
    
        # Override the Spider class's start_requests method to attach the saved cookie to every initial request
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, cookies = self.cookies, callback = self.parse)
    
        # Handle the response content
        def parse(self, response):
            print("===========" + response.url)
            # response.body is bytes, so write in binary mode
            with open("deng.html", "wb") as filename:
                filename.write(response.body)
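    The cookie values above are tied to one particular session: copy them out of the browser's developer tools after logging in manually, and expect to refresh them once the session expires.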
                
    

    Part 3: Log in with the selenium plugin

    Simulated login to Zhihu
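    The original post stops at this heading, so here is only a minimal sketch of the idea, written against the Selenium 3-era API; the login URL and element selectors are hypothetical placeholders, not Zhihu's real page structure:

    # -*- coding: utf-8 -*-
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://www.zhihu.com/signin")  # assumed login page

    # Fill in and submit the login form (selectors are illustrative only)
    driver.find_element_by_name("username").send_keys("your_account")
    driver.find_element_by_name("password").send_keys("your_password")
    driver.find_element_by_css_selector("button[type=submit]").click()

    # Export the logged-in cookies so they can be reused in Scrapy,
    # e.g. with the cookie-based spider from Part 2
    cookies = {c['name']: c['value'] for c in driver.get_cookies()}
    print(cookies)
    driver.quit()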
