Randomizing part of the request headers in Scrapy (crawler notes to self, part 2)

Author: wfishj | Published: 2017-10-20 11:55

    Preface

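    Goal: with Scrapy, send a random User-Agent (UA) header on each request while keeping the other header fields fixed.
    Approach: in Scrapy, a value set on an individual request takes precedence over a project-wide setting. So pick a random UA in a downloader middleware and put the fixed fields in DEFAULT_REQUEST_HEADERS in settings. The UA then varies per request and the remaining headers stay the same. This only works if the middleware assigns the header directly: Scrapy applies DEFAULT_REQUEST_HEADERS with setdefault, and a second setdefault in the middleware would not replace a UA that is already present.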
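The precedence described above can be illustrated without Scrapy. The following is only a minimal sketch using a plain dict: `DEFAULT_HEADERS`, `UA_LIST`, `apply_defaults` and `randomize_ua` are hypothetical names made up for this illustration, not part of Scrapy. It shows why a random UA written with `setdefault` is ignored once the defaults are in place, while a direct assignment wins.

```python
import random

# Hypothetical stand-ins for illustration only; these are not Scrapy objects.
DEFAULT_HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "fixed-default-agent",
}
UA_LIST = ["agent-a", "agent-b", "agent-c"]


def apply_defaults(headers):
    """Fill in defaults only where the request has no value yet."""
    for name, value in DEFAULT_HEADERS.items():
        headers.setdefault(name, value)
    return headers


def randomize_ua(headers, use_setdefault=False):
    """Set a random User-Agent, either softly (setdefault) or by assignment."""
    ua = random.choice(UA_LIST)
    if use_setdefault:
        headers.setdefault("User-Agent", ua)  # no effect if a UA already exists
    else:
        headers["User-Agent"] = ua  # always replaces the existing value
    return headers


soft = randomize_ua(apply_defaults({}), use_setdefault=True)
hard = randomize_ua(apply_defaults({}))
print(soft["User-Agent"])             # the fixed default survives
print(hard["User-Agent"] in UA_LIST)  # the random UA replaced it
```

In both cases the other default fields (here `Accept-Language`) are untouched, which is the "fixed part" of the header.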

    Setting a random UA in the middleware
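    import random

    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


    class AgentMiddleware(UserAgentMiddleware):
        """Downloader middleware that picks a random User-Agent for each request."""

        def __init__(self, user_agent=""):
            self.user_agent = user_agent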
    
            self.ua_list = ["Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",]
    
    
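        def process_request(self, request, spider):
            # Assign the header directly. setdefault() would keep any User-Agent
            # that is already set, e.g. the fixed one from DEFAULT_REQUEST_HEADERS
            # when this middleware runs after DefaultHeadersMiddleware.
            request.headers['User-Agent'] = random.choice(self.ua_list)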
    
    Setting the fixed part in settings
    DEFAULT_REQUEST_HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Host": "baidu.cn",
        "Referer": "http://ris.szpl.gov.cn/bol/projectdetail.aspx",
        # Fallback value; the random-UA middleware is meant to override it.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
        "Origin": "http://baidu.com",
        "Upgrade-Insecure-Requests": "1",
        "Content-Type": "application/x-www-form-urlencoded",
    }
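For the random UA to take effect, the middleware also has to be enabled in settings, which the notes above do not show. A possible sketch follows. The module path `myproject.middlewares` and the priority `543` are assumptions for illustration; adjust them to the actual project layout.

```python
# settings.py (sketch). "myproject.middlewares" is a hypothetical module path.
DOWNLOADER_MIDDLEWARES = {
    # Optional: disable the built-in middleware so only the random one
    # handles the User-Agent header.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # 543 is an arbitrary example priority. Scrapy's DefaultHeadersMiddleware
    # sits at 400 by default, so this middleware runs after the defaults
    # have been filled in.
    "myproject.middlewares.AgentMiddleware": 543,
}
```

Setting a built-in middleware to `None` is Scrapy's documented way to switch it off. Whether to do so here is a design choice, since the built-in one only fills in a UA when none is set.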
    
