Scrapy: Settings for Each Component --- A Collection

By 阪本先生_ | Published 2019-08-19 14:48

    scrapy startproject xxx          # create a project
    scrapy crawl xxxx -o xx.csv      # run a spider and save the results locally as CSV
    scrapy genspider xxx xxx.com     # create a spider file (second argument is the site domain)

    How to set request headers

    I. Setting a single User-Agent manually

    1. The first method: a single hard-coded User-Agent. This works outside any framework, for small and simple scrapes with plain requests.

    import requests

    # headers can carry other request fields as well; they are passed to requests together
    headers = {
        'User-Agent': ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3650.400 QQBrowser/10.4.3341.400")
    }
    response = requests.get(url, headers=headers)
    

    2. The second method, inside the Scrapy framework: change two places.
    settings.py

    USER_AGENT_LIST = [
        'zspider/0.9-dev http://feedback.redkolibri.com/',
        'Xaldon_WebSpider/2.0.b1',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
        'Mozilla/5.0 (compatible; Speedy Spider; http://www.entireweb.com/about/search_tech/speedy_spider/)',
        'Speedy Spider (Entireweb; Beta/1.3; http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (Entireweb; Beta/1.2; http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (Entireweb; Beta/1.1; http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (Entireweb; Beta/1.0; http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (Beta/1.0; www.entireweb.com)',
        'Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
        'Speedy Spider (http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (http://www.entireweb.com)',
        'Sosospider+(+http://help.soso.com/webspider.htm)',
        'sogou spider',
        'Nusearch Spider (www.nusearch.com)',
        'nuSearch Spider (compatible; MSIE 4.01; Windows NT)',
        'lmspider (lmspider@scansoft.com)',
        'lmspider lmspider@scansoft.com',
        'ldspider (http://code.google.com/p/ldspider/wiki/Robots)',
        'iaskspider/2.0(+http://iask.com/help/help_index.html)',
        'iaskspider',
        'hl_ftien_spider_v1.1',
        'hl_ftien_spider',
        'FyberSpider (+http://www.fybersearch.com/fyberspider.php)',
        'FyberSpider',
        'everyfeed-spider/2.0 (http://www.everyfeed.com)',
        'envolk[ITS]spider/1.6 (+http://www.envolk.com/envolkspider.html)',
        'envolk[ITS]spider/1.6 ( http://www.envolk.com/envolkspider.html)',
        'Baiduspider+(+http://www.baidu.com/search/spider_jp.html)',
        'Baiduspider+(+http://www.baidu.com/search/spider.htm)',
        'BaiDuSpider',
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0) AddSugarSpiderBot www.idealobserver.com",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.17) Gecko/20110123 (like Firefox/3.x) SeaMonkey/2.0.12",
        "Mozilla/5.0 (Windows NT 5.2; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 SeaMonkey/2.7.1",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/532.8 (KHTML, like Gecko) Chrome/4.0.302.2 Safari/532.8",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.464.0 Safari/534.3",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.15 Safari/534.13",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.186 Safari/535.1",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.54 Safari/535.2",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
        "Mozilla/5.0 (Macintosh; U; Mac OS X Mach-O; en-US; rv:2.0a) Gecko/20040614 Firefox/3.0.0 ",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.0.3) Gecko/2008092414 Firefox/3.0.3",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.14) Gecko/20110218 AlexaToolbar/alxf-2.0 Firefox/3.6.14",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
        ]
    
    DOWNLOADER_MIDDLEWARES = {
        'spider_douban.middlewares.RandomUserAgentMiddleware': 400,
        # disable the built-in middleware (on very old Scrapy versions the path is
        # 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware')
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }
    

    Add to middlewares.py:

    from spider_douban.settings import USER_AGENT_LIST
    import random
    
    class RandomUserAgentMiddleware():
        def process_request(self, request, spider):
            # pick a random User-Agent for every outgoing request
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                request.headers.setdefault('User-Agent', ua)
    
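Outside the framework, the middleware's selection logic is just `random.choice` over the list. A minimal standalone sketch (with a shortened stand-in list and a fake request object, both hypothetical) shows that `setdefault` only fills the header when it is not already set:

```python
import random

# shortened stand-in for the USER_AGENT_LIST in settings.py
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/70.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) Safari/605.1.15",
]

class FakeRequest:
    """Mimics the one part of scrapy.Request the middleware touches."""
    def __init__(self):
        self.headers = {}

def process_request(request):
    # same logic as RandomUserAgentMiddleware.process_request
    ua = random.choice(USER_AGENT_LIST)
    if ua:
        request.headers.setdefault("User-Agent", ua)

req = FakeRequest()
process_request(req)
print(req.headers["User-Agent"] in USER_AGENT_LIST)  # True
```

Because of `setdefault`, a spider that sets its own `User-Agent` on a request is not overridden by the middleware.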

    3. The third method: the fake-useragent package, installed via pip.
    To update fake-useragent, run pip install -U fake-useragent.

    import requests
    from fake_useragent import UserAgent

    # target URL
    url = "http://www.baidu.com"

    # request headers
    headers = {"User-Agent": UserAgent().random}
    headers = {"User-Agent": UserAgent(use_cache_server=False).chrome}  # don't fall back to the cache server
    ua = UserAgent(cache=False)        # don't cache the browser data
    ua = UserAgent(verify_ssl=False)   # skip SSL verification when fetching the data

    # send the request
    response = requests.get(url=url, headers=headers)

    # response body
    print(response.text)

    # response status code
    print(response.status_code)

    # response headers
    print(response.headers)
    
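fake-useragent fetches its browser data over the network, so `UserAgent()` can fail when its data server is unreachable. A hedged sketch of a common workaround, falling back to a fixed string (the fallback value and function name are illustrative, not part of the library):

```python
# Fall back to a fixed User-Agent when fake_useragent is not installed
# or cannot fetch its browser data (both happen in practice).
DEFAULT_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/70.0.3538.25 Safari/537.36")

def pick_user_agent():
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:  # ImportError, or fake_useragent's own fetch errors
        return DEFAULT_UA

headers = {"User-Agent": pick_user_agent()}
print(headers["User-Agent"])
```

This keeps a crawler running even when the package's data source is down.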

    II. Storing items in MongoDB

    In pipelines.py:

    import pymongo
    
    class MongoPipeline(object):
        def __init__(self,mongo_uri,mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
            
        @classmethod
        def from_crawler(cls,crawler):
            return cls(
                mongo_uri = crawler.settings.get('MONGO_URI'),
                mongo_db = crawler.settings.get('MONGO_DB')
            )
            
        def open_spider(self,spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
            
        def process_item(self,item,spider):
            # the item's class name is used as the collection name
            name = item.__class__.__name__
            self.db[name].insert_one(dict(item))
            return item
        
        def close_spider(self,spider):
            self.client.close()
    
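The interesting detail in `process_item` is the routing: each item class gets its own collection, named after the class. A minimal sketch with an in-memory stand-in for the database (the `MovieItem` and `FakeDB` names are hypothetical; no MongoDB needed) shows the same routing:

```python
from collections import defaultdict

class MovieItem(dict):
    """Hypothetical item class; Scrapy items behave like dicts here."""

class FakeDB:
    """In-memory stand-in for pymongo's database object."""
    def __init__(self):
        self.collections = defaultdict(list)
    def __getitem__(self, name):
        return self.collections[name]

db = FakeDB()

def process_item(item):
    # same routing as MongoPipeline: the item's class name picks the collection
    name = item.__class__.__name__
    db[name].append(dict(item))   # insert_one(dict(item)) with real pymongo
    return item

process_item(MovieItem(title="Movie A", score=9.0))
print(db["MovieItem"])  # [{'title': 'Movie A', 'score': 9.0}]
```

Returning the item at the end matters: it lets any lower-priority pipelines in `ITEM_PIPELINES` keep processing it.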

    Add to settings.py:

    ITEM_PIPELINES = {
        'douban.pipelines.MongoPipeline': 400,
    }
    MONGO_URI = 'localhost'  # database address
    MONGO_DB = 'douban'      # database name
    

    Original link: https://www.haomeiwen.com/subject/anissctx.html