How to use scrapy-splash and set a proxy IP

Author: sunoath | Source: published 2017-08-23 22:26, read 1,867 times

    First, let's walk through how to use scrapy-splash:

    1. Install it: $ pip install scrapy-splash

    2. Start the Splash Docker container: $ docker run -p 8050:8050 scrapinghub/splash

    3. Configure settings.py:

    3.1. SPLASH_URL = 'http://192.168.59.103:8050'
    
    3.2. DOWNLOADER_MIDDLEWARES = {
             'scrapy_splash.SplashCookiesMiddleware': 723,
             'scrapy_splash.SplashMiddleware': 725,
             'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
         }

    3.3. SPIDER_MIDDLEWARES = {
             'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
         }

    3.4. DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    3.5. HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
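    Collecting steps 3.1 through 3.5, the settings.py additions look like this. Note that 192.168.59.103 is a typical docker-machine VM address; if Docker runs on your local machine, http://localhost:8050 is the address to use:

    ```python
    # settings.py -- scrapy-splash configuration (steps 3.1-3.5 combined)

    # Address of the Splash container started in step 2; use
    # http://localhost:8050 when Docker runs locally rather than in a VM.
    SPLASH_URL = 'http://192.168.59.103:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    # Deduplicate requests on the Splash arguments, not just the URL
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    ```
    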
    
    That completes the scrapy-splash configuration. Next, let's see how to use it.

    Here we'll scrape a JD.com product page as an example:

    spider.py

    from scrapy.spiders import CrawlSpider
    from scrapy_splash import SplashRequest

    class TaoBaoSpider(CrawlSpider):
        name = 'taobao_spider'
        start_urls = ['https://item.jd.com/4736647.html?cpdad=1DLSUE']

        def start_requests(self):
            for url in self.start_urls:
                # Render through Splash, waiting 0.5 s for JavaScript to run
                yield SplashRequest(url=url, callback=self.parse, args={'wait': 0.5})

        def parse(self, response):
            # The price span is filled in by JavaScript, hence the need for Splash
            price = response.xpath('//span[@class="price J-p-4736647"]/text()').extract_first()
            print(price)
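    For context, a SplashRequest like the one above is ultimately sent to Splash's /render.html endpoint as a JSON payload combining the target URL with the args. A minimal sketch of that payload, using the values from the spider (this only builds the JSON body; it does not contact Splash):

    ```python
    import json

    # The target URL and wait time from the SplashRequest above.
    payload = {
        'url': 'https://item.jd.com/4736647.html?cpdad=1DLSUE',
        'wait': 0.5,  # seconds Splash waits for JavaScript before returning HTML
    }

    # scrapy-splash POSTs a JSON body like this to <SPLASH_URL>/render.html
    body = json.dumps(payload)
    print(body)
    ```
    
    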
    

    This prints the scraped product price:

    [Image: screenshot of the scraped product price]

    Now we need to add a proxy middleware to our Scrapy project:

    middlewares.py

      class ProxyMiddleware(object):
          def process_request(self, request, spider):
              # proxyServer (proxy URL) and proxyAuth (auth header value)
              # are assumed to be defined elsewhere in the module
              request.meta['splash']['args']['proxy'] = proxyServer
              request.headers['Proxy-Authorization'] = proxyAuth
    
    • Note that with Splash the proxy is no longer set via request.meta['proxy'] = proxyServer but via request.meta['splash']['args']['proxy'] = proxyServer.
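    The distinction matters because scrapy-splash keeps everything destined for Splash in the nested request.meta['splash']['args'] dict. A minimal stand-alone sketch of that structure (proxy_server is a hypothetical placeholder; the setdefault guard avoids a KeyError if a non-Splash request ever reaches the middleware):

    ```python
    # Hypothetical proxy endpoint -- replace with a real one.
    proxy_server = 'http://user:pass@proxy.example.com:8888'

    # Shape of the meta dict that SplashRequest creates on each request:
    meta = {'splash': {'args': {'wait': 0.5}}}

    # Inject the proxy into the Splash args (guarding against missing keys),
    # rather than setting the top-level meta['proxy'] used for plain requests:
    meta.setdefault('splash', {}).setdefault('args', {})['proxy'] = proxy_server

    print(meta['splash']['args'])
    ```
    
    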

    Next, add ProxyMiddleware to settings.py:

      DOWNLOADER_MIDDLEWARES = {
          'scrapy_splash.SplashCookiesMiddleware': 723,
          'scrapy_splash.SplashMiddleware': 725,
          'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
          'Spider.middlewares.ProxyMiddleware': 843,
      }
    
    • The custom middleware's priority must come after the scrapy-splash middlewares (hence 843, higher than 725).
    With that, you can happily scrape data through a proxy with scrapy-splash!


        Original link: https://www.haomeiwen.com/subject/drnddxtx.html