scrapy+selenium+chrome headless

作者: 羽煊 | 来源:发表于2017-12-06 10:53 被阅读0次

scrapy+selenium+chrome headless
Statefulset
Service之headless和statefulSet结合
kubernetes的Headless Services
初识phantomjs（二）
Chrome无头浏览器-selenium3.0
Puppeteer和Headless chrome的简介和应用
用Python驱动Headless Chrome
Java爬虫:chrome headless
React Native Headless JS（后台任务)

在用scrpay写爬虫的时候对于一些js动态页面会需要一些自动化的工具来分析页面，selenium+phantomJs 是一个不错的选择，但是在使用过程中发现了一个很头痛的问题，当解析页面超时时，phantomJs就一直卡死。对于selenium + chrome这种我一直都很排斥，原因是要打开浏览器的可视化界面。
正当一筹莫展时发现了chrome headless。headless模式支持Chromium和Blink渲染引擎提供的所有现代网页平台的特征都转化成命令行。
代码如下：

from selenium import webdriver
from scrapy.http import HtmlResponse
from selenium.webdriver.chrome.options import Options

class SeleniumMiddleware(object):
  def process_request(self,request,spider):
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.binary_locaion = '/usr/bin/google-chrome-stable'
    capabilities = {}
    capabilities['platform'] = 'Linux'
    capabilities['version'] = '16.04'
    if spider.name == "music163":
      print 'start parse {0}'.format(request.url)
      #driver = webdriver.PhantomJS()
      driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver',chrome_options=options,desired_capabilities=capabilities)
      try:
        driver.get(request.url)
        driver.switch_to.frame('g_iframe')
        body = driver.page_source
        print 'finished parse {0}'.format(request.url)
        return HtmlResponse(driver.current_url,body=body,encoding='utf-8',request=request)
      except:
        driver.quit()

selenium要使用chrome需要chromedriver驱动协助，ChromeDriver通过chrome的自动代理框架控制浏览器，chromedriver和chrome的版本要对于匹配，否则无法运行chrome.
chromedriver和chrome的版本映射表，见： http://chromedriver.storage.googleapis.com/2.33/notes.txt

网友评论

我爱编程

本文标题：scrapy+selenium+chrome headless

本文链接：https://www.haomeiwen.com/subject/tfdkixtx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

scrapy+selenium+chrome headless

相关文章