Using Python + Selenium

Author: 风一样的存在 | Published 2019-01-05 14:33 | Read 0 times

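Sometimes every request carries cookies and headers, yet packet capture does not reveal where they come from. Neither Scrapy nor requests can execute JavaScript, so they can only fetch static pages. scrapy-splash can render dynamic pages, but it requires running a separate Splash service. In that situation Selenium is the more convenient choice; it supports Chrome, Firefox, and other browsers.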

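    import urllib.parse

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions
    from selenium.webdriver.support.ui import WebDriverWait


    class SpdSpider(scrapy.Spider):
        # name, start_url, cookies_url and form_data_list are not shown in the original post

        # Accept and forward Scrapy's arguments (see problem 2 below)
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            chrome_options = Options()
            chrome_options.add_argument('--disable-gpu')
            chrome_options.add_argument('--hide-scrollbars')
            # Uncomment to run without a visible browser window
            # chrome_options.add_argument('--headless')
            # Selenium 3 style; Selenium 4 takes the driver path through a Service object instead
            self.browser = webdriver.Chrome(executable_path='/opt/webdriver/chrome/chromedriver',
                                            options=chrome_options)
            self.browser.set_page_load_timeout(30)

        # Override start_requests
        def start_requests(self):
            cookies = self.convert_cookies(self.get_cookies())
            for form_data in self.form_data_list:
                yield scrapy.FormRequest(self.start_url, method="POST", cookies=cookies, formdata=form_data,
                                         dont_filter=True)

        # Get cookies through webdriver
        def get_cookies(self):
            cookies = []
            try:
                # Load the page inside try so the browser is closed even if loading times out
                self.browser.get(self.cookies_url)
                WebDriverWait(self.browser, 100).until(
                    expected_conditions.element_to_be_clickable((By.XPATH, "//a[@class='searchbutton']")))
                cookies = self.browser.get_cookies()
            except Exception as e:
                self.logger.error("Failed to get cookies: %s", e)
            finally:
                # Close the browser
                self.browser.quit()
            return cookies

        # Convert the webdriver cookie list to a name -> value dict
        def convert_cookies(self, cookies):
            newcookies = {}
            for cookie in cookies:
                newcookies[cookie['name']] = cookie['value']
            return newcookies

        # Convert urlencoded form data to a dict
        def fromData2Dict(self, formData):
            # parse_qsl splits on & and = first, then decodes %XX escapes and turns + into spaces
            return dict(urllib.parse.parse_qsl(formData, keep_blank_values=True))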
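
The two helper steps in the spider, flattening the webdriver cookie list and parsing an urlencoded form body, can be tried without a browser. This is a minimal sketch: the function names `cookies_to_dict` and `form_data_to_dict` and all sample values are made up for illustration and do not come from the original spider.

```python
from urllib.parse import parse_qsl


def cookies_to_dict(cookies):
    """Collapse Selenium-style cookie records into a name -> value mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def form_data_to_dict(raw):
    """Parse an urlencoded body; decoding happens after splitting on & and =."""
    return dict(parse_qsl(raw, keep_blank_values=True))


# Sample data shaped like the output of browser.get_cookies() (values are invented)
sample_cookies = [
    {"name": "JSESSIONID", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "token", "value": "xyz", "domain": "example.com", "path": "/"},
]

print(cookies_to_dict(sample_cookies))
print(form_data_to_dict("keyword=hello+world&note=a%26b%3Dc&page=1"))
```

Decoding after splitting matters when a value itself contains an encoded & or =, as the `note` field does here: unquoting the whole string first would split that value in the wrong place.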
    
    

Enable headless mode so that no window is shown (problem encountered: page elements could no longer be found):

    chrome_options.add_argument('--headless')
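
Headless Chrome starts with a fairly small default window, and some pages lay out or hide elements differently at that size, which is one commonly reported reason for elements not being found. The post does not establish whether that was the cause here. The sketch below only assembles the switches; the helper name `build_chrome_args` and the 1920x1080 size are illustrative assumptions.

```python
def build_chrome_args(headless=True, window_size=(1920, 1080)):
    """Return Chrome command-line switches as a list of strings (hypothetical helper)."""
    args = ["--disable-gpu", "--hide-scrollbars"]
    if headless:
        args.append("--headless")
        # An explicit window size is something to try, not a confirmed fix for this site
        args.append("--window-size={},{}".format(*window_size))
    return args


for arg in build_chrome_args():
    print(arg)
```

Each returned string can be handed to chrome_options.add_argument(...) in the spider's initializer.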
    

Disable the sandbox:

    chrome_options.add_argument('--no-sandbox')
    

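Summary of problems encountered:
1. Everything ran fine on macOS, but on Linux it kept failing because the DevToolsActivePort file could not be found. Many blog posts, both Chinese and international, recommend disabling the sandbox, but that made no difference.
For example, with these flags set, Chrome still failed with the traceback that follows: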

    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-setuid-sandbox')
    
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
        desired_capabilities=desired_capabilities)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
      (unknown error: DevToolsActivePort file doesn't exist)
      (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
      (Driver info: chromedriver=2.45.615279 (12b89733300bd268cff3b78fc76cb8f3a7cc44e5),platform=Linux 3.10.0-327.el7.x86_64 x86_64)
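
A non-headless Chrome needs an X display to draw on, and a server without a desktop usually has none. That is consistent with the virtual display that resolves the problem later in this post. A minimal pre-flight check, assuming a hypothetical helper named `has_x_display`:

```python
import os


def has_x_display(env=None):
    """Return True when a non-empty DISPLAY variable is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY", "").strip())


print("X display available:", has_x_display())
```

If this prints False on the server, either headless mode or a virtual display is needed before webdriver.Chrome() can start a regular browser window.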
    

Adding headless mode made it run, but page elements could not be found:

    2019-01-08 16:43:00 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
    [2019-01-08 16:43:00] 140734813173184 POST http://127.0.0.1:56931/session/cd22b1e86a32e3f65f5b2fb0a0795a49/element {"using": "xpath", "value": "//a[@class='searchbutton']", "sessionId": "cd22b1e86a32e3f65f5b2fb0a0795a49"}
    2019-01-08 16:43:00 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:56931/session/cd22b1e86a32e3f65f5b2fb0a0795a49/element {"using": "xpath", "value": "//a[@class='searchbutton']", "sessionId": "cd22b1e86a32e3f65f5b2fb0a0795a49"}
    [2019-01-08 16:43:00] 140734813173184 http://127.0.0.1:56931 "POST /session/cd22b1e86a32e3f65f5b2fb0a0795a49/element HTTP/1.1" 200 358
    2019-01-08 16:43:00 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56931 "POST /session/cd22b1e86a32e3f65f5b2fb0a0795a49/element HTTP/1.1" 200 358
    

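While reading other blog posts I realized that the Linux server has no graphical interface, and I came across Xvfb: it performs all graphical operations in memory and needs no display device. I installed it to see whether it would solve the problem: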

    yum install Xvfb
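
Installing Xvfb does not change anything by itself: an Xvfb server has to be running, and Chrome has to be pointed at it through the DISPLAY variable. The sketch below only builds the command line and the environment for that. The helper names are hypothetical, and display number 99 and the 800x800x24 screen are illustrative choices.

```python
import os


def xvfb_command(display_num=99, width=800, height=800, depth=24):
    """Build the argv that starts an Xvfb server on the given display number."""
    screen = "{}x{}x{}".format(width, height, depth)
    return ["Xvfb", ":{}".format(display_num), "-screen", "0", screen]


def env_with_display(display_num=99, base_env=None):
    """Return a copy of the environment with DISPLAY pointing at that server."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = ":{}".format(display_num)
    return env


print(" ".join(xvfb_command()))
print(env_with_display(base_env={})["DISPLAY"])
```

The argv could be passed to subprocess.Popen and the environment to the process that launches the browser. pyvirtualdisplay, used later in this post, takes care of both steps.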
    

The same error persisted, so I decided to try an older Chrome version. First I checked the Linux version information:

    [root@localhost google]# uname -a
    Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
    

I uninstalled the current google-chrome (google-chrome-stable-71.0.3578.98) and installed an older one (google-chrome-stable-62.0.3202.94), and changed chromedriver from 2.45.615279 to 2.33.506092.
It still failed:

    File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
        desired_capabilities=desired_capabilities)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
      (Driver info: chromedriver=2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4),platform=Linux 3.10.0-327.el7.x86_64 x86_64)
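
Before swapping versions it can help to check whether the installed Chrome falls inside the range a given ChromeDriver supports. A small sketch with hypothetical helper names follows. The supported ranges used in the example (70 to 72 for ChromeDriver 2.45, 60 to 62 for 2.33) are quoted from memory of the ChromeDriver release notes and should be verified there.

```python
import re


def major_version(version_text):
    """Extract the major number from text containing a dotted four-part version."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError("no four-part version found in {!r}".format(version_text))
    return int(match.group(1))


def chrome_supported(chrome_version_text, supported_range):
    """Check whether Chrome's major version lies in the inclusive (low, high) range."""
    low, high = supported_range
    return low <= major_version(chrome_version_text) <= high


# Ranges are assumptions to verify against the ChromeDriver release notes
print(chrome_supported("google-chrome-stable-71.0.3578.98", (70, 72)))
print(chrome_supported("google-chrome-stable-62.0.3202.94", (60, 62)))
```

If both pairs are in range, as these assumed ranges suggest, a version mismatch is an unlikely cause of the crash.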
    

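However, the error was different from before (the DevToolsActivePort message was gone), so it felt like one step closer to success.
After more searching, I installed pyvirtualdisplay: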

    pip install pyvirtualdisplay
    

Using it in code:

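    from pyvirtualdisplay import Display
    from selenium import webdriver

    # Start a virtual X display (backed by Xvfb) before launching Chrome
    display = Display(visible=0, size=(800, 800))
    display.start()
    driver = webdriver.Chrome()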
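
The snippet above never stops the display, so a long-running crawler may leave Xvfb processes behind. One way to guarantee cleanup is a small context manager. This is a sketch only: the helper `virtual_display` and its `display_cls` parameter are hypothetical, and the parameter exists so the helper can be exercised with a stand-in class when pyvirtualdisplay is not installed.

```python
from contextlib import contextmanager


@contextmanager
def virtual_display(size=(800, 800), display_cls=None):
    """Start a virtual display and always stop it on exit (hypothetical helper)."""
    if display_cls is None:
        # Imported lazily so the helper can be tried with a stand-in class
        from pyvirtualdisplay import Display as display_cls
    display = display_cls(visible=0, size=size)
    display.start()
    try:
        yield display
    finally:
        # Runs even if the body raises, so the Xvfb process is not leaked
        display.stop()
```

A typical use would be to create the driver inside `with virtual_display():` and call driver.quit() in its own finally block before the display is stopped.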
    

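The persistence paid off and the code ran perfectly.
2. Defining the initializer in a Scrapy spider: in my local Python 3.7 environment, defining it as __init__(self) worked, but the Linux Python 3.6 environment raised an error, even though both should be using the same Scrapy version, 3.5.1. The form that works on Linux with Python 3.6: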

    def __init__(self, *args, **kwargs):
        super(SpdSpider, self).__init__(*args, **kwargs)
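
The post does not show the error, so the cause is not confirmed. One plausible explanation is that Scrapy instantiates the spider itself and may pass arguments to it (for example those given with -a on the command line), which an __init__(self) that accepts nothing would reject. The sketch below reproduces that situation in plain Python; `BaseSpider` and its `create` method are stand-ins invented for illustration, not Scrapy's real classes.

```python
class BaseSpider:
    """Stand-in for a framework base class; not Scrapy's real implementation."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)

    @classmethod
    def create(cls, *args, **kwargs):
        # The framework, not user code, decides which arguments get passed in
        return cls(*args, **kwargs)


class StrictSpider(BaseSpider):
    def __init__(self):
        super().__init__(name="strict")


class ForwardingSpider(BaseSpider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


spider = ForwardingSpider.create(name="spd", category="books")
print(spider.name, spider.category)

try:
    StrictSpider.create(category="books")
except TypeError as exc:
    print("TypeError:", exc)
```

Forwarding *args and **kwargs keeps the subclass working regardless of which arguments the framework decides to supply.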
    


Article link: https://www.haomeiwen.com/subject/myirrqtx.html