Selenium

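Selenium is a browser-automation suite: a collection of tools and libraries for automating web browsers. It provides a platform that follows the W3C WebDriver specification and exposes an interface compatible with nearly every major browser on the market. To control a browser through the Selenium API, you only need to create a Selenium WebDriver instance and download the driver executable for that browser. Note that the driver must sit in a directory on the system path (the PATH environment variable) so that Selenium can find it. The driver file must also be executable (you can fix this with chmod a+x chromedriver).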

Because of this broad compatibility, Selenium is now widely used for automated front-end testing and for driving more complex web crawlers.

Running Selenium WebDriver in a headless environment

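By default, WebDriver launches a browser, runs the steps specified in the script inside that browser, and then exits. This requires a GUI display. To run scripts from the command line in an environment without a GUI (for example, a Linux server), some configuration is needed.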
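Whether a GUI display is available can be checked from Python before a browser is launched. The sketch below is only a heuristic and assumes a Linux host where X11 or Wayland sessions export DISPLAY or WAYLAND_DISPLAY; the helper name has_display is hypothetical.

```python
import os


def has_display(env=None):
    """Return True if the environment advertises an X11 or Wayland display.

    Heuristic only: it inspects environment variables and does not try
    to connect to the display server.
    """
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
```

If has_display() returns False, one of the two methods described below is needed before a normal (non-headless) browser can start.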

First, install Firefox or Chromium.

$ sudo apt-get update
$ sudo apt-get install chromium-browser

or

$ sudo apt-get update
$ sudo apt-get install firefox

Next, install Selenium.

$ sudo pip install selenium

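Finally, download the driver for the chosen browser (chromedriver for Chrome/Chromium, geckodriver for Firefox) and place it in a directory on PATH.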
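A quick way to confirm that the driver is both on PATH and executable is to look it up the same way a shell would. This is a minimal sketch using only the standard library; the helper name find_driver and its path parameter are illustrative assumptions, not part of Selenium.

```python
import os
import shutil


def find_driver(name="chromedriver", path=None):
    """Return the full path of an executable driver, or None if not found.

    shutil.which searches the directories in PATH (or in `path`, if given)
    and only accepts files that carry the executable permission.
    """
    return shutil.which(name, mode=os.X_OK, path=path)
```

For example, find_driver("geckodriver") returning None means that either the file is missing from PATH or it still needs chmod a+x.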

Method 1: Use Xvfb to create a virtual X display

First, install Xvfb (X virtual framebuffer) by running:

$ sudo apt-get update
$ sudo apt-get install xvfb

Then start Xvfb and assign it a display number (54321 in this example).

$ Xvfb :54321 -ac &

Next, set the DISPLAY environment variable to the display number chosen in the previous step.

$ export DISPLAY=:54321

Finally, check that the browser starts correctly.

$ firefox

or

$ chromium-browser

If the browser starts without errors, the setup is complete; press Ctrl-C to exit.

The Selenium WebDriver script can now run as usual, and its code does not need to change at all. That is the advantage of this method: once Xvfb is configured, any WebDriver can run on the headless server.
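The Xvfb setup above (start the server, then export DISPLAY) can also be scripted from Python so that a test runner prepares the display itself. The following is a rough sketch, not a hardened implementation: the function names are hypothetical, and the fixed one-second startup wait is an assumption that may need tuning.

```python
import os
import shutil
import subprocess
import time


def xvfb_command(display_number):
    """Build the same command line as `Xvfb :<n> -ac`."""
    return ["Xvfb", ":{}".format(display_number), "-ac"]


def start_xvfb(display_number=54321, startup_wait=1.0):
    """Start Xvfb in the background and point DISPLAY at it.

    Returns the Popen handle so the caller can terminate() it later.
    """
    if shutil.which("Xvfb") is None:
        raise RuntimeError("Xvfb is not installed or not on PATH")
    process = subprocess.Popen(xvfb_command(display_number))
    # Crude wait; assumes Xvfb is ready within this time.
    time.sleep(startup_wait)
    if process.poll() is not None:
        raise RuntimeError("Xvfb exited early (is the display number in use?)")
    # Browsers started by this process from now on use the virtual display.
    os.environ["DISPLAY"] = ":{}".format(display_number)
    return process
```

A script would call start_xvfb() once before creating the WebDriver and call terminate() on the returned handle when it finishes.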

Below is a simple Python + Selenium WebDriver example.

import json

from selenium import webdriver


def get_redirection_chain(url):
    """
    Given a url, return the urls in the redirection chain and the length of the chain.
    The chain is collected with a Selenium-driven Chrome browser and read from the
    browser's performance log.

    :param url: the url that will be checked.
    :return: (
        length of the redirection chain,
        a list of the urls in the chain, in the order they were visited,
    )
    """
    # landing_urls records url -> intermediate urls -> final url
    landing_urls = [url]
    curr_url = url

    options = webdriver.ChromeOptions()
    # Enable the performance log. ChromeDriver 75+ expects the vendor-prefixed
    # 'goog:loggingPrefs' key; older releases used a plain 'loggingPrefs'
    # entry passed through desired_capabilities.
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)

        for log in driver.get_log('performance'):
            # Each log record wraps a JSON-encoded DevTools event
            log_entry = json.loads(log['message'])
            params = log_entry['message']['params']

            # Only requests issued in response to a redirect carry this key
            if 'redirectResponse' not in params:
                continue
            if params['redirectResponse']['url'] == curr_url:
                redirect_url = params['request']['url']
                landing_urls.append(redirect_url)
                curr_url = redirect_url
    finally:
        # quit() ends the session and stops the driver process
        driver.quit()

    return len(landing_urls), landing_urls


if __name__ == '__main__':
    print(get_redirection_chain('http://facebook.com/'))
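To make the log-parsing step easier to follow, here is the same chain-following logic isolated as a pure function that runs without a browser. The log records in SAMPLE_LOG are hand-written, hypothetical stand-ins for what the performance log returns (real records carry many more fields), and the helper names are illustrative.

```python
import json


def chain_from_performance_log(entries, start_url):
    """Follow redirectResponse links through performance-log entries."""
    chain = [start_url]
    current = start_url
    for entry in entries:
        params = json.loads(entry["message"])["message"].get("params", {})
        redirect = params.get("redirectResponse")
        # Skip events that are not redirects away from the current url
        if redirect is None or redirect.get("url") != current:
            continue
        current = params["request"]["url"]
        chain.append(current)
    return chain


def fake_entry(method, params):
    """Wrap an event the way the performance log does (JSON inside 'message')."""
    return {"message": json.dumps({"message": {"method": method, "params": params}})}


# Hypothetical log: http -> https -> www, with one unrelated event in between
SAMPLE_LOG = [
    fake_entry("Network.requestWillBeSent",
               {"request": {"url": "http://a.example/"}}),
    fake_entry("Network.requestWillBeSent",
               {"request": {"url": "https://a.example/"},
                "redirectResponse": {"url": "http://a.example/"}}),
    fake_entry("Network.responseReceived",
               {"response": {"url": "https://a.example/"}}),
    fake_entry("Network.requestWillBeSent",
               {"request": {"url": "https://www.a.example/"},
                "redirectResponse": {"url": "https://a.example/"}}),
]
```

Calling chain_from_performance_log(SAMPLE_LOG, "http://a.example/") walks the two redirects and ignores the other events.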

Method 2: Use the browser's built-in headless mode

In fact, both Chrome and Firefox have shipped a built-in headless mode since 2017, which dealt a heavy blow to lightweight headless browsers such as PhantomJS.

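The advantage of this method is that Selenium scripts run in a GUI-less environment with no extra system configuration. The drawback is that it depends on the browser providing a headless mode.

Taking Chromium (or Chrome) as an example, making a Selenium script run the browser headless only requires adding one option, as the following example shows.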

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--headless')

# Pass the options when creating the driver
driver = webdriver.Chrome(options=options)

Accordingly, the program from the Method 1 example becomes:

import json

from selenium import webdriver


def get_redirection_chain(url):
    """
    Given a url, return the urls in the redirection chain and the length of the chain.
    The chain is collected with a Selenium-driven headless Chrome browser and read
    from the browser's performance log.

    :param url: the url that will be checked.
    :return: (
        length of the redirection chain,
        a list of the urls in the chain, in the order they were visited,
    )
    """
    # landing_urls records url -> intermediate urls -> final url
    landing_urls = [url]
    curr_url = url

    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    # The only change needed for a GUI-less host: run the browser headless
    options.add_argument('--headless')
    # Enable the performance log. ChromeDriver 75+ expects the vendor-prefixed
    # 'goog:loggingPrefs' key; older releases used a plain 'loggingPrefs'
    # entry passed through desired_capabilities.
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)

        for log in driver.get_log('performance'):
            # Each log record wraps a JSON-encoded DevTools event
            log_entry = json.loads(log['message'])
            params = log_entry['message']['params']

            # Only requests issued in response to a redirect carry this key
            if 'redirectResponse' not in params:
                continue
            if params['redirectResponse']['url'] == curr_url:
                redirect_url = params['request']['url']
                landing_urls.append(redirect_url)
                curr_url = redirect_url
    finally:
        # quit() ends the session and stops the driver process
        driver.quit()

    return len(landing_urls), landing_urls


if __name__ == '__main__':
    print(get_redirection_chain('http://facebook.com/'))
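Since Firefox also offers a headless mode, the same approach should carry over with FirefoxOptions. The sketch below assumes geckodriver is installed and on PATH; the helper names headless_flag and make_headless_driver are hypothetical, and Selenium is imported lazily so the flag lookup can be used without it.

```python
# Command-line switch that enables headless mode in each browser
HEADLESS_FLAGS = {"chrome": "--headless", "firefox": "-headless"}


def headless_flag(browser):
    """Return the headless switch for 'chrome' or 'firefox' (case-insensitive)."""
    return HEADLESS_FLAGS[browser.lower()]


def make_headless_driver(browser="firefox"):
    """Create a headless WebDriver for the given browser.

    Requires the selenium package plus the matching driver executable.
    """
    from selenium import webdriver  # imported lazily; only needed here

    name = browser.lower()
    flag = headless_flag(name)
    if name == "chrome":
        options = webdriver.ChromeOptions()
        options.add_argument(flag)
        return webdriver.Chrome(options=options)
    options = webdriver.FirefoxOptions()
    options.add_argument(flag)
    return webdriver.Firefox(options=options)
```

Note that the performance-log capability used in the redirection example is specific to Chrome, so that script would not work unchanged with a Firefox driver.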