Scraping the Google Trends Top Charts with Selenium

Author: BossOx | Published 2017-02-21 21:47

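Using Google Trends as an example, this article summarizes several problems I ran into when scraping a fully dynamic web page, along with their solutions: how to wait for dynamically fetched content to finish loading, and what to do when a located element lies outside the visible area and cannot be clicked.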

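Note: this article is time-sensitive. It is only guaranteed to have been correct and working at the time of writing; once Google updates the site, the results of the code are unpredictable.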

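Also, I am an amateur intern rather than a professional programmer, and this is purely a memo to myself. Technically it is nothing fancy, so please bear with me.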

Background

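The Google Trends Top Charts page is an interesting one. As the name suggests, it lists the most popular search keywords for a given region and a given past month. Each category shows up to 10 entries by rank, and ties push the following ranks down. Analyzing this data gives a rough picture of online trends and current social hot topics.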

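The page offers no way to download its content, and collecting it by hand is quite inefficient, so naturally the old saying applies: if code can solve a problem, don't copy and paste. Well, that is just my own opinion.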

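This is a fully dynamically rendered page. With JavaScript disabled it is blank, and the page source turns out to be about 80% JS with only a little HTML. The traditional approach of fetching a static page and parsing its HTML tags cannot get at the content, so a dedicated technique is needed.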

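The best-known option is to use Selenium WebDriver to drive a real browser that renders the page, and then extract the information from what the browser produces.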

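There are plenty of getting-started tutorials for this framework, but they mostly repost one another; a long search turns up much the same content, none of it very specific or detailed. In actual use I fell into two big pits, and because few people offer a simple, effective fix, it took me quite a while to get out. I am writing my experience down here, first so that I can look it up if I forget, and second because it would be a nice bonus if it helps someone with the same confusion.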

Preparation

Environment: Python 3.5, Selenium 3.0.2, ChromeDriver. Setup details are omitted.

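The Google Trends Top Charts page is at https://trends.google.com/trends/topcharts. Parameters are appended after a # separator: geo is the region and date is the time (year and month), with parameters separated by &. For example, to query the chart for the United States in September 2016, append #geo=US&date=201609 to the URL. This is the standard URL fragment convention, so I will not elaborate.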
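
The URL convention described above can be sketched as a small helper. This is only an illustration: the name build_topcharts_url and the YYYYMM check are my own assumptions and are not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = 'https://trends.google.com/trends/topcharts'

def build_topcharts_url(geo, date):
    """Return the top-charts URL for a region code and a YYYYMM month.

    Hypothetical helper, not part of the original script.
    """
    if len(date) != 6 or not date.isdigit():
        raise ValueError('date must look like YYYYMM, e.g. 201609')
    # Parameters go after '#', joined by '&', as described in the text.
    return BASE_URL + '#' + urlencode([('geo', geo), ('date', date)])

print(build_topcharts_url('US', '201609'))
```

Building the URL in one place makes it easy to loop over several regions or months later.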

Add the import

from selenium import webdriver

Create the browser driver and open the target page

driver = webdriver.Chrome()
driver.get("https://trends.google.com/trends/topcharts#geo=US&date=201611")

Waiting for page content to load

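Scraping can only start once the target element has loaded; otherwise an element-not-found exception is raised. On a static page, everything is already in the browser when the page finishes loading. On a dynamically loaded page, however, dynamically fetched elements are not necessarily present when the page load completes, so you need to make sure the target element has finished loading before scraping it.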

Most solutions I found online were the rather feeble "hard" approach: pause the program for a fixed time and hope the page has finished loading.

import time

driver = webdriver.Chrome()
driver.get("https://trends.google.com/trends/topcharts#geo=US&date=201611")

# Wait for completion.
time.sleep(3)

# Extract information.

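This has many drawbacks. On one hand, because network conditions are unpredictable, there is no guarantee that the target element has loaded when the fixed wait ends. On the other hand, if loading finishes early, the remaining wait is wasted time. Neither is ideal.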

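The better choice is the official solution provided by Selenium: block the program until the target element is detected as visible, then move on to the next step.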

alecxe, MrE - Stack Overflow
You need to do this step by step checking the visibility of the elements you are going to interact with using Explicit Waits, do not use time.sleep() - it is not reliable and error-prone.

To do this, add the following imports

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

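Specify the first target element to look for, here the name of one category on the Google Trends page, located by XPath, and use the official "wait until" method to wait for it to load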

xpath = '//*[@id="djs-trending"]/div/a/div[1]/div/span'
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, xpath)))

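This way the program blocks while looking for the target element until the element can be found and is visible in the browser, that is, until it has loaded. Execution then resumes with the subsequent steps. If the element does not show up within the 10-second timeout, until() raises a TimeoutException instead.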
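
To make the difference from a fixed time.sleep() concrete, here is a conceptual sketch of the polling idea behind an explicit wait. It is not Selenium's actual implementation: wait_until, its parameters, and the element_ready stand-in are hypothetical names, and the built-in TimeoutError stands in for Selenium's TimeoutException so that the snippet runs without a browser.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out.

    Conceptual sketch only; not how Selenium implements WebDriverWait.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # Ready: return at once, no fixed delay is wasted.
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(poll_interval)

# Stand-in for "the element has been rendered": true from the third check on.
attempts = []

def element_ready():
    attempts.append(1)
    return 'element' if len(attempts) >= 3 else None

print(wait_until(element_ready, timeout=2.0, poll_interval=0.01))
```

The wait ends as soon as the condition holds, and a slow page leads to an explicit error rather than a silent miss.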

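Besides explicit waits, there are also implicit waits; the official documentation describes both in detail. Compared with explicit waits, an implicit wait is more of a set-once solution. It only takes the following setting

# Set timeout to 10 seconds.
driver.implicitly_wait(10)

to make every subsequent element lookup keep retrying for up to that long until the element appears in the DOM, which is much tidier than writing explicit waits. Note that an implicit wait only checks that the element exists, not that it is visible, and the Selenium documentation advises against mixing implicit and explicit waits.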

Clicking a target element that is outside the visible area

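Occasionally, not always, after locating the target element, sending it a click or keyboard event fails with a message that the element cannot receive the event, that another element would intercept it, or that the object cannot be found. If you are certain that no newly popped-up overlay is covering it, the likely cause is that the target element is not within the browser's visible area.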

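I am not sure of the underlying mechanism, but the fix is simple and blunt: scroll the target element into view. You can simulate scrolling down by sending key events to an element that accepts them, or do it with JS. The most precise and safest measure is to scroll the object straight to the top of the visible area, much like jumping to an in-page anchor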

# Scroll element to the top edge of the view.
driver.execute_script("return arguments[0].scrollIntoView();", element)

After that, mouse and keyboard events can be sent as usual.
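
The scroll-then-click step can be wrapped in a small helper, sketched below. The name scroll_into_view_and_click is hypothetical; the function only assumes that driver has an execute_script method and element has a click method, as Selenium's WebDriver and WebElement do.

```python
def scroll_into_view_and_click(driver, element):
    """Scroll element to the top edge of the viewport, then click it.

    Hypothetical helper; works with any driver/element pair that offers
    execute_script() and click().
    """
    # scrollIntoView() without arguments aligns the element with the top edge.
    driver.execute_script('arguments[0].scrollIntoView();', element)
    element.click()
```

In the script it would be called with the element returned by WebDriverWait(...).until(...), for example scroll_into_view_and_click(driver, element).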

Full code

# Get top 10 keywords in https://trends.google.com/trends/topcharts
# Boss Ox / 2017.02.20 / Beijing @ByteDance

import threading
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Settings.
URL = 'https://trends.google.com/trends/topcharts#geo=US&date='
SaveFolder = r'F:\Project\Python\GoogleTrends' + '\\'
ConcurrentNumber = 5

# Months to fetch, in YYYYMM format.
Dates = [
    '201611',
    '201610',
    '201609',
    '201608',
    '201607',
    '201606',
    '201605',
    '201604',
    '201603',
    '201602',
    '201601'
]

# Genres to fetch, acquired by category text in page, by XPath.
Genres = [
    '//*[@id="djs-trending"]/div/a/div[1]/div/span',
    '//*[@id="people-trending"]/div/a/div[1]/div/span',
    '//*[@id="authors-trending"]/div/a/div[1]/div/span',
    '//*[@id="childrens_tv_programs-trending"]/div/a/div[1]/div/span',
    '//*[@id="animals-trending"]/div/a/div[1]/div/span',
    '//*[@id="countries-trending"]/div/a/div[1]/div/span',
    '//*[@id="books-trending"]/div/a/div[1]/div/span',
    '//*[@id="cities-trending"]/div/a/div[1]/div/span',
    '//*[@id="celestial_objects-trending"]/div/a/div[1]/div/span',
    '//*[@id="whiskey-top"]/div/a/div[1]/div/span',
    '//*[@id="fast_food_restaurants-trending"]/div/a/div[1]/div/span',
    '//*[@id="governmental_bodies-top"]/div/a/div[1]/div/span',
    '//*[@id="politicians-trending"]/div/a/div[1]/div/span',
    '//*[@id="fashion_labels-top"]/div/a/div[1]/div/span',
    '//*[@id="baseball_players-trending"]/div/a/div[1]/div/span',
    '//*[@id="baseball_teams-trending"]/div/a/div[1]/div/span',
    '//*[@id="songs-top"]/div/a/div[1]/div/span',
    '//*[@id="automobile_models-trending"]/div/a/div[1]/div/span',
    '//*[@id="auto_companies-top"]/div/a/div[1]/div/span',
    '//*[@id="games-top"]/div/a/div[1]/div/span',
    '//*[@id="actors-trending"]/div/a/div[1]/div/span',
    '//*[@id="dog_breeds-trending"]/div/a/div[1]/div/span',
    '//*[@id="sports_teams-trending"]/div/a/div[1]/div/span',
    '//*[@id="films-trending"]/div/a/div[1]/div/span',
    '//*[@id="tv_shows-trending"]/div/a/div[1]/div/span',
    '//*[@id="reality_shows-trending"]/div/a/div[1]/div/span',
    '//*[@id="scientists-trending"]/div/a/div[1]/div/span',
    '//*[@id="basketball_players-trending"]/div/a/div[1]/div/span',
    '//*[@id="basketball_teams-top"]/div/a/div[1]/div/span',
    '//*[@id="us_governors-top"]/div/a/div[1]/div/span',
    '//*[@id="foods-top"]/div/a/div[1]/div/span',
    '//*[@id="energy_companies-top"]/div/a/div[1]/div/span',
    '//*[@id="medicines-top"]/div/a/div[1]/div/span',
    '//*[@id="soccer_players-trending"]/div/a/div[1]/div/span',
    '//*[@id="soccer_teams-trending"]/div/a/div[1]/div/span',
    '//*[@id="sports_cars-trending"]/div/a/div[1]/div/span',
    '//*[@id="programming_languages-top"]/div/a/div[1]/div/span',
    '//*[@id="athletes-trending"]/div/a/div[1]/div/span',
    '//*[@id="financial_companies-top"]/div/a/div[1]/div/span',
    '//*[@id="retail_companies-top"]/div/a/div[1]/div/span',
    '//*[@id="teen_pop_artists-trending"]/div/a/div[1]/div/span',
    '//*[@id="musicians-trending"]/div/a/div[1]/div/span',
    '//*[@id="beverages-top"]/div/a/div[1]/div/span',
    '//*[@id="colleges_universities-trending"]/div/a/div[1]/div/span',
    '//*[@id="cocktails-top"]/div/a/div[1]/div/span'
]

# Fetch the top items of every genre for the given month (e.g. '201611').
def getTrendsOnDate(month):
    url = URL + month
    driver = webdriver.Chrome() # PhantomJS can fail to extract the second item; cause unknown.
    results = {}

    try:
        for genre in Genres:
            # Load page.
            driver.get(url)

            # Wait until the genre title is visible; until() returns the element.
            element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, genre)))

            # Get genre text.
            genre_text = element.text

            # Scroll the element into view so that it can be clicked.
            driver.execute_script('arguments[0].scrollIntoView();', element)

            # Open genre sub-page.
            element.click()

            # Wait until the first item of the list is visible.
            first_item_xpath = '/html/body/div[23]/div[2]/div/div[1]/div/span/div/span[1]/div/a'
            WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, first_item_xpath)))

            # Extract information of top 10 items.
            items = []
            for i in range(1, 11):
                item_xpath = '/html/body/div[23]/div[2]/div/div[%d]/div/span/div/span[1]/div/a' % (i)
                items.append(driver.find_element(By.XPATH, item_xpath).text)

            # Store results.
            results[genre_text] = items
    except Exception as e:
        # If anything goes wrong, report it and output what we have so far.
        print('[ %s ] Stopped early: %r' % (month, e))
    finally:
        # Always write the (possibly partial) results and close the browser.
        outputResults(month, results)
        driver.quit()

# Write one tab-separated line per item: genre, keyword.
def outputResults(month, results):
    filename = SaveFolder + month + '.txt'

    try:
        with open(filename, 'w', encoding='utf_8_sig') as file:
            for genre_text, items in results.items():
                for item in items:
                    file.write('%s\t%s\n' % (genre_text, item))
        print('[ %s ] Completed.' % (month))
    except Exception as e:
        print('[ %s ] Error on writing file %s.\n           %s' % (month, filename, e.args))

# Program entry point.
while len(Dates) > 0:
    # Get data on target date.
    target = Dates.pop()
    print('[ %s ] task started.' % (target))
    # args must be a tuple of positional arguments.
    task = threading.Thread(target=getTrendsOnDate, args=(target,))
    task.start()

    # Limit the number of concurrent threads (the count includes the main thread).
    while threading.active_count() > ConcurrentNumber:
        time.sleep(0.2)

Summary

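This code still has plenty of room for improvement: the huge function should be split up and refactored, the page parsing tolerates very little variation, performance could be better, and there is the baffling loss of information when using the PhantomJS engine. But in keeping with the idea of "make it work first, then spend effort on making it good", I am quite happy just to have a usable tool, haha.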
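
One of the weaknesses listed above, the low tolerance of the parsing step, could be softened along the following lines. This is only a sketch: collect_texts and its get_text parameter are hypothetical. In the real script, get_text would wrap driver.find_element(By.XPATH, ...).text and missing would be Selenium's NoSuchElementException; here LookupError stands in so that the snippet runs without a browser.

```python
def collect_texts(get_text, limit=10, missing=(LookupError,)):
    """Collect get_text(1), get_text(2), ... and stop at the first missing item.

    Hypothetical helper, not part of the original script.
    """
    items = []
    for index in range(1, limit + 1):
        try:
            items.append(get_text(index))
        except missing:
            break  # Fewer than `limit` items on the page: keep what we have.
    return items

# Stand-in for a chart that only has three entries.
chart = {1: 'first', 2: 'second', 3: 'third'}
print(collect_texts(lambda i: chart[i]))
```

With this pattern, a category with fewer than 10 entries no longer aborts the whole month.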

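Python's GIL does limit what multithreading can achieve for CPU-bound work, but this task spends most of its time waiting for the browser, so running several threads at once still gives a clear speedup.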
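
As an alternative to the hand-written thread-count loop in the full script, the standard library's ThreadPoolExecutor can cap concurrency directly. The sketch below swaps in that technique; run_all is a hypothetical name, and the lambda is a stand-in for getTrendsOnDate so that the snippet runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(task, months, max_workers=5):
    """Run task(month) for every month, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of `months`; leaving the
        # with-block waits for all tasks to finish.
        return list(pool.map(task, months))

# Stand-in task; the real script would pass getTrendsOnDate instead.
print(run_all(lambda month: month + ' done', ['201601', '201602', '201603']))
```

The pool handles both the concurrency limit and waiting for completion, so no polling with time.sleep() is needed.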

One takeaway: spend more time studying the official documentation.

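A digression: the new semester has just begun and I picked "Computational Sociology" as an elective, successfully infiltrating the School of Information as a School of Economics student. Chatting after class, I happened to run into a friend I had met in the Programmer at RUC group, which was a nice coincidence. Compared with a slacker like me who works in fits and starts, they are far more passionate about studying computer science; when they talked about data structures and algorithms, which I have never studied, I felt rather ashamed. Their Jianshu account is CarbonCheney, with quite a few in-depth technical articles that are worth a read.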
