Recursively crawling the Wikipedia category tree with Python

Author: laotoutou | Source: published 2017-08-03 11:43, read 241 times


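This post uses Python to crawl Wikipedia's category tree. Wikipedia's categories are openly accessible, and a dump of the entire site's database is even available for download. Unfortunately the database schema is fairly complex, so a crawler is the simpler and more effective route, if a crude one.
Every Wikipedia entry belongs to one or more categories, and clicking upward through those categories eventually leads to Wikipedia's top-level category, 页面分类 ("page categories"). The link is the wiki 页面分类 category page, the same URL the script in this post opens. (If you cannot reach it, you are probably not yet bypassing the firewall. Domestic search engines block the keyword "VPN", so a simple workaround is to search for an updated hosts file and replace your existing hosts file with it; search Baidu for details.)
Wikipedia's scheme here is "classification by discipline", which includes commonly used trees such as the biological taxonomy. The 页面分类 entry shown in the image is the entry point for this crawl.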

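Figure: crawler entry point
Because Wikipedia loads the result of that click asynchronously, the data cannot be obtained by parsing the HTML directly, so we use selenium to simulate the click.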

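Figure: page structure
Next, look at the HTML source. Each node is a Section containing one Item and one Children element. The a link inside the Item holds the category name, and a Children element in turn contains multiple nodes (Sections).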
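
To make the Section / Item / Children nesting concrete, here is a small sketch that parses a hand-written snippet into a nested dict. The snippet is a simplified stand-in, not Wikipedia's actual markup. The class names CategoryTreeSection and CategoryTreeChildren are the ones the script uses, while CategoryTreeItem is my assumption based on the "Item" in the description. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for the category tree markup (assumption).
SNIPPET = """
<div class="CategoryTreeSection">
  <div class="CategoryTreeItem"><a>Science</a></div>
  <div class="CategoryTreeChildren">
    <div class="CategoryTreeSection">
      <div class="CategoryTreeItem"><a>Biology</a></div>
      <div class="CategoryTreeChildren"></div>
    </div>
    <div class="CategoryTreeSection">
      <div class="CategoryTreeItem"><a>Physics</a></div>
      <div class="CategoryTreeChildren"></div>
    </div>
  </div>
</div>
"""


def child_with_class(element, class_name):
    """Return the first direct child whose class attribute equals class_name."""
    for child in element:
        if child.get('class') == class_name:
            return child
    return None


def parse_section(section):
    """Turn one Section element into {'name': ..., 'children': [...]}."""
    item = child_with_class(section, 'CategoryTreeItem')
    children = child_with_class(section, 'CategoryTreeChildren')
    name = item.find('a').text
    subsections = []
    if children is not None:
        # Each Section inside Children is itself a node, so recurse.
        subsections = [
            parse_section(child) for child in children
            if child.get('class') == 'CategoryTreeSection'
        ]
    return {'name': name, 'children': subsections}


tree = parse_section(ET.fromstring(SNIPPET))
print(tree)
```

On the live page the Children element starts out empty and is only filled after a click, which is why the real crawler cannot work from static HTML like this.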

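Figure: a single node
The Section/Item/Bullet/Toggle element has three possible states: ▼, ►, or a blank space. This is what decides when the recursion stops: in the script, a node with no toggle element is treated as a leaf and the function returns.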
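
The three toggle states can be sketched as a small decision function plus a recursion over an in-memory tree. This is my reading of the three cases, not the script's exact logic: the script only checks whether a toggle element exists. The names `toggle_action` and `visit`, and the sample data, are hypothetical.

```python
def toggle_action(toggle_text):
    """Map the toggle symbol of a node to what the crawler should do next."""
    symbol = toggle_text.strip()
    if symbol == '►':
        return 'expand'   # collapsed: click first, then visit the children
    if symbol == '▼':
        return 'descend'  # already expanded: the children are in the page
    return 'stop'         # blank: no subcategories, recursion ends here


def visit(node, visited):
    """Pre-order walk that stops at nodes whose toggle is blank."""
    visited.append(node['name'])
    if toggle_action(node['toggle']) == 'stop':
        return
    # A real crawler would click the toggle here when the action is 'expand'.
    for child in node['children']:
        visit(child, visited)


# Hypothetical sample tree, just to exercise the exit condition.
sample = {'name': 'Root', 'toggle': '▼', 'children': [
    {'name': 'A', 'toggle': '►', 'children': [
        {'name': 'A1', 'toggle': ' ', 'children': []},
    ]},
    {'name': 'B', 'toggle': ' ', 'children': []},
]}

visited = []
visit(sample, visited)
print(visited)
```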

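from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


# Recursively crawl the wiki categories.
#   section: the CategoryTreeSection element of the current node
#   father:  number of the parent node (-1 for the root)
#   num:     one-element list used as a counter shared by all recursive calls
def get_data_of_one(section, father, num):
    # find_element(By..., ...) replaces the find_element_by_* helpers removed in Selenium 4
    name = section.find_element(By.TAG_NAME, 'a').text
    fp.write(str(father) + '-' + str(num[0]) + '\t' + name + '\n')
    # After writing the line, increment num[0]
    num[0] = num[0] + 1

    bullet = section.find_element(By.CLASS_NAME, 'CategoryTreeBullet')

    try:
        identifier = bullet.find_element(By.CLASS_NAME, 'CategoryTreeToggle')
    except NoSuchElementException:
        # No toggle: this node is a leaf, so stop recursing
        return

    # Click the toggle so the children get loaded
    identifier.click()

    children = section.find_element(By.CLASS_NAME, 'CategoryTreeChildren')
    # Explicit wait (up to 10 s) for the asynchronously loaded child sections
    sections = WebDriverWait(children, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'CategoryTreeSection')))
    # Number of the current node, which is the father of its children
    temp = num[0] - 1
    for item in sections:
        get_data_of_one(item, temp, num)


def main():
    tag_word = driver.find_element(By.LINK_TEXT, '页面分类')
    # Note: an XPath starting with // is evaluated against the whole document,
    # not relative to tag_word
    root_section = tag_word.find_element(By.XPATH, "//../../../div[@class='CategoryTreeSection']")
    # Entry point: the root has no father (-1) and numbering starts at 0
    get_data_of_one(root_section, -1, [0])


if __name__ == '__main__':
    # Append to the output file; UTF-8 so Chinese category names are written correctly
    fp = open('classification.txt', 'a', encoding='utf-8')

    # Selenium 4 takes the geckodriver path through a Service object
    driver = webdriver.Firefox(service=Service(executable_path='/Users/xiaoka/Desktop/geckodriver'))
    driver.get('https://zh.wikipedia.org/wiki/Category:%E9%A0%81%E9%9D%A2%E5%88%86%E9%A1%9E')

    try:
        main()
    finally:
        fp.close()
        # quit() ends the whole browser session, not just the current window
        driver.quit()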

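Storing the results: each node is given a number, num, and records its own number together with the number of its parent, father. Everything goes into a txt file, one node per line, in the form father-num, a tab, the name, then \n.
(Note: I use Firefox through webdriver here. Firefox must be installed on the machine first, and geckodriver has to be downloaded to drive it; see the geckodriver download page.)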
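
The numbering scheme is easier to see without a browser in the way. Here is a minimal sketch that writes an in-memory tree in the same father-num, tab, name format and then reads it back into lookup tables. The names `write_tree` and `read_tree` and the sample tree are hypothetical, and an in-memory buffer stands in for classification.txt.

```python
import io


def write_tree(node, father, counter, out):
    """Write one line per node in pre-order: father-num<TAB>name."""
    number = counter[0]
    out.write('%d-%d\t%s\n' % (father, number, node['name']))
    counter[0] += 1  # shared counter, same trick as the one-element list num
    for child in node['children']:
        write_tree(child, number, counter, out)


def read_tree(lines):
    """Rebuild {num: name} and {father: [child nums]} from the stored lines."""
    names = {}
    children = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        ids, name = line.split('\t', 1)
        # Split on the last '-' so the root's father, -1, stays intact.
        father, number = ids.rsplit('-', 1)
        names[int(number)] = name
        children.setdefault(int(father), []).append(int(number))
    return names, children


# Hypothetical sample tree standing in for crawled categories.
sample = {'name': 'Root', 'children': [
    {'name': 'A', 'children': [{'name': 'A1', 'children': []}]},
    {'name': 'B', 'children': []},
]}

buffer = io.StringIO()
write_tree(sample, -1, [0], buffer)
names, children = read_tree(buffer.getvalue().splitlines())
print(buffer.getvalue())
```

Because the root's father is -1, its line starts with "-1-0", so a reader has to split the ids on the last hyphen rather than the first.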


Article link: https://www.haomeiwen.com/subject/pxoulxtx.html
