A Simple BeautifulSoup Project: Scraping Liao Xuefeng's Python Tutorial

Author: 豌豆花下猫 | Published 2017-05-20 12:22 | Read 98 times

    # -*- coding: utf-8 -*-

    """
    A simple BeautifulSoup project: scrape the read counts of Liao Xuefeng's Python tutorial

    @author: yunpoyue
    """
    import urllib.request
    from bs4 import BeautifulSoup

    # URLs used below

    urlBegin = "http://www.liaoxuefeng.com"
    url = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"

    # Fetch the page's HTML

    html = urllib.request.urlopen(url).read()

    # Parse with BS4: soup is the whole page, soup2 is the part we want

    soup = BeautifulSoup(html, 'html.parser')
    menu = soup.find_all(id="x-offcanvas-left")  # table of contents in the left sidebar
    values = ','.join(str(v) for v in menu)
    soup2 = BeautifulSoup(values, 'html.parser')
    soup2 = soup2.find_all("ul", "uk-nav uk-nav-side")
    soup2 = soup2[1].find_all('a')  # the links in the table of contents

    # Collect chapter titles, chapter links, and read counts

    bookMenu = []
    bookMenuUrl = []
    readnumber = []
    # len(soup2) - 1 stops before the last link in the menu
    for i in range(0, len(soup2) - 1):
        bookMenu.append(soup2[i].get_text())
        bookMenuUrl.append(soup2[i].attrs['href'])
    for i in range(0, len(bookMenuUrl)):
        # Fetch and parse each chapter page in turn
        chapterCode = urllib.request.urlopen(urlBegin + bookMenuUrl[i]).read()
        chapterSoup = BeautifulSoup(chapterCode, 'html.parser')
        chapterResult = chapterSoup.find_all('span')
        # The third <span> on the page is taken to be the read count
        readnumber.append(chapterResult[2].get_text())

    # Write the results to a local file

    # The with-block closes the file once all lines are written
    with open('c://dev/python教程.txt', 'a', encoding='utf8') as f:
        for i in range(len(bookMenu)):
            f.write(bookMenu[i] + '-' + readnumber[i] + '\n')
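
The last two steps of the script, building each chapter URL and writing one "title-count" line per chapter, can be tried without touching the network. The following is only a minimal sketch, not the author's code: `chapter_url`, `save_counts`, the sample titles and counts, and the temporary output path are hypothetical names and data introduced here for illustration. It assumes the menu hrefs are site-relative paths, which is what concatenating them onto `urlBegin` in the script implies.

```python
import os
import tempfile
from urllib.parse import urljoin

URL_BEGIN = "http://www.liaoxuefeng.com"  # same site root as urlBegin in the script


def chapter_url(href):
    """Resolve a menu href against the site root (hypothetical helper)."""
    # urljoin also leaves an already-absolute href unchanged,
    # which plain string concatenation would not.
    return urljoin(URL_BEGIN, href)


def save_counts(path, titles, counts):
    """Append one 'title-count' line per chapter; return the lines written."""
    lines = [f"{title}-{count}\n" for title, count in zip(titles, counts)]
    # The with-block closes the file even if a write fails.
    with open(path, "a", encoding="utf8") as f:
        f.writelines(lines)
    return lines


# Hypothetical sample data standing in for the scraped values.
sample_titles = ["Chapter A", "Chapter B"]
sample_counts = ["100", "200"]
sample_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
written = save_counts(sample_path, sample_titles, sample_counts)
```

Using `zip` pairs each title with its count and stops at the shorter list, so a chapter whose count could not be read would not raise an `IndexError` at write time; whether silently dropping such a chapter is acceptable depends on the use case.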
