Learning Python Web Scraping (2): Displaying Wiki Page Data

Author: 语落心生 | Published: 2017-06-18 22:20

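Once we have decided which URL to request, the next step is to inspect the HTML structure of the page.
The Wikipedia article content sits in the element whose id is mw-content-text. Since we only need the entry links inside its p tags, the lookup path is mw-content-text -> p[0]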

[Image: 56565656.png]
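
The path mw-content-text -> p[0] can be tried offline before touching the live site. The following is a minimal sketch, assuming BeautifulSoup is installed; SAMPLE_HTML and first_paragraph are made-up names for illustration, and the snippet is a heavily simplified stand-in for a real Wikipedia page, not its actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for a Wikipedia article page
SAMPLE_HTML = """
<html><body>
<h1>Python</h1>
<div id="mw-content-text">
  <p>Python is a <a href="/wiki/Programming_language">programming language</a>.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""

def first_paragraph(html):
    """Return the first <p> under id="mw-content-text", or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="mw-content-text")
    if content is None:
        return None
    return content.find("p")

para = first_paragraph(SAMPLE_HTML)
first_text = para.get_text()          # text of the paragraph, tags stripped
first_href = para.find("a")["href"]   # href of the first link in that paragraph
print(first_text)
print(first_href)
```

Checking the result of find against None before going further avoids an exception on pages that do not have the expected element.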

We find that the links are structured as follows.
All the a tags for the entry links sit under the element whose id is mp-tfa.
The find hierarchy is mp-tfa -> a -> href

[Image: 56876586575.png]

Collecting the data

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Relative /wiki/ links that have already been seen
pages = set()

def getlinks(pageUrl):
    global pages
    html=urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj=BeautifulSoup(html,'html.parser')
    try:
        # Page title, first paragraph, and the first link under mp-tfa
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="mp-tfa").find("a").attrs['href'])
    except AttributeError:
        # find() returned None somewhere in the chain above
        print("This page is missing some attributes")

    # Follow every link whose href starts with /wiki/ and has not been seen yet
    for link in bsObj.findAll("a" , href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage=link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getlinks(newPage)

# Start from the site root
getlinks("")

Console output:

[Image: 09809809.png]

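An exception is thrown right after the a tag is looked up.
After re-checking the nesting order of the link, we change the path to mp-tfa -> p -> b -> a -> href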

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getlinks(pageUrl):
    global pages
    html=urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj=BeautifulSoup(html,'html.parser')
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="mp-tfa",style="padding:2px 5px").find("p").find("b").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes")

    for link in bsObj.findAll("a" , href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage=link.attrs['href']
                print("--------\n"+newPage)
                pages.add(newPage)
                getlinks(newPage)
getlinks("")

Console output:

[Image: 7978979.png]

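The cause is that the earlier analysis only covered the Main_Page page. Parsing the pages reached after following a link shows that they have no mp-tfa element at all.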

[Image: jhgjhgjh.png]
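
The failure can be reproduced without the network: find returns None when nothing matches, so the next .find in the chain raises AttributeError. Below is a minimal sketch, assuming BeautifulSoup is installed; MAIN_PAGE, ARTICLE and featured_href are hypothetical names, and the two snippets only imitate the shape of the real pages.

```python
from bs4 import BeautifulSoup

# Hypothetical snippets: one shaped like the main page, one like an ordinary article
MAIN_PAGE = '<div id="mp-tfa"><p><b><a href="/wiki/Featured">Featured</a></b></p></div>'
ARTICLE = '<div id="mw-content-text"><p><a href="/wiki/Other">Other</a></p></div>'

def featured_href(html):
    """Follow mp-tfa -> p -> b -> a and return the href, or None if a step is missing."""
    node = BeautifulSoup(html, "html.parser").find(id="mp-tfa")
    for tag_name in ("p", "b", "a"):
        if node is None:
            return None
        node = node.find(tag_name)
    return node.get("href") if node is not None else None

main_result = featured_href(MAIN_PAGE)      # the chain exists on the main-page snippet
article_result = featured_href(ARTICLE)     # no mp-tfa element, so None

# The unguarded chain fails on the article snippet, as in the console output above
try:
    BeautifulSoup(ARTICLE, "html.parser").find(id="mp-tfa").find("p")
    raised = False
except AttributeError:
    raised = True

print(main_result, article_result, raised)
```

Returning None for a missing step lets the caller decide what to do, instead of relying on a broad except block to hide which step failed.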

Change the lookup path to mw-content-text -> p -> a -> href

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getlinks(pageUrl):
    global pages
    html=urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj=BeautifulSoup(html,'html.parser')
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="mw-content-text").find("p").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes")

    for link in bsObj.findAll("a" , href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage=link.attrs['href']
                print("--------\n"+newPage)
                pages.add(newPage)
                getlinks(newPage)
getlinks("")

Console output: the entry links are retrieved successfully.

[Image: 867867867.png]
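
One caveat about getlinks: it calls itself for every new link and has no stopping condition, so on a real crawl it is likely to hit Python's recursion limit (1000 frames by default). The following sketch swaps the recursion for a queue-based, breadth-first loop with a page cap. It is not the code used above: crawl, fake_fetch_links and FAKE_SITE are hypothetical names, and the fake link table stands in for urlopen plus BeautifulSoup so the example runs offline.

```python
import re
from collections import deque

WIKI_LINK = re.compile(r"^(/wiki/)")  # same pattern as in the code above

# Hypothetical link table standing in for real pages
FAKE_SITE = {
    "": ["/wiki/A", "/wiki/B", "/w/index.php", "#top"],
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": [],
}

def fake_fetch_links(page_url):
    """Hypothetical replacement for urlopen + BeautifulSoup: return the hrefs on a page."""
    return FAKE_SITE.get(page_url, [])

def crawl(start, fetch_links, max_pages=100):
    """Collect each /wiki/ link once, breadth-first, stopping after max_pages links."""
    seen = set()
    order = []
    queue = deque([start])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        for href in fetch_links(page):
            if WIKI_LINK.match(href) and href not in seen:
                seen.add(href)
                order.append(href)
                queue.append(href)
                if len(order) >= max_pages:
                    break
    return order

visited = crawl("", fake_fetch_links)
print(visited)
```

To use this against the live site, fetch_links would be a function that downloads the page and returns the href values found by findAll; the max_pages cap keeps the run bounded either way.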