Notes on 《Python网络数据采集》 (Web Scraping with Python), Part 1

Author: Maql | Published: 2016-10-22 16:41

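These are my notes from reading 《Python网络数据采集》 (Web Scraping with Python). Chapters 1 and 2 mainly cover how to use the third-party library BeautifulSoup. Below are the examples from the book that I found most interesting. (Note: the book's author uses Python 3.x and urllib, while I use Python 2.x and requests; this makes no difference for learning BeautifulSoup.)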

Chapter 1:

Basic BeautifulSoup usage:

import requests
from bs4 import BeautifulSoup as bs

resp = requests.get(url='http://www.pythonscraping.com/pages/page1.html')
soup = bs(resp.content, 'html.parser')
print soup.h1

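The code above is a simple demo. The first two lines import requests and BeautifulSoup. The remaining three lines send a request and get back a response object, build a BeautifulSoup object that parses the response body with the html.parser parser, and print the h1 tag. However, this code is not robust at all: as soon as any exception is raised, the program stops.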

A better approach is to catch exceptions:

import requests
from bs4 import BeautifulSoup as bs

def getTitle(url):
    try:
        resp = requests.get(url=url)
        # Raise requests.exceptions.HTTPError on 4xx/5xx responses
        resp.raise_for_status()
    except requests.exceptions.RequestException as e:
        # Covers malformed URLs, connection failures, timeouts and HTTP errors
        print(e)
        return None
    try:
        soup = bs(resp.content, 'html.parser')
        # soup.body is None when the page has no <body>, so .h1 raises AttributeError
        title = soup.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print("title could not be found")
else:
    print(title)

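With exception handling in place, a mistyped URL or a failed attribute lookup no longer crashes the program: it reports the problem and keeps running.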
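To see the attribute-lookup failure without touching the network, here is a minimal offline sketch. The helper name safe_h1 and the inline HTML strings are made up for illustration and are not from the book. It shows that a missing tag comes back as None, and that chaining a lookup on that None is what raises AttributeError.

```python
from bs4 import BeautifulSoup

def safe_h1(html):
    """Return the <h1> inside <body>, or None if either tag is missing."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body was None, so ".h1" was looked up on None
        return None

# Hypothetical documents: one complete, one without <body>, one without <h1>
found = safe_h1("<html><body><h1>An Interesting Title</h1></body></html>")
no_body = safe_h1("<h1>No body tag here</h1>")
no_h1 = safe_h1("<html><body><p>No heading</p></body></html>")

print(found)
print(no_body)
print(no_h1)
```

Only the second call goes through the except branch; in the third, soup.body exists and soup.body.h1 simply evaluates to None.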

Chapter 2 (Advanced BeautifulSoup)

Use findAll to find every span tag whose class attribute is green or red

import requests
from bs4 import BeautifulSoup as bs

resp = requests.get(url='http://www.pythonscraping.com/pages/warandpeace.html')
soup = bs(resp.content, 'html.parser')
for name in soup.findAll('span', {'class': {'green', 'red'}}):
    print name.get_text()

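Note how the dictionary is used above: the tag name and the attribute dictionary are separate arguments, separated by a comma. soup.findAll('span', {'class': {'green'}}) can also be written as soup.findAll('span', class_='green'). The keyword is class_ with a trailing underscore because class is a reserved word in Python. Leaving out 'span' and calling soup.findAll(class_='green') matches tags of any name with that class.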
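The two spellings can be compared side by side on a small inline document. This is only a sketch: the HTML below is invented for the comparison (it is not the warandpeace.html page), and it uses find_all, the current name for which findAll is the older alias.

```python
from bs4 import BeautifulSoup

# Made-up markup: three spans and one non-span tag that shares a class
html = (
    '<p>'
    '<span class="red">Well, Prince</span>'
    '<span class="green">Anna Pavlovna</span>'
    '<span class="blue">ignored</span>'
    '<b class="green">not a span</b>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute dictionary: span tags whose class is green OR red
by_dict = [t.get_text() for t in soup.find_all("span", {"class": {"green", "red"}})]

# Keyword form without a tag name: any tag with class green
by_keyword = [t.get_text() for t in soup.find_all(class_="green")]

# Keyword form restricted to span tags
span_keyword = [t.get_text() for t in soup.find_all("span", class_="green")]

print(by_dict)
print(by_keyword)
print(span_keyword)
```

The middle query also picks up the b tag, which is why the tag name matters when the keyword form is used as a replacement for the dictionary form.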

Use children and descendants to find child nodes and descendant nodes

resp = requests.get(url='http://www.pythonscraping.com/pages/page3.html')
soup = bs(resp.content, 'html.parser')
for child in soup.find("table",{"id":"giftList"}).children:
    print child

Note that children are only the nodes one level below table, such as table > tr. Anything nested deeper, such as an img inside a tr, is not included.

for child in soup.find("table",{"id":"giftList"}).descendants:
    print child

This loop visits every node under table at any depth, i.e. all of its descendants.
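The difference is easy to check on a tiny table. The sketch below uses made-up inline HTML with the same giftList id rather than the real page3.html, and filters on Tag so that text nodes are left out of the comparison.

```python
from bs4 import BeautifulSoup, Tag

# Invented two-row table; written without whitespace so no stray text nodes appear
html = (
    '<table id="giftList">'
    '<tr><th>Item</th></tr>'
    '<tr><td><img src="a.jpg"></td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Direct children: only the nodes one level below <table>
child_names = [c.name for c in table.children if isinstance(c, Tag)]

# Descendants: every tag at any depth, in document order
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```

On a real page the loops usually also yield whitespace strings between tags, which is why the original loops print blank-looking lines.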

Use next_siblings to skip the table's header row (the row of th tags):

resp = requests.get(url='http://www.pythonscraping.com/pages/page3.html')
soup = bs(resp.content, 'html.parser')
for child in soup.find("table",{"id":"giftList"}).tr.next_siblings:
    print child

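Note: why does next_siblings filter out the th tags? Here .tr is the first row of the table, the one that holds the th headers. next_siblings yields only the siblings that come after the current node, not the node itself, so the header row is skipped.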
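A short offline sketch of the same idea, assuming a made-up three-row table (the row contents are invented, only the giftList id follows the example above):

```python
from bs4 import BeautifulSoup, Tag

# Invented table: one header row followed by two data rows
html = (
    '<table id="giftList">'
    '<tr><th>Item</th></tr>'
    '<tr><td>Fish</td></tr>'
    '<tr><td>Rock</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")

# .tr returns the first <tr>, which is the header row
header_row = soup.find("table", {"id": "giftList"}).tr

# next_siblings starts after header_row, so the header itself never appears
data_rows = [row.get_text() for row in header_row.next_siblings if isinstance(row, Tag)]

print(header_row.get_text())
print(data_rows)
```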

If anything in this post is poorly written or wrong, please leave a comment!
