Learning Python Web Scraping (Part 6): Enter BeautifulSoup

Author: 弃用中 | Published 2017-08-20 23:03 | Read 496 times

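In the earlier examples we always used regular expressions to pull out the information we wanted. In the previous installment in particular, we may have ended up with a rather long regular expression. Is there a more convenient way to extract information?
Of course: BeautifulSoup is a very good tool for exactly that.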

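A Quick Taste

To give you an intuitive feel for how convenient BeautifulSoup is, let's go straight to a code comparison.
Here is the regular-expression code for scraping the Douban Movie Top 250: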

import urllib.request
from urllib.request import Request
from urllib.parse import urlencode
import re
import random

base_url = 'https://www.douban.com/doulist/3936288/'
# Raw string so that \s reaches the regex engine unchanged
pattern = re.compile(r'<div\sclass="title">\s.*?<a.*?>(.*?)</a>', re.S)
user_agent = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
]
for i in range(0, 250, 25):
    # The paging offset goes in the query string; passing it as `data` would turn the request into a POST
    query = urlencode({'start': i})
    # Pick a random User-Agent and actually hand the headers to Request
    headers = {'User-Agent': random.choice(user_agent)}
    requ = Request(base_url + '?' + query, headers=headers)
    html = urllib.request.urlopen(requ).read().decode('utf-8')
    results = re.findall(pattern, html)
    for result in results:
        # Remove the newlines captured inside the link text
        result = re.sub('\n', '', result)
        print(result)

Here is the same task using BeautifulSoup:

import urllib.request
from urllib.request import Request
from urllib.parse import urlencode
import random
from bs4 import BeautifulSoup

base_url = 'https://www.douban.com/doulist/3936288/'
user_agent = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
]
for i in range(0, 250, 25):
    # The paging offset goes in the query string; passing it as `data` would turn the request into a POST
    query = urlencode({'start': i})
    # Pick a random User-Agent and actually hand the headers to Request
    headers = {'User-Agent': random.choice(user_agent)}
    requ = Request(base_url + '?' + query, headers=headers)
    html = urllib.request.urlopen(requ).read().decode('utf-8')
    # Everything below is what changed: no regular expression is needed
    soup = BeautifulSoup(html, 'lxml')
    for item in soup.select('.title'):
        a = item.select('a')[0]
        # strip=True trims the surrounding whitespace and newlines
        title = a.get_text(strip=True)
        print(title)

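The output looks like this:

(The list is long, so the rest of the films are omitted.)

As you can clearly see, this bowl of "beautiful soup" really is handy!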
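
The same comparison can be tried offline. The sketch below runs both approaches against a small, made-up HTML fragment (the markup and the example.com links are assumptions for illustration, not real Douban output) and uses the built-in html.parser so that only bs4 itself is required:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for one page of the list; not real Douban markup.
html = """
<div class="title">
  <a href="https://example.com/1">
    Movie One
  </a>
</div>
<div class="title">
  <a href="https://example.com/2">Movie Two</a>
</div>
"""

# Regular-expression version: the pattern has to spell out the markup.
pattern = re.compile(r'<div\sclass="title">\s.*?<a.*?>(.*?)</a>', re.S)
regex_titles = [t.strip() for t in pattern.findall(html)]

# BeautifulSoup version: select by CSS class, then read the link text.
soup = BeautifulSoup(html, "html.parser")
soup_titles = [a.get_text(strip=True) for a in soup.select(".title a")]

print(regex_titles)
print(soup_titles)
```

Both lists should hold the same two titles; the difference is that the selector version keeps working if attributes are added to the tags or the whitespace between them changes.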

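Below is a brief introduction to how BeautifulSoup is used.

Basic Usage

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. Beautiful Soup can save you hours or even days of work.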

Installation

Open a command-line tool and run pip install beautifulsoup4 to install it.

[Screenshot: pip output when the package is already installed]

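Installing a Parser

Besides the HTML parser in the Python standard library, BeautifulSoup also supports several third-party parsers, such as lxml. They can be installed with pip.

[Figure: table of parsers]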
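
Because a third-party parser may not be installed on every machine, a script can fall back to the standard-library parser. The following is a minimal sketch of that idea; make_soup is a hypothetical helper name, not part of bs4, while FeatureNotFound is the exception bs4 raises when the requested parser is unavailable:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is missing, so fall back to Python's own html.parser
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.get_text())
```

Different parsers can build slightly different trees from malformed HTML, so it is worth testing against whichever parser is actually used in production.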

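How to Use It

Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle.
This is what soup = BeautifulSoup(html,'lxml') does in the demo code.
There we tell Beautiful Soup to parse the document with the lxml parser, and it turns the complex HTML document into a tree structure.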
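
As a small illustration of the two input forms, the sketch below builds one soup from a string and another from a file-like object. io.StringIO stands in for a real file handle here (an assumption made so the example needs no file on disk), and html.parser is used so that nothing beyond bs4 is required:

```python
import io
from bs4 import BeautifulSoup

markup = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# 1) Build the soup from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2) Build the soup from a file-like object; io.StringIO stands in for
#    a handle returned by open("page.html") (hypothetical file name).
with io.StringIO(markup) as handle:
    soup_from_handle = BeautifulSoup(handle, "html.parser")

# Either way the result is a tree that can be walked by tag name.
print(soup_from_string.title.string)
print(soup_from_handle.title.parent.name)
```

The constructor reads the handle immediately, so the file can be closed as soon as the soup has been built.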

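[Figure: tree structure]

The official BeautifulSoup documentation covers far more. Here I introduce only one method for searching the document tree, find_all(), plus CSS selectors.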

find_all()

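This method searches the tree for elements that satisfy a filter and returns the matching content.
There are several kinds of filters, and we go through them one by one,
using the "Alice" document as the example: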

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

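Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for content that matches the string exactly. The following example finds all the <b> tags in the document: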

soup.find_all('b')
# [<b>The Dormouse's story</b>]

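Regular expressions

If you pass in a regular expression, Beautiful Soup filters with the pattern's search() method. Because search() can match anywhere in a tag name, the ^ anchor is what restricts the match to the start. The following example finds every tag whose name starts with b, which means both <body> and <b> are found: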

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
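
To see why the ^ anchor matters, the following sketch compares an anchored and an unanchored pattern on a tiny document (the markup is made up for illustration). It relies on Beautiful Soup applying a compiled pattern with search(), so an unanchored pattern matches anywhere in the tag name:

```python
import re
from bs4 import BeautifulSoup

doc = "<html><head><title>Demo</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Anchored: only tag names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains "t" somewhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```

The unanchored pattern picks up html even though that name does not begin with t.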

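Lists

If you pass in a list, Beautiful Soup returns content that matches any item in the list. The following code finds all the <a> tags and <b> tags in the document: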

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

True matches any value. The following code finds every tag in the document but returns none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

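Functions

If none of the other filters fits, you can define a function that takes an element as its only argument. The function returns True if the element matches and should be included, and False otherwise.

The function below checks the current element and returns True if it has a class attribute but no id attribute: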

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function to find_all() and you get all the <p> tags:

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

The result contains only <p> tags and no <a> tags, because the <a> tags also define "id". <html> and <head> are not returned either, because they do not define a "class" attribute.

CSS selectors

BeautifulSoup supports most CSS selectors. Pass a string to the .select() method of a Tag or of the BeautifulSoup object to find tags using CSS selector syntax:

soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Searching level by level through nested tags:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
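
Class and id selectors, like the .title selector in the opening demo, work the same way. The sketch below tries a few of them on a shortened copy of the example markup; it uses html.parser to avoid extra dependencies, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

# By CSS class
by_class = [a.get_text() for a in soup.select(".sister")]

# By id, then read an attribute from the matched tag
by_id = soup.select("#link2")[0]["href"]

# Direct children only
child_links = [a["id"] for a in soup.select("p.story > a")]

# select_one returns the first match, or None when nothing matches
first = soup.select_one("a.sister")

print(by_class, by_id, child_links, first.get_text())
```

select() always returns a list, so indexing with [0] raises IndexError when nothing matches; select_one() is the gentler choice when only the first hit is needed.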

For more CSS selector syntax, see http://www.w3school.com.cn/css/css_selector_type.asp.

The above is only a small part of the official documentation. For details, see https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#

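A Handy Trick

[Screenshot: Inspect Element in the browser]

Choose "Copy selector" and the element's CSS selector is copied for you. In other words... you never have to labor over writing CSS selectors by hand again!!!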
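
A copied selector can be pasted straight into select(). The sketch below shows the idea on a made-up page fragment; the ids, classes, and the selector string itself are assumptions written in the style a browser produces, not copied from a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the ids and classes are made up for illustration.
html = """
<div id="content">
  <div class="article">
    <div class="title"><a href="https://example.com/1">Movie One</a></div>
    <div class="title"><a href="https://example.com/2">Movie Two</a></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A selector in the style that "Copy selector" produces (assumed, not from a real page).
copied_selector = "#content > div.article > div.title > a"

titles = [a.get_text() for a in soup.select(copied_selector)]
print(titles)
```

A copied selector often pins down one specific element (for example with :nth-child), so it may need loosening before it matches every row of a list.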

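Acknowledgments

As of now, 1,430 people follow viljw, and I am enormously grateful. As a 24-karat nobody, I have never had this many followers on any account I have owned, so I am honestly a little overwhelmed!

That's all. Thank you, everyone, for following!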
