Last updated: 2018-02-05
1. An Overview of the Beautiful Soup Library
Beautiful Soup is a Python library for extracting data from HTML or XML files. Working with your favorite parser, it provides idiomatic ways of navigating, searching, and modifying the parse tree, and it routinely saves hours or even days of work.

1.1 Installing Beautiful Soup
At the command line (cmd on Windows), run: pip install beautifulsoup4
1.2 Understanding the Beautiful Soup Library

Simply put, Beautiful Soup converts an HTML document into a BeautifulSoup object, and we can then use that object's methods to extract the elements we need.
Note: Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".
1.3 Importing Beautiful Soup
The library is also known as beautifulsoup4 or bs4; either of the following imports works:
from bs4 import BeautifulSoup
import bs4

1.4 Basic Elements of the BeautifulSoup Class
A parsed document exposes five basic elements, each covered in detail in section 4: Tag (a tag together with its contents), Name (the tag's name, <tag>.name), Attributes (the tag's attributes, <tag>.attrs), NavigableString (the text inside a tag, <tag>.string), and Comment (a special kind of NavigableString holding comment text).
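A minimal sketch showing the five elements on a tiny document (a sketch only; the outputs below assume bs4 with the lxml parser installed):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p class="title"><b>text</b><!-- note --></p>', 'lxml')
>>> soup.p                    # Tag
<p class="title"><b>text</b><!-- note --></p>
>>> soup.p.name               # Name
'p'
>>> soup.p.attrs              # Attributes
{'class': ['title']}
>>> soup.b.string             # NavigableString
'text'
>>> type(soup.p.contents[1])  # Comment
<class 'bs4.element.Comment'>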
2. Beautiful Soup Parsers
The parser is chosen with the second argument of the BeautifulSoup constructor. The commonly used options are:
- 'html.parser': Python's built-in HTML parser; needs no extra installation
- 'lxml': the lxml HTML parser; fast; requires pip install lxml
- 'xml': the lxml XML parser; the only supported XML parser; requires pip install lxml
- 'html5lib': parses pages the way a web browser does; very tolerant but slow; requires pip install html5lib
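A minimal sketch of how the choice of parser affects the result (assuming lxml and html5lib have been installed with pip; the outputs are what these parsers typically produce for a fragment with a missing closing tag):
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<p>Hello', 'html.parser')   # built-in parser, no wrapping added
<p>Hello</p>
>>> BeautifulSoup('<p>Hello', 'lxml')          # lxml adds the html/body skeleton
<html><body><p>Hello</p></body></html>
>>> BeautifulSoup('<p>Hello', 'html5lib')      # html5lib parses like a browser
<html><head></head><body><p>Hello</p></body></html>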
3. Basic Usage of Beautiful Soup
The HTML below is incomplete: the body and html tags are never closed. We parse it with Beautiful Soup; soup.prettify() then pretty-prints the document, with the missing closing tags filled in automatically during parsing.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
The parsing code and the prettified output, with the missing tags completed, are shown below:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>> print(soup.title.string)
The Dormouse's story
4. Beautiful Soup Tag Selectors
4.1 Selecting Elements: Tag

Note:
- Any tag that appears in the HTML document can be accessed as soup.<tag>
- When the document contains several tags with the same name, soup.<tag> returns the first one
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(html,"lxml")
# the title tag
>>> print(soup.title)
<title>The Dormouse's story</title>
# the head tag
>>> print(soup.head)
<head><title>The Dormouse's story</title></head>
# the type of soup.title
>>> print(type(soup.title))
<class 'bs4.element.Tag'>
# the p tag; when several identical tags exist, the first one is returned
>>> print(soup.p)
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
4.2 Getting the Name: Tag.name
Every <tag> has its own name, obtained with <tag>.name; it is a string.

>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.title.name)  # get the name of the title tag
title
4.3 Getting Attributes: Tag.attrs

There are two ways to read an attribute, shown below:
>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.attrs["name"])  # method 1: via the attrs dictionary
dromouse
>>> print(soup.p["name"])  # method 2: dictionary-style access on the tag
dromouse
4.4 Getting Content: NavigableString

The text inside a tag is read with <tag>.string and is returned as a NavigableString:
>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.p.string
"The Dormouse's story"
4.5 Comment
When the content of a tag is an HTML comment, <tag>.string returns a Comment object, a special subclass of NavigableString. The comment markers (<!-- -->) are stripped from the text, so the type has to be checked explicitly if comments should be handled differently from ordinary strings.
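A minimal sketch using the link1 anchor from the snippets above (outputs assume the lxml parser; note the comment markers are gone from the returned string):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')   # same html as in 4.4
>>> soup.a.string
' Elsie '
>>> type(soup.a.string)
<class 'bs4.element.Comment'>
>>> type(soup.p.string)                  # ordinary text, for comparison
<class 'bs4.element.NavigableString'>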
4.6 Nested Selection
Tag access can be chained to walk down the tag tree:
>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.head.title.string)
The Dormouse's story
4.7 Children and Descendants: Traversing Down the Tag Tree

Three attributes traverse downward: .contents (a list of direct children), .children (an iterator over direct children), and .descendants (an iterator over all descendants).
>>> html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
4.7.1 contents: returns all direct children of the p tag as a list
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.contents)
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
4.7.2 children: returns an iterator over the direct children, so a loop is needed to read them
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.children)
<list_iterator object at 0x0000000003789EF0>
>>> for i, child in enumerate(soup.p.children):
        print(i, child)  # i receives the index, child receives the node
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
4.7.3 descendants
Also returns an iterator, but over all descendants: the span tag nested inside the first a tag is included too, which is the difference from children.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.descendants)
<generator object descendants at 0x000000000379C150>
>>> for i, child in enumerate(soup.p.descendants):
        print(i, child)
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <span>Elsie</span>
4 Elsie
5
6
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.
4.8 Parents and Ancestors: Traversing Up the Tag Tree

.parent returns a node's direct parent; .parents is an iterator over all of its ancestors.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
4.8.1 parent
The parent of the a tag is the p tag:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.a.parent)
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
4.8.2 parents
Prints all ancestors of the a tag: (0) the p tag, (1) body, (2) html, (3) the whole document (the last two print the same text, because the final entry is the BeautifulSoup object representing the entire document).
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(list(enumerate(soup.a.parents)))
[(0, <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>), (1, <body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>)]
4.9 Siblings: Traversing Sideways in the Tag Tree

.next_siblings and .previous_siblings are iterators over the siblings that follow and precede a node.
>>> html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
4.9.1 next_siblings: all following siblings
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(list(enumerate(soup.a.next_siblings)))
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
4.9.2 previous_siblings: all preceding siblings
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(list(enumerate(soup.a.previous_siblings)))
[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
4.10 Summary
The methods above are rarely enough on their own, because real matching usually also involves attributes and other conditions.


5. Beautiful Soup Standard Selectors
5.1 find_all(name, attrs, recursive, string, **kwargs)
- name: a string matched against tag names
- attrs: a string or dictionary matched against tag attribute values; attributes can also be searched by keyword
- recursive: whether to search all descendants; defaults to True
- string: a string matched against the text between <>…</>
Note:
- A document can be searched by tag name, attributes, or text content
- <tag>(..) is equivalent to <tag>.find_all(..), and soup(..) is equivalent to soup.find_all(..)
Note: the result is returned as a list
>>> html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
5.1.1 name
The basic form: search by tag name; the result is a list
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.find_all('ul'))
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
>>> print(type(soup.find_all('ul')[0]))
<class 'bs4.element.Tag'>
Data can also be extracted by iterating over the results:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> for ul in soup.find_all("ul"):
        print(ul.find_all("li"))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
5.1.2 attrs
>>> html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
Attributes are searched by passing a dictionary:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.find_all(attrs={'id': 'list-1'}))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
>>> print(soup.find_all(attrs={'name': 'elements'}))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
Common attributes can also be passed directly as keyword arguments instead of a dictionary, for example id. class is special because it is a Python keyword, so an underscore is appended: class_.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.find_all(id='list-1'))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
>>> print(soup.find_all(class_='element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
5.1.3 string
Two of the li tags contain the text Foo. The result is the matched strings themselves rather than the tags, so this is useful for matching text content but less convenient for locating elements. (The example below uses the older text parameter; since bs4 4.4 the string parameter works the same way.)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.find_all(text='Foo'))
['Foo', 'Foo']
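For reference, a minimal equivalent call with the string parameter (available since bs4 4.4):
>>> print(soup.find_all(string='Foo'))
['Foo', 'Foo']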
5.2 find(name, attrs, recursive, text, **kwargs)
find takes the same arguments as find_all; find returns a single element (the first match), while find_all returns all of them.
>>> html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.find('ul'))
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
>>> print(type(soup.find('ul')))
<class 'bs4.element.Tag'>
>>> print(soup.find('page'))  # a tag that does not exist
None
5.3 find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
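A minimal sketch of both methods, reusing the soup built from the html snippet in 5.2 (outputs are what bs4 with the lxml parser is expected to produce):
>>> li = soup.find('li')                          # the first <li>Foo</li>
>>> li.find_parent('ul')['id']                    # nearest ul ancestor
'list-1'
>>> [d['class'] for d in li.find_parents('div')]  # all div ancestors, nearest first
[['panel-body'], ['panel']]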
5.4 find_next_siblings() and find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
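A minimal sketch on the same soup:
>>> first = soup.find('li')            # <li class="element">Foo</li> in #list-1
>>> first.find_next_sibling('li')
<li class="element">Bar</li>
>>> first.find_next_siblings('li')
[<li class="element">Bar</li>, <li class="element">Jay</li>]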
5.5 find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
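A matching sketch for the preceding siblings, again on the soup from 5.2:
>>> last = soup.find_all('li')[-1]     # the final <li>Bar</li> in #list-2
>>> last.find_previous_sibling('li')
<li class="element">Foo</li>
>>> last.find_previous_siblings('li')
[<li class="element">Foo</li>]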
5.6 find_all_next() and find_next()
find_all_next() returns all matching nodes that come after the current node in document order; find_next() returns the first of them.
5.7 find_all_previous() and find_previous()
find_all_previous() returns all matching nodes that come before the current node in document order; find_previous() returns the first of them.
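A combined sketch of 5.6 and 5.7 on the soup from 5.2; unlike the sibling methods, these walk every node that comes before or after in document order, not only siblings:
>>> h4 = soup.find('h4')
>>> h4.find_next('li')                 # first <li> anywhere after <h4>
<li class="element">Foo</li>
>>> len(h4.find_all_next('li'))        # every <li> after <h4>
5
>>> soup.find('ul', id='list-2').find_previous('h4')
<h4>Hello</h4>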
6. CSS Selectors
- Passing a CSS selector to select() is all that is needed to make a selection
- Tag names are written as-is, class names are prefixed with a dot, and ids are prefixed with #; soup.select() filters elements this way and returns a list
- When a class or id and its tag refer to the same node, do not put a space between them, otherwise nothing will match
- Reference: https://www.cnblogs.com/yizhenfeng168/p/6979339.html
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.select('.panel .panel-heading'))
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
>>> print(soup.select('ul li'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
>>> print(soup.select('#list-2 .element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>]
>>> print(type(soup.select('ul')[0]))
<class 'bs4.element.Tag'>
select() can also be called on individual tags while iterating over the results:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> for ul in soup.select('ul'):
        print(ul.select('li'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
6.1 Getting Attributes
There are two ways, and they give the same result:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> for ul in soup.select('ul'):
        print(ul['id'])        # method 1: dictionary-style access on the tag
        print(ul.attrs['id'])  # method 2: via the attrs dictionary
list-1
list-1
list-2
list-2
6.2 Getting Text Content
get_text() returns the text inside a tag:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> for li in soup.select('li'):
        print(li.get_text())
Foo
Bar
Jay
Foo
Bar
7. Summary
- Prefer the lxml parser; fall back to html.parser when necessary
- Tag selection (soup.<tag>) offers limited filtering but is fast
- Use find() and find_all() to match a single result or multiple results
- If you are comfortable with CSS selectors, use select()
- Remember the common ways of getting attributes and text values