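Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document, and it can save you hours or even days of work.
This post is a brief introduction to using Beautiful Soup.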
Install the library:
pip install beautifulsoup4
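Parsers supported by bs4 and their pros and cons
Only the lxml parser is covered here, because it is convenient and strikes a reasonable balance between parsing speed and ease of use. If you are working in an Anaconda environment, lxml is already installed and can be used directly.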
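The parser is selected by the second argument to BeautifulSoup. As a minimal sketch, assuming lxml may be missing outside Anaconda, the helper below (make_soup is a hypothetical name, not part of bs4) tries lxml first and falls back to Python's built-in html.parser:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml, falling back to the built-in html.parser.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Beautiful Soup raises FeatureNotFound when the requested
        # parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

The two parsers can build slightly different trees for malformed markup, so it is worth pinning one parser explicitly once a script depends on the exact structure.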
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
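1. Initialization. The first argument is the markup as a string; the second names the parser.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")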
2. Pretty-print the markup
print(soup.prettify())
# prettify() puts each tag on its own line and indents it to reflect the nesting.
3. Tag instances: soup.a or soup.p returns the first <a> or <p> tag in the document
print(type(soup.a))
# The search stops at the first match
print(soup.a)
# Result:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
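Attribute-style access is a shortcut for find(). The sketch below, which assumes a small made-up snippet and the built-in html.parser so that it runs without lxml, shows that soup.a and soup.find("a") return the same object and that a tag that does not occur yields None:

```python
from bs4 import BeautifulSoup

# Illustrative markup, parsed with the built-in parser (an assumption made
# here so the example has no lxml dependency).
html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a           # first <a> in document order
same = soup.find("a")    # the explicit equivalent
missing = soup.table     # there is no <table>, so this is None

print(first.name)        # tag name
print(first["id"])       # id of the first match
print(first is same)
print(missing)
```

Because a missing tag gives None, chained access such as soup.table.tr raises AttributeError, so it is safer to check for None before going deeper.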
4. Get the text content:
title_text = soup.title.string
title_text1 = soup.title.get_text()
title_text2 = soup.title.text
print(title_text)
print(title_text1)
print(title_text2)
# Result:
The Dormouse's story
The Dormouse's story
The Dormouse's story
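The three accessors agree for <title> because it has a single text child. They differ once a tag mixes text and child tags, as this small sketch shows (the markup is made up for illustration and parsed with the built-in html.parser):

```python
from bs4 import BeautifulSoup

# Illustrative markup: <p> holds a text node and a <b> child.
html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

only_child = soup.b.string       # <b> has exactly one text child
mixed = soup.p.string            # <p> has two children, so .string is None
full_text = soup.p.get_text()    # concatenates every text node below <p>
joined = soup.p.get_text(separator="|", strip=True)

print(only_child)
print(mixed)
print(full_text)
print(joined)
```

In practice get_text() (or its alias .text) is the safer choice when the structure of the tag is not known in advance.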
5. Select attributes
b = soup.a.attrs
print(b['href'])
print(b['class'][0])
print(b['id'])
# Result:
http://example.com/elsie
sister
link1
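Attributes can also be read directly from the tag without going through .attrs. A minimal sketch, assuming a single made-up <a> tag and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Illustrative markup with a two-valued class attribute.
html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]            # dictionary-style access; KeyError if absent
classes = link["class"]        # class is multi-valued, so this is a list
title = link.get("title")      # .get() returns None for a missing attribute
has_id = link.has_attr("id")

print(href)
print(classes)
print(title)
print(has_id)
```

This is why the original code indexes b['class'][0]: class is returned as a list even when it holds one value.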
6. Select multiple tags
# find_all returns a list containing every <a> tag
c = soup.find_all('a')
print(c)
# Result:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
c = soup.find_all(name='a',attrs={'href':'http://example.com/tillie'})
print(c)
# Result:
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
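find_all accepts several other filters besides name and attrs. The sketch below, which reuses a shortened copy of the three links and assumes the built-in html.parser, filters by class, by id, by a regular expression, and with a result limit:

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the example links, for illustration only.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_.
by_class = soup.find_all("a", class_="sister")
# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)
# Any attribute can be used as a keyword filter.
by_id = soup.find_all(id="link2")
# A compiled regular expression is matched against the attribute value.
by_regex = soup.find_all("a", href=re.compile(r"/(elsie|tillie)$"))

print([a["id"] for a in by_class])
print([a["id"] for a in first_two])
print([a["id"] for a in by_id])
print([a["id"] for a in by_regex])
```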
7. The contents attribute
# Returns a list of the element's direct children, including both tags and text nodes.
print(soup.p.contents)
# Result:
[<b>The Dormouse's story</b>]
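The first <p> has only one child, so the mix of tags and text is easier to see on a paragraph with several. A minimal sketch, assuming a made-up snippet and the built-in html.parser, compares .contents with .children, .descendants, and .strings:

```python
from bs4 import BeautifulSoup, Tag

# Illustrative markup: text nodes interleaved with two <a> tags.
html = "<p>Once <a>Elsie</a>, <a>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

direct = p.contents                       # list of direct children
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
descendant_count = len(list(p.descendants))  # children plus their children
texts = list(p.strings)                   # every text node below <p>

print(direct)
print(child_tags)
print(descendant_count)
print(texts)
```

.contents is a real list, while .children, .descendants, and .strings are iterators, which is why they are wrapped in list() or a comprehension here.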
8. Parent and ancestor nodes: parent, parents
p_parent = soup.p.parent
print(p_parent)
# Result:
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
p_parents = soup.p.parents
print(p_parents)
print(list(enumerate(p_parents)))
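Note that .parents is a generator, so printing it directly shows only the generator object; it has to be iterated to see the ancestors. A small sketch, assuming a reduced document and the built-in html.parser, collects the ancestor tag names from the innermost outward:

```python
from bs4 import BeautifulSoup

# Reduced version of the example document, for illustration only.
html = "<html><body><p><b>The Dormouse's story</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

b = soup.b
direct_parent = b.parent.name
# Walk upward: nearest ancestor first, ending at the BeautifulSoup object,
# whose name is the special string "[document]".
ancestor_names = [parent.name for parent in b.parents]

print(direct_parent)
print(ancestor_names)
```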
# Finding sibling nodes
# next_sibling: the next sibling node
# previous_sibling: the previous sibling node
# next_siblings: all following sibling nodes
# previous_siblings: all preceding sibling nodes
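The sibling attributes listed above can be sketched as follows, assuming a one-line copy of the three links and the built-in html.parser. The point to watch is that text nodes count as siblings too, so next_sibling of a tag is often a string rather than the next tag:

```python
from bs4 import BeautifulSoup, Tag

# One-line copy of the example links, for illustration only.
html = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", id="link1")
last = soup.find("a", id="link3")

next_node = first.next_sibling            # the text node ", "
next_tag = first.find_next_sibling("a")   # skips text, returns the next <a>
no_previous = first.previous_sibling      # first child, so None

# The plural forms are generators; keep only Tag objects to drop the text.
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
earlier_ids = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]

print(repr(str(next_node)))
print(next_tag["id"])
print(no_previous)
print(later_ids)
print(earlier_ids)
```

previous_siblings walks backward from the starting node, so the nearest sibling comes first in earlier_ids.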
9. CSS selectors
print(soup.select('p a'))
# Result:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('.sister'))
# Result:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
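A few more selector forms may be useful. The sketch below assumes a shortened copy of the links and the built-in html.parser, and shows an id selector with select_one, an attribute-suffix selector, and a direct-child selector:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example links, for illustration only.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list.
by_id = soup.select_one("#link1")
# [href$="..."] matches attribute values that end with the given text.
by_suffix = soup.select('a[href$="tillie"]')
# ">" restricts the match to direct children of p.story.
direct_children = soup.select("p.story > a")

print(by_id.get_text())
print([a["id"] for a in by_suffix])
print(len(direct_children))
```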