[spider] Extracting Web Page Content with bs4

Author: Franckisses | Published: 2019-03-08 09:16


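Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.
This post is a brief introduction to using Beautiful Soup.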

Install the library (the package on PyPI is named beautifulsoup4; it is imported as bs4):

pip install beautifulsoup4

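bs4 parsers: usage, advantages, and drawbacks

Only the lxml parser is covered here, because it is convenient, reasonably fast, and not hard to use. If you work in an Anaconda environment, lxml is already installed and can be used directly.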
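
Since lxml is a separate install outside Anaconda, a small fallback can be handy. The following is only a sketch: make_soup is a hypothetical helper name, not part of bs4, and it assumes that falling back to Python's built-in html.parser is acceptable for your pages.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper for this sketch, not a bs4 function.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup_demo = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup_demo.b.string)
```

The two parsers can build slightly different trees for broken or partial markup (lxml wraps fragments in html and body tags), so it is worth sticking to one parser within a project.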

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

1. Initialization: the constructor takes the document as a string.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")

2. Pretty-printing the markup

# prettify() returns the markup re-indented so that the nesting is easy to see
print(soup.prettify())

3. Tag objects: soup.a or soup.p returns the first <a> or <p> tag in the document

print(type(soup.a))
# the search stops at the first match
print(soup.a)
# Result:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
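
One detail worth knowing about attribute-style access is what happens when the tag does not exist. A minimal sketch, using a shortened snippet of the sample document and the built-in html.parser (chosen here only so it runs without lxml):

```python
from bs4 import BeautifulSoup

snippet = '<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a></p>'
demo = BeautifulSoup(snippet, "html.parser")

first_link = demo.a      # first <a> tag in document order
print(first_link.name)   # the tag's name as a string

missing = demo.table     # no <table> in the snippet, so this is None
if missing is None:
    print("no <table> found")
```

Because a missing tag yields None rather than raising an error, chained access such as demo.table.tr fails with AttributeError, so a None check is a good idea on real pages.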

4. Getting the text content:

title_text = soup.title.string
title_text1 = soup.title.get_text()
title_text2 = soup.title.text

print(title_text)
print(title_text1)
print(title_text2)
# Result:
The Dormouse's story
The Dormouse's story
The Dormouse's story
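
The three accessors agree for title because it holds a single text node. They differ once a tag mixes text and child tags, as this small sketch shows (the markup is made up for illustration; html.parser is used only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

only_child = demo.b.string     # <b> has exactly one text child
mixed = demo.p.string          # <p> has text plus a tag, so .string is None
full_text = demo.p.get_text()  # joins all nested text

print(only_child)
print(mixed)
print(full_text)
```

In practice get_text() (or its alias .text) is the safer choice when the structure of the tag is not known in advance.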

5. Selecting attributes

b = soup.a.attrs
print(b['href'])
print(b['class'][0])
print(b['id'])
# Result:
http://example.com/elsie
sister
link1
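
Attributes can also be read directly from the tag, without going through attrs. A short sketch, assuming a link like the one in the sample document (the extra class name big is added here purely for illustration):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",  # built-in parser, used so the sketch runs without lxml
)
link = demo.a

href = link["href"]         # dictionary-style access; KeyError if absent
classes = link["class"]     # class is multi-valued, so this is a list
title = link.get("title")   # .get() returns None for a missing attribute
has_id = link.has_attr("id")

print(href, classes, title, has_id)
```

This is why the example above indexes b['class'][0]: bs4 returns the class attribute as a list of class names rather than a single string.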

6. Selecting multiple tags

# find_all returns a list containing every <a> tag
c = soup.find_all('a')
print(c)
# Result:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

c = soup.find_all(name='a',attrs={'href':'http://example.com/tillie'})
print(c)
# Result:
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
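
find_all accepts a few more filters that are useful in scraping. The sketch below shows class_, limit, and the single-result variant find, on a trimmed copy of the sample links (html.parser is used only to keep the snippet dependency-free):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<p><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a></p>',
    "html.parser",
)

# class_ (with trailing underscore) filters by CSS class; limit caps the result count
first_two = demo.find_all("a", class_="sister", limit=2)
names = [a.get_text() for a in first_two]

lacie = demo.find("a", id="link2")    # find returns the first match only
nothing = demo.find("a", id="link9")  # and None when nothing matches

print(names, lacie, nothing)
```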

7. The contents attribute

# contents returns a list of the element's direct children, both tags and text nodes
print(soup.p.contents)
# Result:
[<b>The Dormouse's story</b>]
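
The first p holds only a b tag, so the list above has a single entry. With mixed content the list interleaves text nodes and tags, as in this sketch on made-up markup (html.parser is used only so it runs without lxml):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>One <b>two</b> three</p>", "html.parser")

parts = demo.p.contents                          # direct children only
kinds = [type(node).__name__ for node in parts]  # class name of each child
all_text = list(demo.p.strings)                  # every text node, at any depth

print(parts)
print(kinds)
print(all_text)
```

Checking the node type matters when looping over contents, because text nodes do not support tag methods such as find_all.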

8. Parent and ancestor nodes: parent and parents

p_parent = soup.p.parent
print(p_parent)
# Result:
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>

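p_parents = soup.p.parents
# parents is a generator, so printing it shows the generator object, not the nodes
print(p_parents)
# enumerate the ancestors to list them, nearest first
print(list(enumerate(p_parents)))

# Sibling nodes
# next_sibling: the next sibling node
# previous_sibling: the previous sibling node
# next_siblings: all following sibling nodes
# previous_siblings: all preceding sibling nodes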
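
The ancestor and sibling attributes listed above can be tried on a small document. This is a sketch on simplified markup, not the full sample, and it uses html.parser only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<html><body><p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p></body></html>',
    "html.parser",
)
first = demo.a

# Walk upward from the link; the last entry is the BeautifulSoup object itself
ancestors = [tag.name for tag in first.parents]

raw_next = first.next_sibling             # the text node between the two links
next_link = first.find_next_sibling("a")  # skips text nodes, returns the next <a>
prev_link = next_link.find_previous_sibling("a")

print(ancestors)
print(repr(raw_next))
print(next_link, prev_link)
```

Note that next_sibling often returns whitespace or punctuation text rather than a tag, which is why find_next_sibling with a tag name is usually more convenient.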

9. CSS selectors

# 'p a' matches every <a> tag nested inside a <p> tag
print(soup.select('p a'))
# Result:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# '.sister' matches every tag whose class is sister
print(soup.select('.sister'))
# Result:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
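
A few more selector forms, sketched on a trimmed copy of the sample links. This assumes a Beautiful Soup release that ships with the soupsieve package, which provides select and select_one; html.parser is used only so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    '</p>',
    "html.parser",
)

by_id = demo.select_one("#link1")            # first match, or None
by_attr = demo.select('a[href$="tillie"]')   # href ending in "tillie"
children = demo.select("p.story > a")        # direct <a> children of p.story
missing = demo.select_one("#link9")          # no such id in this snippet

print(by_id.get_text())
print([a["id"] for a in by_attr])
print(len(children), missing)
```

select always returns a list, even for a single match, while select_one returns the tag itself.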