BeautifulSoup4 Parser (CSS Selectors)

Author: IT的咸鱼 | Published: 2018-10-24 13:39


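CSS selectors: BeautifulSoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse HTML/XML and extract data from it.

lxml traverses the document only locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and builds the complete DOM tree, so its time and memory overhead is noticeably higher and its performance is generally lower than lxml's.
Beautiful Soup makes parsing HTML fairly simple and has a very friendly API. It supports CSS selectors and the HTML parser in the Python standard library, as well as lxml's parsers, including its XML parser.
Beautiful Soup 3 is no longer developed, so new projects should use Beautiful Soup 4. It can be installed with pip: pip3 install beautifulsoup4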

Scraping tool	Speed	Ease of use	Installation
Regular expressions	Fastest	Difficult	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate

Example: first, import the bs4 library.

# beautifulsoup4_test.py

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object (no parser is named here)
soup = BeautifulSoup(html)

# The object can also be created from a local HTML file
# soup = BeautifulSoup(open('index.html'))

# Pretty-print the contents of the soup object
print(soup.prettify())

Output:

<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
    The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>

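If you see a warning like this when running the script:

[Screenshot of the parser warning]

it means that no parser was specified explicitly, so Beautiful Soup picked the best HTML parser available on this system ("lxml"). If the same code runs on another system, or in a different virtual environment, a different parser may be used and the behaviour may differ.
To avoid this, name the parser explicitly with soup = BeautifulSoup(html, "lxml"); if lxml is not available, use soup = BeautifulSoup(html, "html.parser").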
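The fallback described above can be sketched as follows. This is a minimal sketch, not the article's own code: the helper name make_soup is hypothetical, and it assumes that bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed here; fall back to the standard library parser
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

Because the parser is always named explicitly, the warning is not emitted on either path.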

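The four kinds of objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: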

Tag

NavigableString

BeautifulSoup

Comment

1. Tag

Put simply, a Tag is one of the tags in the HTML, for example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The HTML tags above (title, head, a, p and so on), together with the content they enclose, are Tags. Let's use Beautiful Soup to get some of them:

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the Beautiful Soup object
soup = BeautifulSoup(html,'lxml')
print(soup.title)
# <title>The Dormouse's story</title>
print(soup.head)
# <head><title>The Dormouse's story</title></head>
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
print(type(soup.p))
# <class 'bs4.element.Tag'>

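Writing soup followed by a tag name is an easy way to get these tags; the resulting objects have the type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. Querying for all matching tags is covered later.
A Tag has two important attributes: name and attrs.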

print(soup.name)
# [document]  # the soup object itself is special: its name is [document]

print(soup.head.name)
# head  # for any other tag, the value is the tag's own name

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dictionary.

print(soup.p['class'])  # or: soup.p.get('class')
# ['title']  # get() takes the attribute name; both give the same result when the attribute exists

soup.p['class'] = "newClass"
print(soup.p)  # attributes and content can be modified like this
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
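The two ways of reading an attribute differ only when the attribute is missing. The following is a small illustrative sketch on a one-line fragment of the sample document, parsed with html.parser so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p["class"])        # class is multi-valued, so a list is returned
print(p["name"])         # other attributes come back as plain strings
print(p.get("id"))       # missing attribute: get() returns None
print(p.has_attr("id"))  # explicit existence check

try:
    p["id"]              # missing attribute: indexing raises KeyError
except KeyError:
    print("no id attribute")
```

When an attribute may be absent, get() or has_attr() avoids the exception.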

2. NavigableString

Now that we have the tag, how do we get the text inside it? Simply use .string, for example:

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
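One caveat worth illustrating: .string only yields a value when a tag contains a single piece of text (possibly nested inside single children). For a tag with several children it is None, and get_text() is the usual alternative. The sketch below uses a shortened version of the sample document and html.parser, both chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

title_p, story_p = soup.find_all("p")

print(title_p.string)  # p -> b -> text is a single chain, so the text is found
print(story_p.string)  # several children, so .string is None
print(story_p.get_text(" ", strip=True))  # joins every text piece instead
```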

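3. BeautifulSoup

A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a special kind of Tag; we can look at its type, name and attributes to get a feel for it.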

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

print(soup.attrs)  # the document itself has no attributes
# {}

4. Comment

A Comment object is a special kind of NavigableString; when printed, its content does not include the comment markers.

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.a.string)
#  Elsie 

print(type(soup.a.string))
# <class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we print it with .string the comment markers have already been removed.
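Since a comment printed through .string looks like ordinary text, it can help to check the type before using the value. A minimal sketch, using two links from the sample document and html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = """
<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):
    text = a.string
    if isinstance(text, Comment):
        # Comment is a NavigableString subclass, so test for it first
        print(a["id"], "holds a comment:", text.strip())
    else:
        print(a["id"], "holds text:", text)
```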

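CSS selectors

In CSS, a tag name is written with no decoration, a class name is prefixed with '.', and an id is prefixed with '#'.
Elements can be filtered here in a similar way with soup.select(), which returns a list.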

(1) Find by tag name

print(soup.select('title'))
#[<title>The Dormouse's story</title>]

print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
#[<b>The Dormouse's story</b>]

(2) Find by class name

print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Find by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

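(4) Combined lookups

Combined lookups follow the same rules as combining tag names, class names and ids in a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two with a space:

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

To find direct children only, separate the two with >

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]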
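The difference between the space (any descendant) and > (direct child) is easier to see when one link is nested one level deeper. In the sketch below the sample document is altered for illustration: the span wrapper around the first link is an assumption, not part of the original HTML.

```python
from bs4 import BeautifulSoup

# The <span> around link1 is added here only to show the difference
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><span><a id="link1" class="sister" href="http://example.com/elsie">Elsie</a></span>
<a id="link2" class="sister" href="http://example.com/lacie">Lacie</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = [a["id"] for a in soup.select("p a")]    # any depth below p
children = [a["id"] for a in soup.select("p > a")]     # direct children of p only
same_node = [a["id"] for a in soup.select("a.sister#link2")]  # no spaces: one node

print(descendants)
print(children)
print(same_node)
```

Here "p a" matches both links, while "p > a" matches only link2, because link1 is a child of the span rather than of the p tag.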

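(5) Find by attribute

A lookup can also include attributes, which must be enclosed in square brackets. Because the attribute and the tag belong to the same node, there must be no space between them; otherwise nothing will match.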

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Attributes can likewise be combined with the lookups above: parts that belong to different nodes are separated by a space, parts that belong to the same node are not.

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
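Besides exact matches, the standard CSS attribute operators can be used in the same bracket syntax. The sketch below assumes a reasonably recent Beautiful Soup 4 release; the helper name ids is hypothetical and only shortens the output.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Return the id of every element matched by selector."""
    return [tag["id"] for tag in soup.select(selector)]


prefix = ids('a[href^="http://example.com/"]')  # href starts with the value
suffix = ids('a[href$="tillie"]')               # href ends with the value
substring = ids('a[href*="lac"]')               # href contains the value
has_attr = ids('p a[id]')                       # attribute exists, any value

print(prefix, suffix, substring, has_attr)
```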

(6) Getting the content

All of the select calls above return a list, so the results can be iterated over and get_text() used to get the content of each element.

soup = BeautifulSoup(html, 'lxml')
print (type(soup.select('title')))
print (soup.select('title')[0].get_text())

for title in soup.select('title'):
    print (title.get_text())

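# response is assumed to be an HTTP response fetched earlier
# (for example with requests.get); it is not defined in this article
result = BeautifulSoup(response.text, 'lxml')
# print(result)
tr = result.select('a')
# print(tr)
for i in tr:
    print(i.get("href"))  # get the href attribute (None if the tag has none)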
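A self-contained variant of the link extraction above, run on a fragment of the sample document instead of a downloaded page. It also shows select_one(), which returns the first match or None rather than a list. The anchor without an href is added here for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("a.sister")     # first match only
missing = soup.select_one("a.brother")  # no match: None, not an empty list

# a[href] keeps only the anchors that actually carry an href attribute,
# so indexing with ["href"] cannot raise KeyError here
links = [a["href"] for a in soup.select("a[href]")]

print(first.get_text())
print(missing)
print(links)
```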
