28. Searching with BeautifulSoup

Author: 魔方宫殿 | Published: 2022-04-11 23:14


Life is short, you need Python!

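Previously:

Tag names
.contents and .children
.parent
.next_sibling and .previous_sibling

The last episode covered getting a target Tag by name and navigating the document tree.
This episode covers searching the document tree.

Beautiful Soup defines many search methods; this episode focuses on two of them: find() and find_all().
Once again, the "Alice" document serves as the example: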

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

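1. Filters
Before looking at find_all(), it helps to know the kinds of filters you can pass to it, because they appear throughout the search API. A filter can be applied to a tag's name, to its attributes, to the text of a string, or to a combination of these.

1.1 Strings
The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tag names against that exact string. The following example finds all the <b> tags in the document: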

soup.find_all('b')
# [<b>The Dormouse's story</b>]

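If you pass a byte string, Beautiful Soup assumes it is encoded as UTF-8. To avoid decoding problems, pass a Unicode string instead.

1.2 Regular expressions
If you pass a regular expression object, Beautiful Soup filters against it using its search() method. The following example finds all tags whose names start with "b"; in this document that means the <body> tag and the <b> tag: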

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

The following code finds all tags whose names contain the letter "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title
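
Because the pattern is applied with search(), an unanchored expression matches anywhere in the tag name. The sketch below illustrates how anchors change the result. The miniature document and the variable names are assumptions made up for this illustration; they are not part of the "Alice" example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature document, used only for this illustration.
doc = "<html><head><title>T</title></head><body><b>x</b><table></table></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Unanchored: matches any tag name that contains "t" anywhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored at the start: only names that begin with "t".
starts_with_t = [tag.name for tag in soup.find_all(re.compile("^t"))]

# Anchored at both ends: only the exact name "b" (so "body" is excluded).
exactly_b = [tag.name for tag in soup.find_all(re.compile("^b$"))]

print(contains_t)     # ['html', 'title', 'table']
print(starts_with_t)  # ['title', 'table']
print(exactly_b)      # ['b']
```

If you want an exact name, a plain string filter is usually simpler than an anchored pattern.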

1.3 Lists
If you pass a list, Beautiful Soup returns everything that matches any item in that list. The following code finds all the <a> tags and all the <b> tags:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

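1.4 Functions
If none of the other filters fits, you can define a function that takes an element as its only argument. The function should return True if the element matches and False otherwise.
The following function returns True if a tag defines the class attribute but not the id attribute: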

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function to find_all() and you get all the <p> tags:

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

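The result contains only <p> tags. The <a> tags are left out because they define "id" as well as "class", and <html> and <head> are left out because they do not define "class".

When you pass a function to filter on a specific attribute, the argument passed to the function is the attribute's value, not the whole tag. The following example finds the <a> tags whose href attribute does not match a regular expression: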

def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
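
As a further illustration of attribute-value functions, here is a minimal sketch that keeps only links whose href starts with "https://". The document, the URLs, and the function name is_https are hypothetical. The guard against None mirrors the `href and ...` check in not_lacie, on the assumption that the function may be handed None for tags that lack the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical document: two links with href, one anchor without it.
doc = (
    '<a href="https://example.com/a">A</a>'
    '<a href="http://example.com/b">B</a>'
    '<a name="top">C</a>'
)
soup = BeautifulSoup(doc, "html.parser")

def is_https(href):
    # The argument is the attribute value, not the tag.
    # Guard against None so tags without href never match.
    return href is not None and href.startswith("https://")

secure_links = soup.find_all("a", href=is_https)
print([a.get_text() for a in secure_links])  # ['A']
```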

A filter function can be as complicated as you need. The following one matches tags that are surrounded by string objects:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# body
# p
# a
# a
# a
# p

1.5 True
The value True matches every tag. The following code finds all the tags in the document, but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
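
To make the difference between tags and text strings concrete, the sketch below compares find_all(True) with find_all(string=True) on a tiny fragment. The fragment is a made-up example, not part of the "Alice" document.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two tags and three text strings.
soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

# True as the name filter: every tag, no strings.
tag_names = [tag.name for tag in soup.find_all(True)]

# True as the string filter: every text string, no tags.
texts = [str(s) for s in soup.find_all(string=True)]

print(tag_names)  # ['p', 'b']
print(texts)      # ['Hello ', 'world', '!']
```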

2. find_all()
find_all(name, attrs, recursive, string, limit, **kwargs)
The find_all() method looks through a tag's descendants and returns all of them that match your filters. Here are a few examples:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

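2.1 The name argument
Pass a value for name and Beautiful Soup considers only tags with matching names. Text strings are ignored, as are tags whose names do not match.
Basic usage: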

soup.find_all("title")
# [<title>The Dormouse's story</title>]

To repeat: the value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.

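2.2 Keyword arguments
Any keyword argument that is not one of the search method's own parameter names is treated as a filter on the tag attribute of that name. If you pass a value for an argument called id, Beautiful Soup filters against each tag's "id" attribute: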

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass a value for href, Beautiful Soup filters against each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

An attribute can be filtered with a string, a regular expression, a list, a function, or True.
The following example finds every tag that has an id attribute, whatever its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing more than one keyword argument filters on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression (exact message varies by Python version)

You can still search on such attributes by putting them in a dictionary and passing it to find_all() as the attrs argument:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
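
The attrs dictionary is also the way to search on an HTML attribute that happens to be called "name", because the keyword name is already taken by the tag-name filter of find_all(). The sketch below illustrates this with a hypothetical pair of <input> elements; the markup and variable names are assumptions for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical form fragment, used only for this illustration.
form_soup = BeautifulSoup(
    '<input name="email"/><input name="phone"/>', "html.parser"
)

# name="email" is interpreted as "tags called <email>", so nothing matches.
by_keyword = form_soup.find_all(name="email")

# The attrs dictionary filters on the HTML attribute called "name".
by_attrs = form_soup.find_all(attrs={"name": "email"})

print(by_keyword)                      # []
print([t["name"] for t in by_attrs])   # ['email']
```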

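2.3 Searching by CSS class
Searching for tags by CSS class is very useful, but the attribute name "class" is a reserved word in Python, so using class as a keyword argument is a syntax error. Since Beautiful Soup 4.1.2 you can search by CSS class with the keyword argument class_: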

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Like any other keyword argument, class_ accepts a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

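A tag's class attribute can hold several values. When you search by CSS class, the filter is matched against each of the tag's class names individually: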

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

But an exact match fails if the class names are given in a different order than in the document:

css_soup.find_all("p", class_="strikeout body")
# []
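
When the order of the class names should not matter, one possible approach is a filter function that compares sets of class names. The sketch below is only an illustration: has_classes is a hypothetical helper, not part of Beautiful Soup, and the two-paragraph document is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical document: only the first <p> carries both classes.
css_soup = BeautifulSoup(
    '<p class="body strikeout"></p><p class="body"></p>', "html.parser"
)

def has_classes(*wanted):
    """Hypothetical helper: build a filter matching tags that carry all wanted classes."""
    wanted = set(wanted)

    def matcher(tag):
        # For a multi-valued attribute such as class, tag.get() returns a
        # list of class names, or None when the attribute is absent.
        return wanted.issubset(tag.get("class") or [])

    return matcher

exact_wrong_order = css_soup.find_all("p", class_="strikeout body")
any_order = css_soup.find_all(has_classes("strikeout", "body"))

print(exact_wrong_order)  # []
print(any_order)          # [<p class="body strikeout"></p>]
```

Beautiful Soup's select() method with a CSS selector such as "p.strikeout.body" is another option for matching several classes regardless of order.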

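2.4 The string argument
The string argument searches for text strings instead of tags. As with name and the keyword arguments, it accepts a string, a regular expression, a list, a function, or True: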

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

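Although string is meant for finding strings, you can combine it with arguments that find tags. Beautiful Soup then returns the tags whose .string matches your value for string. The following code finds the <a> tags whose .string is "Elsie":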

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
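
One consequence worth illustrating: a tag with more than one child has a .string of None, so combining a tag name with string= will not match it even when the text appears inside. The sketch below shows this on a hypothetical two-paragraph fragment and, as one possible workaround, filters on get_text() with a function instead.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: the first <p> has two children, the second has one.
soup = BeautifulSoup("<p>Hello <b>world</b></p><p>Hello again</p>", "html.parser")
first, second = soup.find_all("p")

# .string is only defined when a tag has a single string child.
print(first.string)   # None
print(second.string)  # Hello again

# string= is compared with .string, so only the second <p> matches.
by_string = soup.find_all("p", string=re.compile("Hello"))

# A function filter on the combined text matches both paragraphs.
by_text = soup.find_all(lambda tag: tag.name == "p" and "Hello" in tag.get_text())

print(len(by_string))  # 1
print(len(by_text))    # 2
```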

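2.5 The limit argument
find_all() returns every result that matches, which can be slow on a large document. If you do not need all the results, pass a number for limit. It works like the LIMIT keyword in SQL: Beautiful Soup stops searching once it has found that many results.

Three tags in the document match the search, but only two are returned because of the limit: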

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

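2.6 The recursive argument
When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants: its children, its children's children, and so on. To consider only direct children, pass recursive=False.

Take this fragment of the document: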

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

Here are the results with and without recursive=False:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

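The <title> tag is beneath the <html> tag, but it is not a direct child; <head> is. Beautiful Soup finds <title> when it is allowed to look at all descendants of <html>, but when recursive=False restricts the search to direct children, it finds nothing.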
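
To see which tags count as direct children, the following sketch lists the results of find_all(True) with and without recursive=False. It uses a compact, whitespace-free variant of the document as an assumption, so the output is not cluttered by newline strings.

```python
from bs4 import BeautifulSoup

# Compact variant of the example document (an assumption for this sketch).
doc = "<html><head><title>The Dormouse's story</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Direct children of <html> only.
direct = [tag.name for tag in soup.html.find_all(True, recursive=False)]

# All descendants of <html>.
descendants = [tag.name for tag in soup.html.find_all(True)]

# <title> is a direct child of <head>, so this search does find it.
title_in_head = soup.head.find_all("title", recursive=False)

print(direct)              # ['head', 'body']
print(descendants)         # ['head', 'title', 'body', 'p']
print(len(title_in_head))  # 1
```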

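2.7 Shorthand for find_all()
Because find_all() is probably the most used method in the Beautiful Soup search API, there is a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, the result is the same as calling find_all() on that object. These two lines are equivalent: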

soup.find_all("a")
soup("a")

So are these two:

soup.title.find_all(string=True)
soup.title(string=True)

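3. find()
find(name, attrs, recursive, string, **kwargs)
find_all() returns every tag that matches, but sometimes you only want one result. If you know a document has only one <body> tag, scanning the whole document for more is a waste of time. Rather than passing limit=1 every time you call find_all(), you can use find(). These two lines are nearly equivalent: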

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, while find() returns the result itself.

When nothing matches, find_all() returns an empty list and find() returns None:

print(soup.find("nosuchtag"))
# None
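
Because find() can return None, chaining another call onto a missing result raises AttributeError. The sketch below shows one way to guard against that. The helper text_of and its default parameter are hypothetical names invented for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>", "html.parser"
)

def text_of(soup, name, default=""):
    """Hypothetical helper: text of the first tag called `name`, or `default`."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else default

title_text = text_of(soup, "title")
missing_text = text_of(soup, "nosuchtag", default="(missing)")

# Chaining on a missing tag fails, because None has no find() method.
try:
    soup.find("nosuchtag").find("title")
    chained_ok = True
except AttributeError:
    chained_ok = False

print(title_text)    # The Dormouse's story
print(missing_text)  # (missing)
print(chained_ok)    # False
```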

soup.head.title is shorthand for navigating by tag name. Under the hood, that shorthand works by repeatedly calling find():

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>

Summary:

Filters
find_all()
find()

See you in the next episode.
