Python Learning Journey 07: Using BeautifulSoup

Author: b861a75d2a7d | Published: 2018-06-30 02:02


0. Parsers supported by Beautiful Soup

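Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	Included in the standard library; decent speed; lenient with malformed markup	Less lenient in versions before Python 2.7.3 and Python 3.2.2
lxml's HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed markup	Requires an external C library
lxml's XML parser	BeautifulSoup(markup, "xml")	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a web browser does; produces valid HTML5	Very slow; requires an external Python dependency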
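
The table suggests a practical pattern: prefer a fast parser and fall back to the built-in one when the faster one is not installed. The sketch below is one way to do that. The helper name make_soup is hypothetical (it is not part of Beautiful Soup); it relies on the FeatureNotFound exception that bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately malformed markup: the <b> tag is never closed.
soup, parser_used = make_soup("<p>Hello<b>world</p>")
print(parser_used)
print(soup.b.get_text())
```

Because "html.parser" ships with Python, the loop should always find at least one working parser.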

1. Installation

pip3 install beautifulsoup4
pip3 install lxml    # needed for the 'lxml' parser used in the examples below

2. Basic usage

2.1 Finding elements

html = """<div><title>I am an HTML document</title><p>I am a p tag</p></div>"""

# Import the BeautifulSoup class
from bs4 import BeautifulSoup
# Parse the HTML string
soup = BeautifulSoup(html, 'lxml')

# 1 Get the title tag
title = soup.title
print(title)
# Output: <title>I am an HTML document</title>

# 2 Get the p tag
p = soup.p
print(p)
# Output: <p>I am a p tag</p>
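
Attribute-style access such as soup.p returns only the first matching tag. The short sketch below contrasts it with find_all(), which collects every match. The markup is made up for illustration, and "html.parser" is used here only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

# Illustrative markup with two <p> tags.
html = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p             # attribute access stops at the first <p>
all_p = soup.find_all("p")   # find_all collects every <p>

print(first_p)     # <p>first</p>
print(len(all_p))  # 2
```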

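2.2 Getting attributes

html = ''' <p name = "hello">I am a p tag</p>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# 1 Index the tag with the attribute name in square brackets to read the p tag's name attribute
name_1 = soup.p['name']
print(name_1)
# Output: hello

# 2 Read the same attribute through the attrs dictionary
name_2 = soup.p.attrs['name']
print(name_2)
# Output: hello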
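
Both access styles above raise KeyError when the attribute is missing. A minimal sketch of the safer get() method, and of how the multi-valued class attribute comes back as a list (illustrative markup, parsed with "html.parser" so nothing extra needs installing):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p name="hello" class="intro lead">text</p>', "html.parser")
p = soup.p

print(p.get("name"))   # hello
print(p.get("id"))     # None: get() does not raise for a missing attribute
print(p["class"])      # ['intro', 'lead']: class is multi-valued, so it is a list

try:
    p["id"]            # square brackets raise KeyError when the attribute is absent
except KeyError:
    missing_id_raises = True
else:
    missing_id_raises = False
print(missing_id_raises)  # True
```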

2.3 Getting text content

html = ''' <p name = "hello">I am a p tag</p>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# 1 Use the string attribute to get the text of the p tag
text_1 = soup.p.string
print(text_1)
# Output: I am a p tag

# 2 Use the get_text() method to get the text of the p tag
text_2 = soup.p.get_text()
print(text_2)
# Output: I am a p tag
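
The two approaches agree only when the tag holds a single text node. A small sketch of where they diverge, using made-up markup and "html.parser" so it runs without extra installs:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

print(p.string)                     # None: <p> has two children (a text node and <b>)
print(p.get_text())                 # Hello world
print(p.get_text("|", strip=True))  # Hello|world
print(p.b.string)                   # world
```

get_text() concatenates every text node below the tag, which is usually what a scraper wants; string is stricter and is best kept for leaf tags.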

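2.4 Nested selection

# A <title> may only contain text, so a <div> is used as the outer tag here;
# whether a <p> nested inside <title> survives parsing depends on the parser.
html = ''' <div><p name = "hello">I am a p tag</p></div>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# 1 Use the string attribute to get the text of the p tag inside the div tag
text_1 = soup.div.p.string
print(text_1)
# Output: I am a p tag

# 2 Use the get_text() method to get the text of the p tag inside the div tag
text_2 = soup.div.p.get_text()
print(text_2)
# Output: I am a p tag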
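
Chained access works only while every step finds a tag. When a step finds nothing it returns None, and the next step raises AttributeError. A minimal sketch of guarding the chain, with illustrative markup parsed by "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>inner</p></div>", "html.parser")

span = soup.div.span  # there is no <span> inside <div>, so this is None
# Guard before calling a method on the result of a chained lookup.
span_text = span.get_text() if span is not None else ""
p_text = soup.div.p.get_text()

print(repr(span_text))  # ''
print(p_text)           # inner
```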

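3. Navigating related nodes

3.1 Children and descendants

html = """
    <ul>
        <li>Fruit menu
            <p class='banner'>Banana
                <a>Small banana</a></p></li></ul>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# 1 The contents attribute returns the direct children as a list.
#   Text nodes and tags are returned together, exactly as they appear.
text_1 = soup.ul.contents
print(text_1)
# Output (whitespace condensed): ['\n', <li>Fruit menu<p class="banner">Banana<a>Small banana</a></p></li>]

# 2 The children attribute also gives the direct children,
#   but as an iterator, so loop over it to print each node.
text_2 = soup.ul.children
for i in text_2:
    print(i)
# Output (whitespace condensed): <li>Fruit menu  <p class="banner">Banana  <a>Small banana</a></p></li>

# 3 The descendants attribute gives every descendant node.
#   It is a generator that walks the tree recursively, so nested nodes are included.
text_3 = soup.ul.descendants
for i in text_3:
    print(i)

# Output (whitespace condensed; the leading whitespace-only text node is omitted):
# <li>Fruit menu  <p class="banner">Banana  <a>Small banana</a></p></li>
# Fruit menu
# <p class="banner">Banana  <a>Small banana</a></p>
# Banana
# <a>Small banana</a>
# Small banana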
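
Because children and descendants include whitespace-only text nodes, it is common to keep only the tags. A short sketch, assuming illustrative markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.ul.children)
# Keep only Tag objects; NavigableString nodes (the newlines) are dropped.
tag_children = [c for c in soup.ul.children if isinstance(c, Tag)]
descendant_count = len(list(soup.ul.descendants))

print(len(all_children))               # 5: newline, li, newline, li, newline
print([t.name for t in tag_children])  # ['li', 'li']
print(descendant_count)                # 7: the 5 children plus the text inside each li
```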

3.2 Parent and ancestors

html = """
    <body>
        <p class="story">My first p tag
            <a href="http://www.baidu.com" class="sister" id="link1">Baidu</a>
            <a href="http://www.qq.com" class="sister" id="link2">Tencent</a>
        </p>
        <p class="story">My second p tag</p>
    </body>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# The parent attribute returns the direct parent node
text = soup.a.parent

# The parents attribute returns a generator over every ancestor, from the parent upward
text = soup.a.parents
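
Since parents is a generator, it has to be iterated to see anything. A minimal sketch that lists the ancestor names, using trimmed-down markup and "html.parser" (which does not add an html wrapper tag):

```python
from bs4 import BeautifulSoup

html = '<body><p class="story"><a id="link1">Baidu</a></p></body>'
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.a.parent.name
# Walk upward from the parent to the BeautifulSoup object itself.
ancestor_names = [node.name for node in soup.a.parents]

print(parent_name)     # p
print(ancestor_names)  # ['p', 'body', '[document]']
```

The last entry, '[document]', is the name of the BeautifulSoup object at the root of the tree.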

3.3 Siblings

html = """
    <body>
        <p class="story">My first p tag
            <a href="http://www.baidu.com" class="sister" id="link1">Baidu</a>
            <a href="http://www.qq.com" class="sister" id="link2">Tencent</a>
        </p>
        <p class="story">My second p tag</p>
    </body>"""
# Import the BeautifulSoup class
from bs4 import BeautifulSoup
# Parse the HTML string
soup = BeautifulSoup(html, 'lxml')

# next_sibling returns the node immediately after this one (it may be a text node)
text_1 = soup.a.next_sibling

# next_siblings yields every sibling that follows this node
text_2 = list(enumerate(soup.a.next_siblings))

# previous_sibling returns the node immediately before this one (it may be a text node)
text_3 = soup.a.previous_sibling

# previous_siblings yields every sibling that precedes this node
text_4 = list(enumerate(soup.a.previous_siblings))
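
A common surprise is that the nearest sibling is often a whitespace text node rather than a tag. The sketch below shows this with illustrative markup parsed by "html.parser", and uses find_next_sibling() to skip straight to the next tag:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Baidu</a>\n<a id="link2">Tencent</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling           # the newline between the two <a> tags
next_tag = first.find_next_sibling("a")  # skips text nodes, returns the next <a>
prev_node = first.previous_sibling       # nothing precedes the first <a>

print(repr(next_node))  # '\n'
print(next_tag["id"])   # link2
print(prev_node)        # None
```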

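4. Searching with find_all()

As the name suggests, find_all() returns every element that matches the given conditions. You can filter by tag name, by attributes, or by text.
Its signature is: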

find_all(name , attrs , recursive , text , **kwargs)
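
The name, attrs, and text arguments are covered in the sections that follow. The sketch below illustrates recursive, together with limit, which is also a find_all() argument although it is not shown in the signature above. The markup is made up and parsed with "html.parser".

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><section><p>two</p></section><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

all_p = div.find_all("p")                      # searches all descendants
direct_p = div.find_all("p", recursive=False)  # direct children only
first_two = div.find_all("p", limit=2)         # stop after two matches

print([p.get_text() for p in all_p])      # ['one', 'two', 'three']
print([p.get_text() for p in direct_p])   # ['one', 'three']
print([p.get_text() for p in first_two])  # ['one', 'two']
```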

4.1 name
Search for elements by tag name, for example:

html = """
<a  class="apple" id="link1">Apple</a>
<a  class="banana" id="link2">Banana<span>Emperor banana</span></a>
<p  class="cole" id="link3">Cola</p>

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# Find the a tags; the result behaves like a list
print(soup.find_all(name="a"))
# Output: [<a class="apple" id="link1">Apple</a>, <a class="banana" id="link2">Banana<span>Emperor banana</span></a>]

# Find the p tags
print(soup.find_all(name="p"))
# Output: [<p class="cole" id="link3">Cola</p>]

# Every result is a Tag, so searches can be nested.
# Here each a tag found is searched again for span tags inside it.
# Each inner result is again a list whose items are Tag objects.
for a in soup.find_all(name="a"):
    span = a.find_all(name='span')
    print(span)
# Output:
# []
# [<span>Emperor banana</span>]

# Because each span is a Tag, you can loop over the spans
# and read their text:
for a in soup.find_all(name="a"):
    span = a.find_all(name='span')
    for i in span:
        print(i.string)
        print(i.get_text())
# Output:
# Emperor banana
# Emperor banana
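
Besides a single string, the name argument also accepts a list of names, a compiled regular expression, or a function that takes a tag and returns True or False. A brief sketch with illustrative markup and "html.parser":

```python
import re

from bs4 import BeautifulSoup

html = "<body><a>link</a><b>bold</b><p>para</p></body>"
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names.
by_list = [t.name for t in soup.find_all(["a", "p"])]
# A regular expression is searched against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
# A function receives each tag and returns True to keep it.
by_func = [t.name for t in soup.find_all(lambda tag: len(tag.name) == 1)]

print(by_list)   # ['a', 'p']
print(by_regex)  # ['body', 'b']
print(by_func)   # ['a', 'b', 'p']
```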

4.2 attrs
Search by the attributes you pass in, for example:

html = """
<p class="apple" id="item_1">Apple</p>
<p class="coffee" id="item_2">Coffee</p>
<p class="cole" id="item_3">Cola</p>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# Search by the value of id

# 1 Pass a dictionary through the attrs argument
id_1 = soup.find_all(attrs={'id':'item_1'})

# 2 Or pass id directly as a keyword argument
id_2 = soup.find_all(id="item_1")

print(id_1)
print(id_2)
# Output:
# [<p class="apple" id="item_1">Apple</p>]
# [<p class="apple" id="item_1">Apple</p>]

# Search by the value of class

# 1 Pass a dictionary through the attrs argument
class_1 = soup.find_all(attrs={'class':'coffee'})

# 2 Or pass class directly as a keyword argument;
#   class is a reserved word in Python, so the keyword is spelled class_ with a trailing underscore.
class_2 = soup.find_all(class_="coffee")

print(class_1)
print(class_2)
# Output:
# [<p class="coffee" id="item_2">Coffee</p>]
# [<p class="coffee" id="item_2">Coffee</p>]

Notes:

1 The attrs argument must be a dictionary.

2 The result of the search is a list.

3 When passing class directly as a keyword argument, add a trailing underscore (class_).
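
Two cases where these notes matter in practice: attribute names such as data-id are not valid Python keyword names, so they can only be passed through the attrs dictionary, and class_ matches a tag if any one of its classes equals the value. A minimal sketch with made-up markup and "html.parser":

```python
from bs4 import BeautifulSoup

html = (
    '<p class="fruit fresh" data-id="7">apple</p>'
    '<p class="drink" data-id="8">cola</p>'
)
soup = BeautifulSoup(html, "html.parser")

# data-id contains a hyphen, so it must go through the attrs dictionary.
by_data = soup.find_all(attrs={"data-id": "7"})
# class_ matches when any single class on the tag equals the value.
by_class = soup.find_all(class_="fresh")

print([p.get_text() for p in by_data])   # ['apple']
print([p.get_text() for p in by_class])  # ['apple']
```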

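4.3 text
The text argument matches against the text of nodes. It accepts either a string (exact match) or a compiled regular expression. The result is a list of the matching strings, not of tags. (Since Beautiful Soup 4.4.0 this argument is named string; text is the older name and is still accepted.) For example:

html = """
<p class="apple" id="item_1">Fresh apple</p>
<p class="coffee" id="item_2">Hot coffee</p>
<p class="cole" id="item_3">Cola</p>"""

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# Find the text nodes that contain "apple"
apple =  re.compile('apple')
print(soup.find_all(text=apple))
# Output: ['Fresh apple']

# Find the text nodes that contain "coffee"
coffee =  re.compile('coffee')
print(soup.find_all(text=coffee))
# Output: ['Hot coffee']

# Find the text nodes that are exactly "Cola"
print(soup.find_all(text="Cola"))
# Output: ['Cola']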
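
Text searches return strings, so getting back to the enclosing tag takes one more step. The sketch below shows two ways, assuming Beautiful Soup 4.4.0 or later (for the string keyword), illustrative markup, and "html.parser":

```python
import re

from bs4 import BeautifulSoup

html = '<p id="item_1">Fresh apple</p><p id="item_2">Hot coffee</p>'
soup = BeautifulSoup(html, "html.parser")

# Option 1: search for strings, then step up to each string's parent tag.
strings = soup.find_all(string=re.compile("apple"))
parent_ids = [s.parent["id"] for s in strings]

# Option 2: combine a tag name with string= to get the tags directly.
tags = soup.find_all("p", string=re.compile("coffee"))
tag_ids = [t["id"] for t in tags]

print(parent_ids)  # ['item_1']
print(tag_ids)     # ['item_2']
```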

4.4 find()

find() returns a single element: the first one that matches.

find_all() returns a list of every element that matches.

html = """<p class="item">Apple</p><p class="item">Cola</p>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# 1 find() returns a single element: the first one that matches.

item = soup.find(class_ = 'item')
print(item)
# Output: <p class="item">Apple</p>

# 2 find_all() returns a list of every element that matches.

items = soup.find_all(class_ = 'item')
print(items)
# Output: [<p class="item">Apple</p>, <p class="item">Cola</p>]
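
The two methods also differ when nothing matches: find() returns None, while find_all() returns an empty list. A short sketch with illustrative markup and "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="item">Apple</p>', "html.parser")

missing_one = soup.find("table")      # no <table> in the document
missing_all = soup.find_all("table")

print(missing_one)       # None
print(len(missing_all))  # 0
```

Code that uses find() should therefore check for None before reading attributes or text from the result.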

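4.5 Several other search methods take the same arguments as find_all() and find(); they differ only in which part of the tree they search.

> - find_parents() and find_parent(): the former returns all matching ancestors, the latter returns the nearest matching ancestor.

> - find_next_siblings() and find_next_sibling(): the former returns all matching siblings after the node, the latter returns the first matching sibling after it.

> - find_previous_siblings() and find_previous_sibling(): the former returns all matching siblings before the node, the latter returns the first matching sibling before it.

> - find_all_next() and find_next(): the former returns all matching nodes after the node, the latter returns the first matching node after it.

> - find_all_previous() and find_previous(): the former returns all matching nodes before the node, the latter returns the first matching node before it.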
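
A compact sketch of several of these methods, starting from the middle paragraph of some made-up markup (parsed with "html.parser"):

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

parent_id = b.find_parent("div")["id"]
next_id = b.find_next_sibling("p")["id"]
prev_id = b.find_previous_sibling("p")["id"]
after_ids = [t["id"] for t in b.find_all_next("p")]
before_ids = [t["id"] for t in b.find_all_previous("p")]

print(parent_id)   # box
print(next_id)     # c
print(prev_id)     # a
print(after_ids)   # ['c']
print(before_ids)  # ['a']
```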

5. CSS selectors

5.1 To find elements with a CSS selector, call the select() method and pass it the selector, for example:

html = '''
<p class="fruits">Fruit set
    <a class="fruits" id="apple">Apple</a>
    <a class="fruits" id="banana">Banana</a></p>
<p class="drink">Drink set
    <a class="drink" id="coffee">Coffee</a></p>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# 1 Find the element with id=apple inside an element with class=fruits
print(soup.select('.fruits #apple'))
# Output: [<a class="fruits" id="apple">Apple</a>]

# 2 Find the element with id=coffee
print(soup.select('#coffee'))
# Output: [<a class="drink" id="coffee">Coffee</a>]

# 3 Get every a tag inside a p tag
print(soup.select('p a'))
# Output: [<a class="fruits" id="apple">Apple</a>, <a class="fruits" id="banana">Banana</a>, <a class="drink" id="coffee">Coffee</a>]
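
A few more selector forms that are often useful: select_one() for a single match, the child combinator, and attribute selectors. This sketch reuses the same markup in English and assumes a Beautiful Soup version that bundles the Soup Sieve selector engine; "html.parser" is used so it runs without extra installs.

```python
from bs4 import BeautifulSoup

html = '''
<p class="fruits">Fruit set
    <a class="fruits" id="apple">Apple</a>
    <a class="fruits" id="banana">Banana</a></p>
<p class="drink">Drink set
    <a class="drink" id="coffee">Coffee</a></p>'''
soup = BeautifulSoup(html, "html.parser")

first_link = soup.select_one("p a")        # first match, or None if there is none
drink_links = soup.select("p.drink > a")   # direct <a> children of <p class="drink">
b_links = soup.select('a[id^="b"]')        # <a> tags whose id starts with "b"

print(first_link["id"])               # apple
print([a["id"] for a in drink_links]) # ['coffee']
print([a["id"] for a in b_links])     # ['banana']
```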

5.2 Getting attributes

html = '''
<p class="fruits">Fruit set
    <a class="fruits" id="apple">Apple</a>
    <a class="fruits" id="banana">Banana</a></p>
<p class="drink">Drink set
    <a class="drink" id="coffee">Coffee</a></p>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# Select every a tag inside a p tag
a = soup.select('p a')
# Loop over the matches and read each id attribute
for i in a:
    # 1 Index the tag with the attribute name in square brackets
    print(i['id'])

    # 2 Read the attribute through the attrs dictionary
    print(i.attrs['id'])

# Output (each id is printed twice, once per access style):
# apple
# apple
# banana
# banana
# coffee
# coffee

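5.3 Getting text
Use the string attribute or the get_text() method.

html = '''
<p class="fruits">Fruit set
    <a class="fruits" id="apple">Apple</a>
    <a class="fruits" id="banana">Banana</a></p>
<p class="drink">Drink set
    <a class="drink" id="coffee">Coffee</a></p>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# Select every a tag inside a p tag
a = soup.select('p a')
# Loop over the matches and read each tag's text
for i in a:
    # 1 Use the string attribute to get the text
    print(i.string)

    # 2 Use the get_text() method to get the text
    print(i.get_text())

# Output (each text is printed twice, once per approach):
# Apple
# Apple
# Banana
# Banana
# Coffee
# Coffee

Here the two approaches give the same result, because each a tag contains a single text node.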
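
Attribute and text extraction are often combined in one pass over the select() results. A minimal sketch that builds an id-to-text mapping from the same markup (in English), using "html.parser":

```python
from bs4 import BeautifulSoup

html = '''
<p class="fruits">Fruit set
    <a class="fruits" id="apple">Apple</a>
    <a class="fruits" id="banana">Banana</a></p>
<p class="drink">Drink set
    <a class="drink" id="coffee">Coffee</a></p>'''
soup = BeautifulSoup(html, "html.parser")

# Map each link's id attribute to its stripped text.
links = {a["id"]: a.get_text(strip=True) for a in soup.select("p a")}

print(links)  # {'apple': 'Apple', 'banana': 'Banana', 'coffee': 'Coffee'}
```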

For more on CSS selector syntax, see: http://www.w3school.com.cn/cssref/css_selectors.asp


