The Beautiful Soup Library

Author: 天道酬勤_FUN | Published: 2017-04-18 13:37


Installation

Press Win+X and open Command Prompt (start the console with administrator privileges).
Enter the install command:

pip install beautifulsoup4

A quick test of the Beautiful Soup installation

Demo HTML page: http://python123.io/ws/demo.html

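import requests
from bs4 import BeautifulSoup

# Fetch the demo page; r.text holds the HTML source as a string
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())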

Basic elements of the Beautiful Soup library

Importing the Beautiful Soup library
The Beautiful Soup library is also known as beautifulsoup4 or bs4.

from bs4 import BeautifulSoup

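from bs4 import BeautifulSoup
# Parse an HTML string
soup = BeautifulSoup("<html>data</html>", "html.parser")
# Parse an open file; the with block closes the file afterwards
with open("D:/demo.html") as f:
    soup2 = BeautifulSoup(f, "html.parser")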

A BeautifulSoup object corresponds to the entire content of an HTML/XML document.
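
As a rough illustration of the two construction styles above, the sketch below parses the same markup once from a string and once from a file handle. The markup and the temporary file are made up for this example (the temporary file stands in for a local file such as demo.html), and the variable names are hypothetical.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Write the same markup to a temporary file, then parse it from the open handle.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(markup)
    tmp_path = tmp.name

with open(tmp_path, encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")
os.remove(tmp_path)

# Both objects represent the same whole document.
print(soup_from_string.p.string)
print(str(soup_from_string) == str(soup_from_file))
```

Either way, the resulting object covers the full document, so the rest of the traversal code does not need to know where the markup came from.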

Parsers for the Beautiful Soup library

Parser	Usage	Requirement
Python's built-in HTML parser (via bs4)	BeautifulSoup(mk, 'html.parser')	Install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
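
Because lxml and html5lib are optional installs, one possible way to handle the table above is to try the preferred parsers in order and fall back to html.parser. This is only a sketch: make_soup and its preferred argument are hypothetical names introduced here, not part of bs4. It relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try each parser in order, return (soup, parser name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p class='title'>hello</p>")
print(used)           # whichever parser was available first
print(soup.p.string)
```

Different parsers can build slightly different trees from the same broken markup, so pinning one parser explicitly is usually the safer choice once you know what is installed.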

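Basic elements of the BeautifulSoup class

Element	Description
Tag	A tag, the most basic unit of information organization; its start and end are marked with <> and </>
Name	The tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string
Comment	The comment portion of a string inside a tag; a special type, Comment, which is a subclass of NavigableString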
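
The five elements in the table can be inspected on a small document. The HTML snippet below is invented for this sketch (it is not the demo page), and it assumes the html.parser backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up snippet: one tag with attributes, one tag holding only a comment.
html = '<p class="title" id="intro"><b>The demo page</b></p><i><!-- a note --></i>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                       # Tag: the first <p> in the document
print(type(tag) is Tag)
print(tag.name)                    # Name
print(tag.attrs)                   # Attributes, as a dictionary
print(tag.string)                  # NavigableString reached through <p><b>...</b></p>
print(type(soup.b.string))

comment = soup.i.string            # Comment: the string inside <i> is a comment
print(isinstance(comment, Comment))
print(isinstance(comment, NavigableString))
```

Note that .string returns a Comment without the <!-- --> markers, so checking the type is the reliable way to tell a comment from ordinary text.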

Traversing HTML content with bs4

Recap of demo.html

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head><body><p class="title"><b>The demo python introduces several python courses.</b></p><p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p></body></html>'

Basic HTML structure

<html>
    <head>
        <title>This is a python demo page</title>
    </head>
    <body>
        <p class="title">
            <b>The demo python introduces several python courses.</b>
        </p>
        <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
            <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
             and 
            <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>
            .
        </p>
    </body>
</html>

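Downward traversal of the tag tree

Attribute	Description
.contents	List of child nodes; stores all direct children of <tag> in a list
.children	Iterator over child nodes; similar to .contents, used to loop over direct children
.descendants	Iterator over descendant nodes; includes all descendants, used for loop traversal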

for child in soup.body.children:
    print(child)
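
To make the difference between the three attributes concrete, here is a small sketch on a compact, made-up snippet modelled loosely on the demo page. With the real page, whitespace between tags would also show up as string nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<body><p class="title"><b>Demo</b></p>'
    '<p class="course">Learn <a id="link1">Basic</a> and '
    '<a id="link2">Advanced</a>.</p></body>'
)
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a list of direct children only.
child_names = [child.name for child in body.contents]
print(child_names)

# .children iterates over the same nodes as .contents.
same_nodes = list(body.children) == body.contents
print(same_nodes)

# .descendants walks the whole subtree, tags and strings alike.
all_descendants = list(body.descendants)
descendant_tags = [node.name for node in all_descendants if isinstance(node, Tag)]
print(len(all_descendants), descendant_tags)
```

Filtering with isinstance(node, Tag) is a common way to skip the string nodes when only tags are of interest.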

Upward traversal of the tag tree

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over ancestors

>>> soup = BeautifulSoup(demo, "html.parser")
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
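
The sketch below collects the ancestor names into a list so the order is easy to see. The snippet is invented for the example. The last ancestor is the BeautifulSoup object itself, whose name is '[document]' and whose own .parent is None.

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="course">Learn <a id="link1">Basic</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parent gives the immediate parent tag.
print(soup.a.parent.name)

# .parents walks upward, ending at the BeautifulSoup object.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The document object sits at the top of the tree.
print(soup.parent)
```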

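Sibling (parallel) traversal of the tag tree

Attribute	Description
.next_sibling	Returns the next sibling node in HTML text order
.previous_sibling	Returns the previous sibling node in HTML text order
.next_siblings	Iterator; returns all following sibling nodes in HTML text order
.previous_siblings	Iterator; returns all preceding sibling nodes in HTML text order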

for sibling in soup.a.next_siblings:
    print(sibling)
for sibling in soup.a.previous_siblings:
    print(sibling)
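
One detail worth seeing in practice is that siblings are not always tags: text between tags is a sibling node too. The sketch below uses a made-up one-paragraph snippet to show this and to filter the siblings down to tags.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<p>Learn <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# Siblings share the same parent (<p> here) and include text nodes.
following = list(first_link.next_siblings)
preceding = list(first_link.previous_siblings)
print([str(node) for node in following])
print([str(node) for node in preceding])

# The immediate next sibling is the text " and ", not the second link.
print(isinstance(first_link.next_sibling, NavigableString))

# Keep only tag siblings when strings are not of interest.
tag_siblings = [node for node in following if isinstance(node, Tag)]
print([t["id"] for t in tag_siblings])
```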

Formatted HTML output with bs4

The prettify() method of bs4
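
A minimal sketch of prettify() on an invented one-line document follows. It can be called on the whole BeautifulSoup object or on a single tag, and in both cases it returns a string with line breaks and indentation added.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Format the whole document.
pretty = soup.prettify()
print(pretty)

# Format a single tag.
pretty_p = soup.p.prettify()
print(pretty_p)
```

Since prettify() inserts whitespace, it is best suited to reading and debugging; str(soup) keeps the markup compact.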

Article link: https://www.haomeiwen.com/subject/raufzttx.html