Using bs4 in Python

Author: BerL1n | Published: 2018-04-02 15:18

Import
from bs4 import BeautifulSoup

First we create a string, which the examples that follow use for demonstration

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
...
"""

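Create a BeautifulSoup object (naming the parser explicitly avoids the warning bs4 emits when it has to guess one)

soup = BeautifulSoup(html, 'html.parser')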

We can also create the object from a local HTML file, for example

soup = BeautifulSoup(open('index.html'), 'html.parser')

The line above opens the local file index.html and uses it to create the soup object. Next, we print the contents of the soup object as formatted output

print(soup.prettify())
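
As a minimal sketch of the file-based variant, the snippet below wraps open() in a with-block so the file handle is closed after parsing. The temporary index.html and its contents are made up for illustration and are not part of the original example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small sample page to a temporary directory so the sketch is
# self-contained. The file contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Demo</title></head><body><p>Hello</p></body></html>")

    # The with-block closes the file handle once parsing is done.
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

print(file_soup.title.string)  # Demo
print(file_soup.prettify())
```

The document is fully parsed inside the with-block, so file_soup remains usable after the file has been closed.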

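Specifying the encoding: when the HTML uses an encoding other than UTF-8 or ASCII, such as GB2312, the encoding must be specified so that BeautifulSoup can parse the document correctly. In bs4 the argument is named from_encoding (fromEncoding is the older BeautifulSoup 3 spelling), and it only takes effect when the markup is passed as bytes. Here respHtml stands for the downloaded page content.

htmlCharset = "GB2312"
soup = BeautifulSoup(respHtml, 'html.parser', from_encoding=htmlCharset)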
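
To see the encoding hint in action without a network request, here is a small sketch that encodes a made-up page to GB2312 bytes and parses it back. The page text and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded to GB2312 bytes to stand in for the
# raw body of an HTTP response.
page_text = "<html><head><title>中文标题</title></head><body><p>你好</p></body></html>"
raw_bytes = page_text.encode("gb2312")

# from_encoding tells bs4 how to decode the bytes before parsing.
gb_soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gb2312")

print(gb_soup.title.string)
print(gb_soup.p.string)
```

If the markup has already been decoded to a str, from_encoding is unnecessary and bs4 ignores it.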
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
import re

String to be analyzed:

# The p and b tags are assumed from the standard Beautiful Soup sample document;
# the soup.p examples that follow depend on them.
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

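Create a BeautifulSoup object from the HTML string (html_doc is already a str, so from_encoding is not needed):
soup = BeautifulSoup(html_doc, 'html.parser')

Print the first title tag:
print(soup.title)

Print the tag name of the first title tag:
print(soup.title.name)

Print the text contained in the first title tag:
print(soup.title.string)

Print the tag name of the parent of the first title tag:
print(soup.title.parent.name)

Print the first p tag:
print(soup.p)

Print the class attribute of the first p tag:
print(soup.p['class'])

Print the href attribute of the first a tag:
print(soup.a['href'])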
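
Attribute lookups like the ones above can fail when a tag or attribute is missing. The following is a small sketch of the safer access patterns, using a made-up fragment rather than the article's sample document.

```python
from bs4 import BeautifulSoup

# A hypothetical fragment, kept small so each result is easy to check.
snippet = '<p class="title big"><b>Heading</b></p><a id="link1">no href here</a>'
demo = BeautifulSoup(snippet, "html.parser")

# class is a multi-valued attribute, so it comes back as a list.
print(demo.p["class"])          # ['title', 'big']

# Indexing a missing attribute raises KeyError; .get() returns None instead.
print(demo.a.get("href"))       # None
print(demo.a.get("href", "#"))  # falls back to the given default

# Asking for a tag that does not exist returns None rather than raising.
print(demo.table)               # None
```

Checking for None before chaining further lookups avoids the AttributeError and TypeError that otherwise appear on pages with unexpected markup.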
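A tag's attributes can be added, deleted, or modified. Once again, they are handled exactly like a dictionary.

Change the href attribute of the first a tag to http://www.baidu.com/:
soup.a['href'] = 'http://www.baidu.com/'

Add a name attribute to the first a tag:
soup.a['name'] = u'百度'

Delete the class attribute of the first a tag:
del soup.a['class']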
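
The effect of these three operations is easiest to see on a single tag. This sketch repeats them on a one-link document (a made-up fragment, with an ASCII value for the name attribute) and prints the result.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document used only for this illustration.
edit_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = edit_soup.a

link["href"] = "http://www.baidu.com/"  # modify an existing attribute
link["name"] = "baidu"                  # add a new attribute
del link["class"]                       # delete an attribute

print(link.attrs)
print(link)
```

The changes are made to the parse tree in place, so printing edit_soup afterwards shows the modified markup.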

Print all child nodes of the first p tag:
print(soup.p.contents)

Print the first a tag:
print(soup.a)

Print all a tags as a list:
print(soup.find_all('a'))

Print the first tag whose id attribute equals link3 (here, an a tag):
print(soup.find(id="link3"))
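
find_all and find accept more filters than a bare tag name. The sketch below shows a few common ones on a fragment modeled on the three-sisters sample; the variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the three-sisters sample.
links_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
links = BeautifulSoup(links_html, "html.parser")

# Filter by CSS class; the keyword is class_ because class is reserved in Python.
sisters = links.find_all("a", class_="sister")

# Stop after the first two matches.
first_two = links.find_all("a", limit=2)

# Attribute values can also be matched with a regular expression.
lacie_or_tillie = links.find_all("a", id=re.compile(r"link[23]"))

# find() returns None when nothing matches, so check before using the result.
missing = links.find(id="link9")

print([a.string for a in sisters])
print([a["id"] for a in first_two])
print([a["id"] for a in lacie_or_tillie])
print(missing)
```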

Get all of the text content:
print(soup.get_text())
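
By default get_text() keeps the whitespace exactly as it appears in the source. A short sketch of the separator and strip options, on a made-up fragment, shows how to get cleaner output.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with untidy whitespace.
text_soup = BeautifulSoup(
    "<p>  Once upon a time <b>three</b>\n little sisters </p>", "html.parser"
)

# Raw text: whitespace and newlines are preserved.
raw_text = text_soup.get_text()

# Strip each piece of text and join the pieces with a single space.
clean_text = text_soup.get_text(separator=" ", strip=True)

# stripped_strings yields the same stripped pieces one at a time.
pieces = list(text_soup.stripped_strings)

print(repr(raw_text))
print(clean_text)  # Once upon a time three little sisters
print(pieces)
```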

Print all attributes of the first a tag:
print(soup.a.attrs)

for link in soup.find_all('a'):
    # Get the href attribute of each link
    print(link.get('href'))

Loop over the child nodes of soup.p and print each one:
for child in soup.p.children:
    print(child)
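
The difference between .contents, .children, and .descendants is easy to miss. Here is a minimal sketch on a made-up nested fragment; the isinstance check is one way, among others, to skip text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical nested fragment.
tree = BeautifulSoup("<p>Hello <b>big <i>world</i></b>!</p>", "html.parser")
p = tree.p

# .contents is a list of direct children: text, the b tag, text.
direct = p.contents

# .children iterates over the same direct children without building a list.
child_tags = [c.name for c in p.children if isinstance(c, Tag)]

# .descendants walks the whole subtree, so nested tags are included too.
all_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(len(direct))  # 3
print(child_tags)   # ['b']
print(all_tags)     # ['b', 'i']
```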

Regular-expression matching: tags whose names contain the letter b:
for tag in soup.find_all(re.compile("b")):
    print(tag.name)
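
Because the pattern is applied with search(), it matches anywhere in the tag name. The sketch below, on a made-up document, contrasts an unanchored pattern with an anchored one.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document containing several tag names with the letter b.
regex_soup = BeautifulSoup(
    "<html><head><title>T</title></head>"
    "<body><p><b>bold</b></p><table></table></body></html>",
    "html.parser",
)

# Unanchored: any tag name containing "b".
contains_b = [t.name for t in regex_soup.find_all(re.compile("b"))]

# Anchored with ^: only tag names that start with "b".
starts_with_b = [t.name for t in regex_soup.find_all(re.compile("^b"))]

print(contains_b)     # ['body', 'b', 'table']
print(starts_with_b)  # ['body', 'b']
```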

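from bs4 import BeautifulSoup  # import the BeautifulSoup class
Soup = BeautifulSoup(html, 'html.parser')  # html can be a string or an open file handle
Note that BeautifulSoup automatically detects the encoding of the input and converts it to Unicode.
With these two statements, BS automatically builds the document into the parse tree shown in the figure above.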

Article link: https://www.haomeiwen.com/subject/hxfkqftx.html
