BeautifulSoup

Author: BigBigTang | Published: 2019-02-25 21:55

Crawler task 2

Importing and creating a soup object

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

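| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed markup | Poor tolerance of malformed markup in Python versions before 2.7.3 (or 3.2.2) |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed markup | Requires a C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed markup; parses the document the way a browser does; produces valid HTML5 | Slow; pure Python (no C extension), but still a separate package to install |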
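Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that. The helper name make_soup and its preferred tuple are hypothetical, chosen only for illustration; the sketch relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not available in the current environment; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately sloppy markup: neither <p> nor <b> is closed.
soup, parser_used = make_soup("<p>Hello<b>world")
print(parser_used)    # depends on which parsers are installed
print(soup.b.string)  # the text inside the repaired <b> tag
```

Listing "html.parser" last means the helper always succeeds on a standard Python installation, at the cost of slightly different repair behavior depending on which parser ends up being used.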

Example 1

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

# Parse the sample document with the lxml HTML parser
soup = BeautifulSoup(html, 'lxml')

print(soup.title)         # the whole <title> tag
print(soup.title.string)  # only the text inside <title>
print(type(soup.title))   # tags are bs4.element.Tag objects
print(soup.head)          # the <head> tag and everything inside it
print(soup.p)             # the first <p> tag in the document

Output:

<title>The Dormouse's story</title>
The Dormouse's story
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

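Basic usage

soup.title returns the whole <title> tag (a Tag object), while soup.title.string returns only the text inside it. Note that .string is None when a tag has more than one child node.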
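To make the difference concrete, here is a small sketch. The one-line sample document is made up for illustration, and it uses the built-in "html.parser" so that nothing extra needs to be installed. It also shows get_text(), which collects text even where .string gives None.

```python
from bs4 import BeautifulSoup

doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>One <b>two</b></p></body></html>"
)
soup = BeautifulSoup(doc, "html.parser")

title_tag = soup.title          # the Tag object for <title>
title_text = soup.title.string  # only the text inside the tag

print(type(title_tag).__name__)  # Tag
print(title_text)                # The Dormouse's story

# <p> has two children (a string and a <b> tag), so .string is None,
# while get_text() concatenates all of the nested text.
p_string = soup.p.string
p_text = soup.p.get_text()
print(p_string)  # None
print(p_text)    # One two
```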

Getting attributes

print(soup.p.attrs['name'])

print(soup.p['name'])

dromouse

dromouse
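Indexing a tag with a missing attribute raises KeyError, so scraping code often uses .get() instead. The following is a minimal sketch on a made-up one-line document; it also shows that class is returned as a list, because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title story" name="dromouse">text</p>', "html.parser")
p = soup.p

name_value = p["name"]      # same as p.attrs["name"]
missing_value = p.get("id") # .get() returns None instead of raising KeyError
class_value = p["class"]    # multi-valued attribute, returned as a list

print(name_value)     # dromouse
print(missing_value)  # None
print(class_value)    # ['title', 'story']
```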

Nested selection

print(soup.head.title.string)

The Dormouse's story
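Chained access works because each step returns a Tag that can be searched again. One caveat is that a tag name that does not occur yields None, so a longer chain can fail with AttributeError. A minimal sketch on a made-up document, with one possible way to guard against that:

```python
from bs4 import BeautifulSoup

doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>bold text</b></p></body></html>"
)
soup = BeautifulSoup(doc, "html.parser")

# Each attribute access returns the first matching tag below the current one.
bold = soup.body.p.b.string
print(bold)  # bold text

# There is no <table> in this document, so the lookup returns None.
table = soup.table
print(table)  # None

# Guard before going one level deeper; soup.table.tr would raise AttributeError here.
first_row = table.tr if table is not None else None
```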

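Child and descendant nodes

.contents returns a list of all the direct children of a tag, both tags and text nodes.

The output in the rest of this section appears to come from a variant of the sample document that is not shown here: one whose first <p> is the story paragraph, with "Elsie" wrapped in a <span> inside the first link.

print(soup.p.contents)

Output:

['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']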
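As the output shows, .contents mixes tags with text nodes, including strings that are only whitespace. If only the child tags are of interest, one option is to filter by type. This is a sketch on a shortened, made-up paragraph, not the document used above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = '<p>\nfirst\n<a id="link1"><span>Elsie</span></a>\n<a id="link2">Lacie</a>\n</p>'
soup = BeautifulSoup(doc, "html.parser")

children = soup.p.contents  # a plain Python list
print(len(children))        # 5: three text nodes and two <a> tags

# Keep only the children that are tags, dropping the text nodes.
child_tags = [c for c in children if isinstance(c, Tag)]
child_ids = [t["id"] for t in child_tags]
print(child_ids)            # ['link1', 'link2']
```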

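.children also gives the direct children, like .contents, but it returns an iterator instead of a list, so you have to loop over it to see the nodes.

print(soup.p.children)

for i, child in enumerate(soup.p.children):
    print(i, child)

Output: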

<list_iterator object at 0x1064f7dd8>

0

Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>

2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

4

and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

6

and they lived at the bottom of a well.
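Because .children is an iterator, it can be consumed only once; a second pass over the same object yields nothing. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one<a id="link1">Elsie</a>two</p>', "html.parser")

it = soup.p.children
first_pass = list(it)   # consumes the iterator
second_pass = list(it)  # already exhausted, so this is empty

print(len(first_pass))                # 3
print(second_pass)                    # []
print(first_pass == soup.p.contents)  # True: same nodes as .contents
```

If the children are needed more than once, either use .contents or access soup.p.children again to get a fresh iterator.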

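.descendants yields every node nested under the tag, not only its direct children (compare with the .children output above), and it also returns an iterator (a generator).

print(soup.p.descendants)

for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output: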

<generator object descendants at 0x10650e678>

0

Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>

2

3 <span>Elsie</span>

4 Elsie

5

6

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

8 Lacie

9

and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

11 Tillie

12

and they lived at the bottom of a well.
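The difference between the two traversals is easiest to see by counting nodes. The sketch below uses a small made-up paragraph: .children stops at the first level, while .descendants walks depth-first into the <a> and <span> tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<p>Hi <a id="link1"><span>Elsie</span></a> bye</p>', "html.parser")

children = list(soup.p.children)
descendants = list(soup.p.descendants)

print(len(children))     # 3: 'Hi ', <a>, ' bye'
print(len(descendants))  # 5: 'Hi ', <a>, <span>, 'Elsie', ' bye'

# Names of the tags met during the depth-first walk, in document order.
tag_names = [d.name for d in descendants if isinstance(d, Tag)]
print(tag_names)         # ['a', 'span']
```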