BeautifulSoup

Author: BigBigTang | Published: 2019-02-25 21:55

Import and usage

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

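Parser | Usage | Advantages | Disadvantages
--- | --- | --- | ---
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only XML parser supported | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; parses the document the way a browser does; produces valid HTML5 | Slow; needs the separate html5lib Python package (no C extension required)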
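To make the table a bit more concrete, here is a minimal sketch of passing the parser name as the second argument. It sticks to "html.parser" because that one ships with Python; the markup string and variable names are made up for illustration. Switching to "lxml" or "html5lib" should only require changing that argument once the package is installed, although each parser may repair broken markup a little differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately unclosed markup (not from the article).
broken_html = "<p>Hello <b>world"

# "html.parser" is bundled with Python, so nothing extra needs installing.
soup = BeautifulSoup(broken_html, "html.parser")

# The dangling tags are closed while the tree is built.
print(soup.p)
print(soup.b.string)
```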

Example 1

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

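from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.title)         # the whole <title> tag
print(soup.title.string)  # only the text inside <title>
print(type(soup.title))   # bs4.element.Tag
print(soup.head)          # the whole <head> tag
print(soup.p)             # the first <p> tag only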

Output

<title>The Dormouse's story</title>
The Dormouse's story
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Basic usage

soup.title returns the entire <title> tag, whereas soup.title.string returns only the text inside it.
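The difference is easier to see by checking the types. The sketch below uses a tiny made-up document (not the one from Example 1) and also shows a case worth knowing: when a tag has more than one child, .string returns None, and get_text() is the usual way to collect all the nested text.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup for illustration only.
soup = BeautifulSoup("<title>Hi</title><p>a<b>b</b></p>", "html.parser")

title_tag = soup.title          # the whole <title> element
title_text = soup.title.string  # only the text inside it

print(isinstance(title_tag, Tag))
print(isinstance(title_text, NavigableString))

# <p> has two children (a string and a <b> tag), so .string is None here;
# get_text() joins all nested text instead.
print(soup.p.string)
print(soup.p.get_text())
```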

Getting attributes

print(soup.p.attrs['name'])  # look the attribute up in the attrs dict
print(soup.p['name'])        # shorthand: index the tag directly

Output

dromouse
dromouse
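A few more attribute lookups, sketched on a one-line made-up tag. The points it tries to show: .attrs is a plain dict, multi-valued attributes such as class come back as a list, and .get() returns None for a missing attribute where indexing would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the extra "big" class is added for illustration.
soup = BeautifulSoup('<p class="title big" name="dromouse">x</p>', "html.parser")
p = soup.p

print(p.attrs)      # every attribute, as a dict
print(p["name"])    # same value as p.attrs["name"]
print(p["class"])   # class is multi-valued, so this is a list
print(p.get("id"))  # missing attribute: None instead of KeyError
```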

Nested selection

print(soup.head.title.string)

Output

The Dormouse's story
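Chaining works at any depth, but each step returns only the first matching tag, and a name with no match gives None. A small sketch with a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration only.
doc = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

print(soup.head.title.string)
print(soup.body.p.b.string)

# There is no <link> inside <head>, so this is None; chaining further
# from it would raise AttributeError, so check before going deeper.
missing = soup.head.link
print(missing)
```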

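Children and descendants

.contents returns a list of all direct children of a tag. Note that the outputs in this section do not match the Example 1 document: they appear to come from a variant in which the first <p> is the story paragraph and the first link wraps <span>Elsie</span>.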

print(soup.p.contents)

Output

['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
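Since the real output is hard to read, here is a smaller sketch of the same idea on a made-up one-line paragraph. It shows that .contents is an ordinary list mixing text nodes and tags, so len() and indexing work on it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
soup = BeautifulSoup("<p>Hello <a>one</a> and <a>two</a></p>", "html.parser")
children = soup.p.contents

print(len(children))
# Text between tags shows up as NavigableString entries.
print([type(c).__name__ for c in children])
# .contents is a plain list, so it can be indexed directly.
print(children[1])
```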

.children also returns the direct children, like .contents, but as an iterator, so you have to loop over it to see the nodes

print(soup.p.children)

for i, child in enumerate(soup.p.children):
    print(i, child)

Output

<list_iterator object at 0x1064f7dd8>

0

            Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">

<span>Elsie</span>

</a>

2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

4

            and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

6

            and they lived at the bottom of a well.
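As the output shows, the whitespace between tags counts as a child too. If only the tags matter, one common approach is to filter with isinstance. This is a sketch on made-up markup, not something the original article does:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between the links.
soup = BeautifulSoup("<p>\n<a>one</a>\n<a>two</a>\n</p>", "html.parser")

# Every child, including the newline text nodes.
all_children = list(soup.p.children)

# Keep only real tags and drop the whitespace strings.
tag_children = [c for c in soup.p.children if isinstance(c, Tag)]

print(len(all_children))
print([t.string for t in tag_children])
```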

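.descendants returns all descendant nodes (not only the direct children; compare with the output above). It also returns an iterator, in this case a generator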

print(soup.p.descendants)

for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output

<generator object descendants at 0x10650e678>

0

            Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">

<span>Elsie</span>

</a>

2

3 <span>Elsie</span>

4 Elsie

5

6

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

8 Lacie

9

            and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

11 Tillie

12

            and they lived at the bottom of a well.
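To compare the two side by side, the sketch below counts both on a small made-up paragraph. The direct children stop at the <a> tag, while the descendants continue into it and visit the nested <span> and its text in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
soup = BeautifulSoup("<p>Hello <a><span>one</span></a></p>", "html.parser")

direct = list(soup.p.children)     # 'Hello ' and the <a> tag
nested = list(soup.p.descendants)  # also the <span> and its text

print(len(direct))
print(len(nested))
for i, node in enumerate(nested):
    print(i, repr(node))
```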


Article link: https://www.haomeiwen.com/subject/ulenyqtx.html