(4) BeautifulSoup Library Basics | Python3 Web Crawl...

Author: 努力奋斗的durian | Published: 2018-02-05 18:22

Last updated: 2018-02-05

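1. Understanding the BeautifulSoup library

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.

1.1 Installing the BeautifulSoup library

At a command prompt (cmd), run: pip install beautifulsoup4

1.2 How the Beautiful Soup library works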



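Simply put, the BeautifulSoup library converts an HTML document into a BeautifulSoup object, and we can then use that object's methods to extract the elements we need.
Note: Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".

1.3 Importing the Beautiful Soup library

The Beautiful Soup library is also known as beautifulsoup4 or bs4.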

from bs4 import BeautifulSoup
import bs4

1.4 Basic elements of the BeautifulSoup class

2. Beautiful Soup parsers
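
The second argument to BeautifulSoup() names the parser. As a minimal sketch (the snippet string and variable names are made up for illustration), the code below parses the same broken fragment with Python's built-in html.parser and, if it happens to be installed, with lxml:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical broken fragment: neither <p> nor <b> is closed.
snippet = "<p>Hello<b>world"

# "html.parser" ships with Python, so it needs no extra installation.
builtin_soup = BeautifulSoup(snippet, "html.parser")
print(builtin_soup)

# "lxml" is a third-party parser (pip install lxml); skip it if missing.
try:
    lxml_soup = BeautifulSoup(snippet, "lxml")
    print(lxml_soup)
except FeatureNotFound:
    print("lxml is not installed")
```

Both parsers close the open tags. lxml additionally wraps the fragment in html and body tags, which is why the outputs in the rest of this article contain them.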

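3. Basic usage of Beautiful Soup

The HTML below is incomplete: the body and html tags are never closed. We parse it with BeautifulSoup. The parser fills in the missing closing tags, and soup.prettify() prints the result as an indented tree.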

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

The output of soup.prettify() shows that the missing tags have been filled in. The parsing code is as follows:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>> print(soup.title.string)
The Dormouse's story

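4. Beautiful Soup tag selectors

4.1 Selecting elements: Tag


Note:

  • Any tag that exists in the HTML can be accessed as soup.<tag>
  • When the document contains several <tag> elements with the same name, soup.<tag> returns the first one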
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(html,"lxml")
# the title tag
>>> print(soup.title)
<title>The Dormouse's story</title>
# the head tag
>>> print(soup.head)
<head><title>The Dormouse's story</title></head>
# the type of the title tag
>>> print(type(soup.title))
<class 'bs4.element.Tag'>
# the p tag: several tags share this name, so the first one is returned
>>> print(soup.p)
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

4.2 Getting the name: Tag.name

Every <tag> has a name, available as <tag>.name. It is a string.


>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.title.name)  # get the name of the title tag
title

4.3 Getting attributes: Tag.attrs


There are two ways, shown below:

>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.attrs["name"])  # method 1
dromouse
>>> print(soup.p["name"])  # method 2
dromouse
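
As a small sketch of what else the attribute interface offers (the one-line HTML here is a shortened, hypothetical variant of the example above, parsed with the built-in html.parser so it runs without lxml): attrs is a plain dict, class comes back as a list because it can hold several values, and .get() avoids a KeyError for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the article's first <p>, with two classes.
html = '<p class="title big" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.attrs)       # all attributes as a dict
print(p["class"])    # class is multi-valued, so it comes back as a list
print(p["name"])     # ordinary attributes come back as strings
print(p.get("id"))   # .get() returns None for a missing attribute
# p["id"] would raise KeyError instead
```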

4.4 Getting content: NavigableString

>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.p.string
"The Dormouse's story"

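One caveat is worth a quick sketch: .string only works when a tag has a single string underneath it. The fragment below is a made-up, shortened version of the story paragraph, parsed with the built-in html.parser; for a tag with several children, .string is None and get_text() is the usual fallback.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> holds text, an <a> tag, and more text.
html = '<p class="story">Once upon a time <a href="http://example.com/lacie">Lacie</a> lived.</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.a.string)      # <a> has exactly one string child
print(soup.p.string)      # None: <p> has several children
print(soup.p.get_text())  # concatenates all nested strings
```
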
4.5 Comment inside a Tag
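
A Comment is a special kind of NavigableString that holds the text of an HTML comment. The sketch below reuses the first a tag from the example document (parsed with the built-in html.parser, an assumption made so that it runs without lxml) and shows how to tell a comment apart from ordinary text:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(html, "html.parser")

content = soup.a.string
print(content)                       # the <!-- --> markers are stripped
print(type(content))
print(isinstance(content, Comment))  # check the type before treating it as text
```

Because the printed value looks like normal text, checking isinstance(content, Comment) is a simple way to avoid scraping comments by accident.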


4.6 Nested selection

Tag access can be chained to step down through the tag tree.

>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.head.title.string)
The Dormouse's story

4.7 Children and descendants: traversing down the tag tree


>>> html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

4.7.1 contents: returns all direct child nodes of the p tag as a list

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.contents)
['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

4.7.2 children: an iterator over all direct child nodes; loop over it to get them

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.children)
<list_iterator object at 0x0000000003789EF0>
>>> for i, child in enumerate(soup.p.children):
    print(i, child)  # i is the index, child is the node

    
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.

4.7.3 descendants
This also returns an iterator (a generator), but over all descendant nodes. Unlike children, it also yields nested nodes, such as the span tag inside the first a tag.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.descendants)
<generator object descendants at 0x000000000379C150>
>>> for i, child in enumerate(soup.p.descendants):
     print(i, child)

     
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
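
The listings above mix tags with whitespace strings. As a rough sketch (with a shortened, hypothetical version of the paragraph and the built-in html.parser), filtering on the Tag type makes the difference between children and descendants easier to see:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the story paragraph.
html = """
<p class="story">
    <a id="link1"><span>Elsie</span></a>
    <a id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag objects; the whitespace between tags is NavigableString.
child_tags = [node.name for node in soup.p.children if isinstance(node, Tag)]
descendant_tags = [node.name for node in soup.p.descendants if isinstance(node, Tag)]

print(child_tags)        # direct children only
print(descendant_tags)   # also includes the nested <span>
```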

4.8 Parents and ancestors: traversing up the tag tree


html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

4.8.1 parent
The parent of the a tag is the p tag.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.a.parent)
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>

4.8.2 parents
This yields every ancestor of the a tag: index 0 is p, 1 is body, 2 is html, and 3 is the whole document (the last two entries print identically).

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(list(enumerate(soup.a.parents)))
[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>)]
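
Printing whole ancestors is hard to read. A compact sketch (using a minimal, made-up document and the built-in html.parser) prints only the names, which also shows why the last two entries look alike: the final ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Minimal hypothetical document with explicit html and body tags.
html = "<html><body><p><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
print(soup.a.parent.name)   # the direct parent only
```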

4.9 Siblings: traversing sideways in the tag tree


>>> html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

4.9.1 next_siblings: all following sibling nodes

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(list(enumerate(soup.a.next_siblings)))
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]

4.9.2 previous_siblings: all preceding sibling nodes

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(list(enumerate(soup.a.previous_siblings)))
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

4.10 Summary

These methods alone are far from enough, because real-world matching also involves conditions on attributes and other properties.



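5. Beautiful Soup standard selectors

5.1 find_all(name, attrs, recursive, string, **kwargs)

  • name: a filter on the tag name
  • attrs: a filter on tag attribute values; specific attributes can be named
  • recursive: whether to search all descendants; defaults to True
  • string: a filter on the string content between <>…</>

Note:

  • Documents can be searched by tag name, attributes, and content
  • <tag>(..) is equivalent to <tag>.find_all(..)
    soup(..) is equivalent to soup.find_all(..)

Note: the return value is a list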
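
The call shorthand and the recursive parameter can be sketched as follows. The HTML is a trimmed, hypothetical version of the panel example, and the built-in html.parser is used so that the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <ul id="list-1"><li>Foo</li><li>Bar</li></ul>
    <ul id="list-2"><li>Jay</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Calling a tag or the soup object is shorthand for find_all().
same = soup("li") == soup.find_all("li")
print(same)

# recursive=False inspects only the direct children of the tag.
panel = soup.div
print(len(panel.find_all("li")))                   # searches all descendants
print(len(panel.find_all("li", recursive=False)))  # no <li> is a direct child
print(len(panel.find_all("ul", recursive=False)))  # both <ul> are direct children
```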

>>> html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
5.1.1 name

The basic usage; it returns a list

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')

>>> print(soup.find_all('ul'))
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

>>> print(type(soup.find_all('ul')[0]))
<class 'bs4.element.Tag'>

Extracting data by iterating over the results

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> for ul in soup.find_all("ul"):
    print(ul.find_all("li"))

    
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
5.1.2 attrs
>>> html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

Attributes are looked up with a dictionary

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.find_all(attrs={'id': 'list-1'}))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

>>> print(soup.find_all(attrs={'name': 'elements'}))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

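Some attributes can be searched without a dictionary by passing them as keyword arguments, for example id and class. Because class is a reserved word in Python, it takes a trailing underscore: class_.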

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.find_all(id='list-1'))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

>>> print(soup.find_all(class_='element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
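
Since class can hold several values, class_ matches a tag if any one of its classes equals the given value. A short sketch with the two ul tags from the example (stripped down to a hypothetical minimum and parsed with the built-in html.parser):

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"></ul>
<ul class="list list-small" id="list-2"></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class of a multi-class tag.
by_list = [ul["id"] for ul in soup.find_all(class_="list")]
by_small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
print(by_list)    # both <ul> tags carry the class "list"
print(by_small)   # only the second one carries "list-small"
```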
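5.1.3 string

Two li tags contain the text Foo. The result is the matching strings themselves, not the tags. This is useful for matching content, but less convenient for locating elements.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.find_all(string='Foo'))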
['Foo', 'Foo']
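
When the enclosing elements are what you want, two approaches can be sketched. The example assumes a bs4 release that accepts the string= argument (older releases call it text=), uses a made-up list that adds a "Food" item, and parses with the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical list; "Food" is added to show partial matching.
html = "<ul><li class='element'>Foo</li><li class='element'>Bar</li><li class='element'>Food</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# A compiled regex is applied with search(), so "Foo" and "Food" both match.
strings = soup.find_all(string=re.compile("Foo"))
print(strings)

# Each result is a NavigableString; .parent is the enclosing tag.
tags = [s.parent for s in strings]
print(tags)

# Combining a tag name with string= returns the tags themselves.
exact = soup.find_all("li", string="Foo")
print(exact)
```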

5.2 find(name, attrs, recursive, string, **kwargs)

find takes the same arguments as find_all. find returns a single element (the first match), while find_all returns all matches.

>>> html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')

>>> print(soup.find('ul'))
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

>>> print(type(soup.find('ul')))
<class 'bs4.element.Tag'>

>>> print(soup.find('page'))  # a tag that does not exist
None

5.3 find_parents() and find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
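
Both methods accept the same filters as find_all(), which lets you jump to a specific ancestor. A minimal sketch, using a trimmed, hypothetical version of the panel example and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-body">
        <ul class="list" id="list-1"><li class="element">Foo</li></ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

print(li.find_parent("ul")["id"])       # the enclosing <ul>
print(li.find_parent("div")["class"])   # the nearest <div> ancestor
div_classes = [d["class"] for d in li.find_parents("div")]
print(div_classes)                      # every <div> ancestor, nearest first
```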

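5.4 find_next_siblings() and find_next_sibling()

find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.

5.5 find_previous_siblings() and find_previous_sibling()

find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.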
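
A brief sketch of the sibling finders in both directions (the three-item list is hypothetical, and the built-in html.parser is used). Passing a tag name skips any whitespace strings that sit between the tags in real pages.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>"
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")

print(bar.find_next_sibling("li"))       # the <li> right after Bar
print(bar.find_previous_sibling("li"))   # the <li> right before Bar

first = soup.find("li")
print(first.find_previous_sibling("li"))                    # None: nothing precedes it
following = [li.string for li in first.find_next_siblings("li")]
print(following)
```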

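5.6 find_all_next() and find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

5.7 find_all_previous() and find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.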
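
Unlike the sibling methods, these four walk the whole document in order, so they can leave the parent of the starting tag. A small sketch under the same assumptions as before (made-up markup, built-in html.parser):

```python
from bs4 import BeautifulSoup

html = """
<div><h4>Hello</h4></div>
<ul><li>Foo</li><li>Bar</li></ul>
<p>End</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Forward from <h4>: the <li> tags live outside its parent <div>.
h4 = soup.h4
print(h4.find_next("li"))
after_h4 = [li.string for li in h4.find_all_next("li")]
print(after_h4)

# Backward from <p>: the nearest match comes first.
p = soup.p
print(p.find_previous("li"))
before_p = [li.string for li in p.find_all_previous("li")]
print(before_p)
```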

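6. CSS selectors

  • Pass a CSS selector straight to select() to make a selection
  • A tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. soup.select() filters elements this way and returns a list.
  • Note that a tag name and its attributes belong to the same node, so no space may separate them; otherwise nothing will match.
  • Reference for further study: https://www.cnblogs.com/yizhenfeng168/p/6979339.html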
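
The rule about spaces can be sketched like this. The HTML is a trimmed, hypothetical version of the panel example, parsed with the built-in html.parser; select() and select_one() are assumed to be available, as they are in current bs4 releases.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <ul class="list" id="list-1"><li class="element">Foo</li></ul>
    <ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# No space: the tag name and the class must belong to the same node.
same_node = [ul["id"] for ul in soup.select("ul.list")]
print(same_node)

# With a space: ".list" must be a descendant of a <ul>; nothing matches here.
descendant = soup.select("ul .list")
print(descendant)

# Several classes or an id on one node are chained without spaces too.
small = [ul["id"] for ul in soup.select("ul.list.list-small")]
print(small)
print(soup.select_one("ul#list-2 li.element").get_text())
```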
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.select('.panel .panel-heading'))
[<div class="panel-heading">
<h4>Hello</h4>
</div>]

>>> print(soup.select('ul li'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

>>> print(soup.select('#list-2 .element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>]

>>> print(type(soup.select('ul')[0]))
<class 'bs4.element.Tag'>

Iterating to select by tag name

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> for ul in soup.select('ul'):
    print(ul.select('li'))

    
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

6.1 Getting attributes

There are two ways, and they give the same result

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> for ul in soup.select('ul'):
    print(ul['id'])  # method 1
    print(ul.attrs['id'])  # method 2

    
list-1
list-1
list-2
list-2

6.2 Getting content

Get the text inside a tag

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'lxml')
>>> for li in soup.select('li'):
    print(li.get_text())

    
Foo
Bar
Jay
Foo
Bar
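
When get_text() is called on a container rather than a leaf tag, the whitespace between child tags comes along. A short sketch of two ways to clean that up, again with a trimmed, hypothetical list and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
ul = soup.select_one("ul")

print(repr(ul.get_text()))            # raw text, including newlines and indentation
joined = ul.get_text(" ", strip=True) # strip each piece, join with a space
print(joined)
pieces = list(ul.stripped_strings)    # the stripped pieces as separate strings
print(pieces)
```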

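7. Summary

  • The lxml parser is recommended; fall back to html.parser when necessary
  • Tag selectors (soup.<tag>) are fast but offer weak filtering
  • Use find() and find_all() to match a single result or multiple results
  • If you are familiar with CSS selectors, use select()
  • Remember the common ways to get attribute and text values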




    Article link: https://www.haomeiwen.com/subject/ilzbzxtx.html