A Simple Guide to Using BeautifulSoup4

Author: 流光汐舞 | Published: 2018-03-01 20:39
    Installing BeautifulSoup4

    pip install beautifulsoup4
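A quick way to confirm the installation worked is to import the package and parse a tiny document. This is a minimal sketch; the sample markup and the variable names `check_soup` and `installed_version` are made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Parse a tiny document with the parser bundled with Python's standard library.
check_soup = BeautifulSoup("<p>hello</p>", "html.parser")

# bs4 exposes its version string, which is handy when reporting problems.
installed_version = bs4.__version__
print(installed_version)
print(check_soup.p.string)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.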

    Using BeautifulSoup4

    The examples below use the following HTML document:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    
    Common methods
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.prettify())
    
    # Get the title tag
    print(soup.title)
    # <title>The Dormouse's story</title>
    
    # Get the name of the title tag
    print(soup.title.name)
    # title
    
    # Get the text inside the title tag
    print(soup.title.string)
    # The Dormouse's story
    
    # Get the parent tag of title
    print(soup.title.parent)
    # <head><title>The Dormouse's story</title></head>
    
    # Get the name of title's parent tag
    print(soup.title.parent.name)
    # head
    
    # Get the first p tag
    print(soup.p)
    # <p class="title"><b>The Dormouse's story</b></p>
    
    # Get the class attribute of the p tag
    print(soup.p['class'])
    # ['title']    # class is a multi-valued attribute, so a list is returned
    
    # Get all a tags
    print(soup.find_all('a'))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    # Get the tag with id='link3'
    print(soup.find(id="link3"))
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    
    # Get the href of every a tag
    for link in soup.find_all('a'):
        print(link.get('href'))
    
    #   http://example.com/elsie
    #   http://example.com/lacie
    #   http://example.com/tillie
    
    # Get all the text in the document
    print(soup.get_text())
    
    # Output (blank lines trimmed; exact whitespace follows the source markup):
    # The Dormouse's story
    #
    # The Dormouse's story
    # Once upon a time there were three little sisters; and their names were
    #    Elsie,
    #    Lacie and
    #    Tillie;
    #    and they lived at the bottom of a well.
    # ...
    
    

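    1. Types of objects

    BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.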
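The four object types can be seen side by side in one short script. This is a minimal sketch; the sample markup and the variable names (`type_demo`, `kinds`, and so on) are assumptions made for illustration and are not part of the original examples.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

sample = "<p class='title'><b>Bold text</b><!--a note--></p>"
type_demo = BeautifulSoup(sample, "html.parser")

b_tag = type_demo.b                      # a Tag
text_node = type_demo.b.string           # a NavigableString
comment_node = type_demo.p.contents[1]   # a Comment (second child of <p>)

kinds = [type(node).__name__
         for node in (type_demo, b_tag, text_node, comment_node)]
print(kinds)
# ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']

# Comment is a subclass of NavigableString, and BeautifulSoup is a subclass of Tag.
print(isinstance(comment_node, NavigableString), isinstance(type_demo, Tag))
# True True
```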

    1.1 Tag: a Tag object corresponds to a tag in the original XML or HTML document
    print(soup.title)
    # <title>The Dormouse's story</title>
    
    print(type(soup.title))
    # <class 'bs4.element.Tag'>
    

    A Tag has two important attributes: name and attrs

    name: the name of the tag

    print(soup.title.name)
    # title
    

    attrs: the tag's attributes, as a dictionary

    print(soup.p.attrs)
    # {'class': ['title']}
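Attributes can also be read and changed with dictionary-style access on the tag itself. This is a minimal sketch; the sample link and the names `attr_demo` and `link` are made up for illustration.

```python
from bs4 import BeautifulSoup

attr_demo = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
link = attr_demo.a

print(link["id"])          # link1
print(link["class"])       # ['sister', 'big']  (multi-valued attribute)
print(link.get("title"))   # None, because .get() does not raise for a missing attribute

# Assigning to a key updates the tag's attribute dictionary.
link["id"] = "first"
print(link.attrs["id"])    # first
```

Using `link["title"]` instead of `link.get("title")` would raise KeyError, so `.get()` is the safer choice when an attribute may be absent.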
    
    1.2 NavigableString: the text content of a tag
    print(soup.p.string)
    # The Dormouse's story
    
    print(type(soup.p.string))
    # <class 'bs4.element.NavigableString'>
    
    1.3 BeautifulSoup: represents the document as a whole; most of the time it can be treated as a special Tag
    print(soup.name)
    # [document]
    
    print(type(soup))
    # <class 'bs4.BeautifulSoup'>
    
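    1.4 Comment: a special type of NavigableString; its printed content does not include the comment delimiters.
    markup = '<p><!--Hello--></p>'
    # Use a separate variable so the soup built from html_doc stays available below.
    # The 'lxml' parser requires the lxml package (pip install lxml).
    comment_soup = BeautifulSoup(markup, 'lxml')
    
    print(comment_soup.p.string)
    # Hello
    
    print(type(comment_soup.p.string))
    # <class 'bs4.element.Comment'>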
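Because a Comment prints like ordinary text, it is easy to mistake one for page content. The sketch below shows one way to tell them apart with isinstance; the sample markup and the names `note_demo` and `comments` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment

note_demo = BeautifulSoup("<p>visible<!--hidden note--></p>", "html.parser")

# Collect only the comment nodes by filtering strings on their type.
comments = note_demo.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
# ['hidden note']

# get_text() skips comments, so only the ordinary text comes back.
print(note_demo.p.get_text())
# visible
```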
    

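    2. Traversing the document tree

    2.1 Child nodes: the .contents and .children attributes

    A tag's .contents attribute returns the tag's child nodes as a list:

    print(soup.p.contents)
    # [<b>The Dormouse's story</b>]     # only one child node
    
    print(soup.p.contents[0])
    # <b>The Dormouse's story</b>       # the first child; indexing an empty list raises IndexError
    

    A tag's .children attribute returns an iterator over the tag's direct children:

    print(type(soup.p.children))
    # <class 'list_iterator'>           # the exact iterator type may vary between bs4 versions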
    
    for child in soup.p.children:
        print(child)    # <b>The Dormouse's story</b>
    
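    2.2 All descendants: the .descendants attribute

    The .descendants attribute iterates recursively over all of a tag's descendants. It is used like .children, but also visits nested nodes.

    for tag in soup.p.descendants:
        print(tag)
    
    # Output:
    # <b>The Dormouse's story</b>
    # The Dormouse's story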
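The difference between .children and .descendants is easier to see on a nested fragment. This is a minimal sketch; the markup and the names `tree_demo`, `direct`, and `nested` are made up for illustration.

```python
from bs4 import BeautifulSoup

tree_demo = BeautifulSoup(
    "<div><p>one <b>two</b></p><p>three</p></div>", "html.parser"
)
div = tree_demo.div

direct = list(div.children)      # only the two <p> tags
nested = list(div.descendants)   # every tag and string below <div>

print(len(direct))
# 2
print(len(nested))
# 6  (p, 'one ', b, 'two', p, 'three')

# Strings have no tag name, so filter on .name to keep tags only.
print([node.name for node in nested if node.name is not None])
# ['p', 'b', 'p']
```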
    

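    3. Searching the document tree

    3.1 find_all(name, attrs, recursive, text, limit, **kwargs)

    Parameters of find_all():

    name: find tags whose name matches name (accepts a string, a regular expression, or a list)
    
    attrs: a dictionary of attributes the tag must have
    
    recursive: whether to search all descendants (default True) or only direct children (False)
    
    text: match on the tag's text; called string in bs4 4.4.0 and later, with text kept as the older name
    
    limit: maximum number of results to return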
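The recursive and limit parameters are the least obvious of these, so here is a minimal sketch of their effect. The markup and the names `search_demo`, `all_p`, `direct_p`, and `first_only` are made up for illustration.

```python
from bs4 import BeautifulSoup

search_demo = BeautifulSoup(
    "<div><p>outer</p><section><p>inner</p></section></div>", "html.parser"
)
div = search_demo.div

all_p = div.find_all("p")                      # searches every descendant
direct_p = div.find_all("p", recursive=False)  # searches direct children only
first_only = div.find_all("p", limit=1)        # stops after the first match

print([p.get_text() for p in all_p])
# ['outer', 'inner']
print([p.get_text() for p in direct_p])
# ['outer']
print(len(first_only))
# 1
```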
    

    3.1.1 Passing a string as name

    print(soup.find_all('p', attrs = {'class': 'title'}))
    # [<p class="title"><b>The Dormouse's story</b></p>]
    
    print(soup.find_all('p', string='...'))   # string= is the newer name for text= (bs4 4.4.0 and later)
    # [<p class="story">...</p>]
    
    print(soup.find_all('a', limit=2))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
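Attribute filters can be written either as an attrs dictionary or as keyword arguments. Since class is a reserved word in Python, the keyword form is spelled class_. This is a minimal sketch on a made-up fragment; the names `filter_demo`, `by_attrs`, `by_keyword`, and `by_id` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

snippet = (
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="brother" id="link3">Tom</a>'
)
filter_demo = BeautifulSoup(snippet, "html.parser")

by_attrs = filter_demo.find_all("a", attrs={"class": "sister"})
by_keyword = filter_demo.find_all("a", class_="sister")
by_id = filter_demo.find_all(id="link3")

print(by_attrs == by_keyword)
# True
print([tag["id"] for tag in by_keyword])
# ['link1', 'link2']
print(by_id[0].get_text())
# Tom
```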
    
    

    3.1.2 Passing a regular expression (re) as name

    import re

    for tag in soup.find_all(re.compile('^b')):
        print(tag.name)
    
    # body
    # b
    

    3.1.3 Passing a list as name

    for tag in soup.find_all(['body','b']):
        print(tag.name)
    
    # body
    # b
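A regular expression and a list can describe overlapping sets of tags; which one reads better depends on whether the names follow a pattern. The sketch below compares the two on a made-up fragment of headings (the names `heading_demo`, `by_regex`, and `by_list` are assumptions for illustration).

```python
import re

from bs4 import BeautifulSoup

heading_demo = BeautifulSoup("<h1>A</h1><p>x</p><h2>B</h2><h3>C</h3>", "html.parser")

# A pattern covers every heading level without listing each one.
by_regex = [tag.name for tag in heading_demo.find_all(re.compile(r"^h[1-6]$"))]

# A list matches exactly the names given, nothing more.
by_list = [tag.name for tag in heading_demo.find_all(["h1", "h2"])]

print(by_regex)
# ['h1', 'h2', 'h3']
print(by_list)
# ['h1', 'h2']
```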
    
    3.2 Searching with CSS selectors

    3.2.1 Selecting by tag name

    print(soup.select('title'))
    # [<title>The Dormouse's story</title>]
    

    3.2.2 Selecting by class name

    print(soup.select('.sister'))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    

    3.2.3 Selecting by id

    print(soup.select("#link1"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
    

    3.2.4 Combining selectors

    print(soup.select('#link1,title'))
    # [<title>The Dormouse's story</title>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
    

    3.2.5 Selecting by attribute

    print(soup.select('a[class="sister"]'))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    

    3.2.6 Getting the text content

    for tag in soup.select('a'):
        print(tag.get_text())
    
    # Elsie
    # Lacie
    # Tillie
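select() always returns a list. When only the first match is needed, select_one() returns a single tag, or None if nothing matches. The sketch below also shows a child combinator and an attribute suffix selector; the fragment and the names `css_demo`, `first`, `hrefs`, and `missing` are made up for illustration.

```python
from bs4 import BeautifulSoup

css_demo = BeautifulSoup(
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '</p>',
    "html.parser",
)

# "p.story > a" matches <a> tags that are direct children of <p class="story">.
first = css_demo.select_one("p.story > a")
print(first["id"])
# link1

# [href$="lacie"] matches href values that end with "lacie".
hrefs = [a["href"] for a in css_demo.select('a[href$="lacie"]')]
print(hrefs)
# ['http://example.com/lacie']

# No match yields None rather than an empty list.
missing = css_demo.select_one("#nope")
print(missing)
# None
```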
    
