Web Scraping Notes (4): BeautifulSoup

Author: Haohao_95 | Published 2018-07-11 21:18

    BeautifulSoup is mainly used to extract elements from web pages.

    1. Parser types

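    | Parser | Usage | Advantages | Disadvantages |
    | --- | --- | --- | --- |
    | Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
    | lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library |
    | lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser listed here that supports XML | Requires an external C library |
    | html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses documents the way a browser does; produces HTML5-style output | Slow; requires an external Python package |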
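
Because lxml and html5lib are optional installs, a script can try the parsers in order and fall back to the built-in one. The sketch below is only an illustration: make_soup and its preferred tuple are hypothetical names, not part of BeautifulSoup. FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hello</p>")
print(used)           # name of the parser that was actually used
print(soup.p.string)  # hello
```

Since html.parser ships with Python, the final fallback should always succeed.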

    2. Basic usage

    1) The parser completes and repairs broken markup; prettify() prints the repaired document with indentation

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())  # print the parsed (and repaired) document with indentation
    print(soup.title.string)  # get the text inside the title tag
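
The repair is done by the parser when the soup is built; prettify() only re-indents the result. A minimal sketch, assuming the built-in html.parser so that no extra install is needed (broken is a made-up fragment, not taken from the sample above):

```python
from bs4 import BeautifulSoup

# A made-up fragment with two tags left open.
broken = "<p>unclosed <b>bold"
soup = BeautifulSoup(broken, "html.parser")

# The parser has already closed both tags by the time the soup exists.
print(str(soup))      # <p>unclosed <b>bold</b></p>
print(soup.b.string)  # bold

# prettify() returns the same tree as an indented string.
print(soup.prettify())
```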
    

    2) Tag selectors

    a) Create a BeautifulSoup object

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html,'lxml')
    

    b) Select tags

    print(soup.title)
    print(type(soup.title))
    print(soup.head)
    print(soup.p)
    
    """
    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    <head><title>The Dormouse's story</title></head>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    """
    

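    c) Read a tag's name, attributes, and text

    # Each of these returns only the first matching element.
    # 1. Get the tag name
    print(soup.title.name)
    # title
    # 2. Get a tag attribute
    print(soup.p.attrs['name'])
    print(soup.p['name'])
    """
    dromouse
    dromouse
    """
    # 3. Get the text content
    print(soup.p.string)
    # The Dormouse's story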
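
Two details are easy to trip over here: .string is None when a tag has more than one child, and indexing a missing attribute raises KeyError. A short sketch, assuming html.parser and a made-up two-paragraph snippet:

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="title big" name="dromouse"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a href="#">upon</a> a time</p>'
)
soup = BeautifulSoup(snippet, "html.parser")
first, second = soup.find_all("p")

print(first["class"])     # ['title', 'big']  (class is multi-valued, so a list)
print(first.get("id"))    # None  (get() avoids KeyError for a missing attribute)
print(first.string)       # The Dormouse's story
print(second.string)      # None  (text + <a> + text: more than one child)
print(second.get_text())  # Once upon a time
```

When a tag mixes text and child tags, get_text() is usually the safer way to read its text.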
    

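    3) Child, descendant, parent, and sibling nodes

    Note: the outputs in this section were produced from a variant of the sample document in which the first <p> is the story paragraph and the first link wraps <span>Elsie</span>; the html string defined in the prettify() example would print different results.

    a) contents returns the direct child nodes as a list (each direct child is one list item)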

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)
    
    """
    ['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
    """
    

    b) children returns an iterator over the direct child nodes (it does not descend further)

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)
    
    """
    <list_iterator object at 0x7ff476c387f0>
    0 
                Once upon a time there were three little sisters; and their names were
                
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 
    
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4  
                and
                
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 
                and they lived at the bottom of a well.
    """
    

    c) descendants iterates over all descendant nodes (children, grandchildren, and so on)

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):
        print(i, child)
    """
    <generator object descendants at 0x10650e678>
    0 
                Once upon a time there were three little sisters; and their names were
                
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 
    
    3 <span>Elsie</span>
    4 Elsie
    5 
    
    6 
    
    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9  
                and
                
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12 
    """
    
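
The difference between contents, children, and descendants is easiest to see by counting nodes. A minimal sketch on a made-up one-line fragment, assuming html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = '<p>one <a href="#"><span>two</span></a> three</p>'
p = BeautifulSoup(snippet, "html.parser").p

# contents and children cover the same direct children: 'one ', <a>, ' three'
print(len(p.contents))        # 3
print(len(list(p.children)))  # 3

# descendants also walks into <a>: 'one ', <a>, <span>, 'two', ' three'
print(len(list(p.descendants)))  # 5

# Keep only Tag objects to skip the text and whitespace nodes.
print([node.name for node in p.children if isinstance(node, Tag)])     # ['a']
print([node.name for node in p.descendants if isinstance(node, Tag)])  # ['a', 'span']
```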

    d) parent returns the direct parent; parents iterates over all ancestors

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)
    
    """
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.parents)))
    
    """
    [(0, <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>), (1, <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body>), (2, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body></html>), (3, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body></html>)]
    """
    
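
Printing whole ancestors gets long quickly; printing only their names shows the chain more clearly. It also explains why the last two entries of the parents output look identical: the final ancestor is the BeautifulSoup object itself, whose name is [document]. A sketch on a made-up fragment, assuming html.parser:

```python
from bs4 import BeautifulSoup

snippet = "<html><body><p>See <a href='#'>this</a></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

print(soup.a.parent.name)                    # p
print([tag.name for tag in soup.a.parents])  # ['p', 'body', 'html', '[document]']
```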

    e) next_siblings and previous_siblings iterate over the sibling nodes

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.next_siblings)))
    print(list(enumerate(soup.a.previous_siblings)))
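
Sibling iterators include text nodes as well as tags, and previous_siblings walks backwards starting from the nearest sibling. A sketch on a made-up fragment, assuming html.parser; find_next_sibling() is the bs4 method for jumping straight to the next matching tag:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(snippet, "html.parser")
first = soup.a
last = soup.find(id="link3")

print(list(first.next_siblings))
# [', ', <a id="link2">Lacie</a>, ' and ', <a id="link3">Tillie</a>, '.']

# Filter on Tag to drop the text nodes between the links.
next_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
prev_ids = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]
print(next_ids)  # ['link2', 'link3']
print(prev_ids)  # ['link2', 'link1']

print(first.find_next_sibling("a")["id"])  # link2
```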
    

    4) Standard selectors

    find_all(name, attrs, recursive, string, limit, **kwargs)  (string was called text in older releases)
    
    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1" name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    

    Select elements by tag name:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))
    
    """
    Each call returns a list of results:
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    """
    

    Select elements by attribute with attrs:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))
    print(soup.find_all(attrs={'name': 'elements'}))
    

    Select elements by id or class with keyword arguments:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html,'lxml')
    print(soup.find_all(id='list-1'))
    print(soup.find_all(class_='element'))  # class is a reserved word in Python, so the keyword is class_
    
    """
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    """
    

    Match on text content:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))  # text= is the older name of this argument; newer releases call it string=
    
    #['Foo', 'Foo']
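
find_all() accepts a few more filters than the examples above use. The sketch below assumes html.parser and a trimmed copy of the list markup; string= is the current name of the text= argument, and limit and recursive are standard find_all() parameters:

```python
import re
from bs4 import BeautifulSoup

snippet = """
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li><li class="element">Bar</li>
</ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# limit stops after the first N matches.
print(len(soup.find_all("li", limit=2)))  # 2

# A compiled regex is matched against each tag's string.
bars = [li.string for li in soup.find_all("li", string=re.compile("^Ba"))]
print(bars)  # ['Bar', 'Bar']

# class_ matches any one of a tag's classes.
small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
print(small)  # ['list-2']

# recursive=False looks only at direct children; the <li> tags sit one level deeper.
direct = soup.div.find_all("li", recursive=False)
print(direct)  # []

# find() returns the first match directly instead of a list.
print(soup.find("li").string)  # Foo
```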
    

    5) CSS selectors

    Pass a CSS selector string to select():

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.panel .panel-heading'))  # elements with class="panel-heading" inside class="panel"
    print(soup.select('ul li'))  # li tags inside ul tags
    print(soup.select('#list-2 .element'))  # elements with class="element" inside id="list-2"
    print(type(soup.select('ul')[0]))
    

    Get the text inside tags:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())
    """
    Foo
    Bar
    Jay
    Foo
    Bar
    """
    
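
select() also understands combinators such as >, and select_one() returns just the first match. A sketch assuming html.parser and a trimmed copy of the list markup (CSS selection in current bs4 releases relies on the soupsieve package, which is normally installed alongside bs4):

```python
from bs4 import BeautifulSoup

snippet = """
<ul class="list" id="list-1">
<li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li><li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

# '>' restricts the match to direct children of the first list.
first_list = [li.get_text() for li in soup.select("ul#list-1 > li")]
print(first_list)  # ['Foo', 'Bar', 'Jay']

# select_one() returns a single Tag (or None) rather than a list.
print(soup.select_one("#list-2 .element").get_text())  # Foo

# Results are ordinary Tag objects, so attributes are read the usual way.
for ul in soup.select("ul"):
    print(ul["id"], ul["class"])
# list-1 ['list']
# list-2 ['list', 'list-small']
```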
