美文网首页LovingPython
Python之BeautifulSoup

Python之BeautifulSoup

作者: 月蚀様 | 来源:发表于2019-06-01 00:11 被阅读0次

    BeautifulSoup是什么

    一个灵活方便的网页解析库,处理高效,支持多种解析器

    利用他不用编写正则表达式即可方便地实现网页信息的提取


    安装

    pip install beautifulsoup4
    

    支持的解析库

    解析器 使用方法 优势 劣势
    Python标准库 BeautifulSoup(markup, "html.parser") 内置库,速度一般,容错率不错 python老版本容错率差
    lxml HTML解析库 BeautifulSoup(markup, "lxml") 速度快,容错率强 需要安装C语言库
    lxml XML解析库 BeautifulSoup(markup, "xml") 速度快,唯一支持的XML解析器 需要安装C语言库
    html5lib BeautifulSoup(markup, "html5lib") 最好的容错性,以浏览器的方式解析文档,生成HTML5格式的文档 速度慢,不依赖外部扩展

    基本使用

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.prettify())
    print(soup.title.string)
    

    标签选择器

    选择元素

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    
    print(type(soup.title))
    print(soup.title)
    print(soup.head)
    print(soup.p)  # 匹配第一个结果
    

    获取名称

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    print(soup.title.name)
    

    获取属性

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.p.attrs['name'])
    print(soup.p['name'])
    

    获取内容

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    print(soup.p.string)
    

    嵌套选择

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    print(soup.head.title.string)
    

    子节点 and 子孙节点

    用 contents 以返回列表的形式获取所有子节点

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    print(soup.p.contents)
    

    用 children 以返回迭代器的形式获取所有子节点

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    print(soup.p.children)
    
    for i, child in enumerate(soup.p.children):
      print(i, child)
    

    用 descendants 以返回列表的形式获取所有子孙节点

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "lxml")
    print(soup.p.descendants)
    
    for i, child in enumerate(soup.p.descendants):
      print(i, descendant)
    

    父节点 and 祖先节点

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)
    print(list(soup.a.parents))
    

    兄弟节点

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.net_siblings)))
    print(list(enumerate(soup.a.previous_siblings)))
    

    标准选择器

    find_all(name, attrs, recursive, text, **kwargs)
    可以根据标签、属性、内容查找文档

    name

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('li'))
    print(type(soup.find_all('li')[0]))
    

    attrs

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': '...'}))
    print(soup.find_all(attrs={'name': '...'}))
    
    print(soup.find_all(id='...')
    print(soup.find_all(class_='...')
    

    text

    只匹配,不会返回匹配的内容


    其他

    find(name, attrs, recursive, text, **kwargs)

    find_parents() and find_parent()

    find_next_siblings() and find_next_sibling()

    find_previous_siblings() and find_previous_sibling()

    find_all_next() and find_next()

    find_all_previous() and find_previous()

    CSS选择器

    select()

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    soup.select('.class_content .blabla2')
    soup.select('tag_name li')
    soup.select('#id_content .element')
    

    获取属性

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
      print(ul['id'])
      print(ul.attrs['id'])
    

    获取内容

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
      print(li.get_text())
    

    相关文章

      网友评论

        本文标题:Python之BeautifulSoup

        本文链接:https://www.haomeiwen.com/subject/fdnntctx.html