[spider] Extracting Web Page Content with Bs4

Author: Franckisses | Published: 2019-03-08 09:16

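    Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save hours or even days of work.
    This post gives a short introduction to using Beautiful Soup.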

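    Install the library (the package is published on PyPI as beautifulsoup4 and imported as bs4):

    pip install beautifulsoup4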
    
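    bs4 parsers: usage, strengths, and weaknesses

    Only the lxml parser is covered here, because it is easy to use and fast. If you are working in an Anaconda environment, lxml is already installed and can be used directly; otherwise install it with pip install lxml.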
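
The parser is selected by the second argument to BeautifulSoup. As a minimal sketch, assuming lxml may not be installed on every machine, the hypothetical helper make_soup below (not part of bs4) tries lxml first and falls back to Python's built-in html.parser. FeatureNotFound is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with `preferred`, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # html.parser ships with Python, so this branch needs no extra install.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><body><p>Hello</p></body></html>")
print(soup.p.get_text())
```

The two parsers can build slightly different trees for broken or partial markup, so it is probably best to settle on one parser per project.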

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    

    1. Initialization. The constructor takes the markup as a string.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, "lxml")
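
The constructor also accepts an open file handle, and it repairs unclosed tags while parsing. A small sketch of both behaviours follows; io.StringIO stands in for a real file, and the built-in "html.parser" is used so that the snippet runs without lxml (both are choices made for this example, not taken from the article).

```python
import io

from bs4 import BeautifulSoup

# Parse from a string; the parser closes the dangling <b> and <p> tags.
from_string = BeautifulSoup("<p>unclosed <b>bold", "html.parser")
print(from_string)

# Parse from a file-like object; StringIO stands in for an opened HTML file.
from_file = BeautifulSoup(io.StringIO("<p>from a file</p>"), "html.parser")
print(from_file.p.string)
```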
    

    2. Pretty-printing the markup

    print(soup.prettify())
    # prettify() re-indents the markup so that the nesting of tags is easy to see.
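
To see what prettify() does on a tiny input, the sketch below parses one nested fragment and prints the result. The fragment and the built-in "html.parser" are choices made for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")

pretty = soup.prettify()
print(pretty)

# prettify() returns a str in which each tag and each string sits on its
# own line, indented according to its depth in the tree.
lines = pretty.splitlines()
print(len(lines), "lines")
```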
    

    3. Tag objects: soup.a or soup.p returns the first <a> or <p> tag found in the document

    print(type(soup.a))
    # The search stops at the first match
    print(soup.a)
    # Result:
    <class 'bs4.element.Tag'>
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
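
Attribute-style access is shorthand for find(). The following sketch, on a made-up two-link fragment parsed with the built-in "html.parser", shows that soup.a and soup.find("a") give the same element and that a tag missing from the document yields None rather than an error.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                    # shorthand for soup.find("a")
print(first.name, first["id"])    # tag name and one attribute
print(first == soup.find("a"))    # the same element either way
print(soup.img is None)           # an absent tag yields None
```

Because a missing tag is None, chained access such as soup.img.attrs fails with AttributeError, so checking for None first is a sensible habit when scraping pages you do not control.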
    

    4. Getting the text content:

    title_text = soup.title.string
    title_text1 = soup.title.get_text()
    title_text2 = soup.title.text
    
    print(title_text)
    print(title_text1)
    print(title_text2)
    # Result:
    The Dormouse's story
    The Dormouse's story
    The Dormouse's story
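
The three accessors agree on <title> because it holds a single string. They diverge once a tag has several children, as this sketch on a made-up fragment (parsed with the built-in "html.parser") illustrates.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

# <b> has exactly one child string, so .string returns it.
bold_text = soup.b.string

# <p> has three children (text, <b>, text), so .string is None ...
para_string = soup.p.string

# ... while get_text() joins every string beneath the tag.
para_text = soup.p.get_text()
# A separator plus strip=True trims each piece before joining.
para_spaced = soup.p.get_text(" ", strip=True)

print(repr(bold_text), repr(para_string), repr(para_text), repr(para_spaced))
```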
    

    5. Selecting attributes

    b = soup.a.attrs
    print(b['href'])
    print(b['class'][0])
    print(b['id'])
    # Result:
    http://example.com/elsie
    sister
    link1
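
Attributes can also be read straight from the tag without going through attrs. A short sketch follows, assuming a single link whose class attribute has a second, made-up value ("big") to show why class comes back as a list.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]            # same value as link.attrs["href"]
classes = link["class"]        # multi-valued attribute, returned as a list
title = link.get("title")      # missing attribute: get() returns None
has_id = link.has_attr("id")

print(href, classes, title, has_id)

try:
    link["title"]              # indexing a missing attribute raises KeyError
except KeyError:
    print("no title attribute")
```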
    

    6. Selecting multiple elements

    # find_all returns a list-like result containing every <a> tag
    c = soup.find_all('a')
    print(c)
    # Result:
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    c = soup.find_all(name='a',attrs={'href':'http://example.com/tillie'})
    print(c)
    # Result:
    [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
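
find_all takes a few more filters that are handy in practice. The sketch below reuses the three sister links in a trimmed-down fragment (parsed with the built-in "html.parser") and shows class_, limit, and the single-result find().

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = [a["href"] for a in soup.find_all("a")]   # one value per match
by_class = soup.find_all("a", class_="sister")    # class_ avoids the Python keyword
first_two = soup.find_all("a", limit=2)           # stop after two matches
lacie = soup.find("a", id="link2")                # find() returns one tag or None

print(hrefs)
print(len(by_class), len(first_two), lacie.get_text())
```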
    

    7. The contents attribute

    # Returns a list of the element's direct children, both tags and text nodes.
    print(soup.p.contents)
    # Result:
    [<b>The Dormouse's story</b>]
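
The first <p> has a single child, which hides the fact that contents mixes text nodes and tags. A sketch on a made-up paragraph, parsed with the built-in "html.parser", makes the mix visible and shows one way to separate the two kinds.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<p>Once <a>Elsie</a> and <a>Lacie</a>.</p>", "html.parser")

children = soup.p.contents          # direct children only, as a list
kinds = [type(child).__name__ for child in children]
print(kinds)

# Separate the tag children from the text children by type.
tags_only = [child for child in children if isinstance(child, Tag)]
texts_only = [str(child) for child in children if isinstance(child, NavigableString)]
print([t.get_text() for t in tags_only])
print(texts_only)
```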
    

    8. Parent and ancestor nodes: parent, parents

    p_parent = soup.p.parent
    print(p_parent)
    # Result:
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>
    
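    p_parents = soup.p.parents
    # parents is a lazy iterator, so this prints the iterator object, not the tags
    print(p_parents)
    # Consume the iterator to list every ancestor together with its index
    print(list(enumerate(p_parents)))
    
    # Finding sibling nodes
    # next_sibling: the next sibling node
    # previous_sibling: the previous sibling node
    # next_siblings: all following sibling nodes
    # previous_siblings: all preceding sibling nodes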
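
The ancestor and sibling attributes listed above can be tried on a small document. This is a sketch under stated assumptions: the two-link markup is made up, and the built-in "html.parser" is used so that it runs without lxml. Note that a sibling is often a text node, which is why find_next_sibling() is shown alongside next_sibling.

```python
from bs4 import BeautifulSoup

html = (
    "<html><body><p>"
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")

# .parents walks upward and ends at the BeautifulSoup object itself.
ancestor_names = [tag.name for tag in first.parents]
print(ancestor_names)

# .next_sibling is the very next node, which here is the text ", ".
next_node = first.next_sibling
# find_next_sibling() skips text nodes and returns the next matching tag.
next_tag = first.find_next_sibling("a")
print(repr(next_node), next_tag["id"])

# Seen from the second link, the first one is the previous tag sibling.
second = soup.find(id="link2")
previous_tag = second.find_previous_sibling("a")
print(previous_tag["id"])
```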
    

    9. CSS selectors

    print(soup.select('p a'))
    # Result:
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    print(soup.select('.sister'))
    # Result:
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
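
A few more selector forms are worth knowing. The sketch below, again on a trimmed-down copy of the sister links parsed with the built-in "html.parser", shows an id selector with select_one(), a child combinator, and an attribute-suffix match.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select_one("#link2")               # first match, or None
children = soup.select("p.story > a")           # direct <a> children of p.story
by_suffix = soup.select('a[href$="tillie"]')    # href ends with "tillie"

print(by_id.get_text())
print([a["id"] for a in children])
print([a.get_text() for a in by_suffix])
```

select() always returns a list, even for a single match, while select_one() returns the element itself, which mirrors the relationship between find_all() and find().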
    


          Article link: https://www.haomeiwen.com/subject/dyzalqtx.html