Web Scraping 041: Basic BeautifulSoup (bs4) Usage

Author: 为宇绸缪 | Published: 2023-03-01 22:30

    The HTML snippet below is used as the running example throughout. It is an excerpt from Alice's Adventures in Wonderland.

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    

    Parsing this snippet with BeautifulSoup yields a BeautifulSoup object, which can print the document as a consistently indented tree:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'lxml')
    # Pretty-print the parsed HTML
    print(soup.prettify())
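
The 'lxml' parser used above is a third-party package that has to be installed separately, while 'html.parser' ships with Python. The sketch below tries lxml first and falls back to the built-in parser. The helper name `make_soup` and the shortened `html_doc` are illustrative assumptions, not part of the original example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened stand-in for the html_doc string used in this article.
html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"


def make_soup(markup):
    """Hypothetical helper: parse with lxml if installed, else html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html_doc)
print(soup.title.string)
```

The two parsers can repair broken markup differently, so pinning one parser per project keeps results reproducible.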
    

    Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    

    A few simple ways to navigate the parsed structure:

    soup.title  # the <title> tag
    # <title>The Dormouse's story</title>
    
    soup.title.name   # the tag's name
    # 'title'
    
    soup.title.string   # the text inside <title>
    # "The Dormouse's story"
    
    soup.title.parent  # the parent tag
    # <head><title>The Dormouse's story</title></head>
    
    soup.title.parent.name  # the parent tag's name
    # 'head'
    
    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>
    
    soup.p['class']  # value of the class attribute (multi-valued, so a list)
    # ['title']
    
    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    
    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    soup.find(id="link3")  # the tag whose id is link3
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
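
Attribute access has a couple of details worth illustrating: `class` is a multi-valued attribute and comes back as a list, indexing a missing attribute raises `KeyError`, and `.get()` returns `None` instead. The snippet below is a minimal sketch on a made-up one-line document (the `html` string is an assumption for illustration only).

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the article's html_doc.
html = '<p class="story intro" id="first">Hello</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

classes = p["class"]        # multi-valued attribute: a list of class names
element_id = p["id"]        # single-valued attribute: a plain string
missing = p.get("title")    # .get() returns None when the attribute is absent
all_attrs = p.attrs         # dict of every attribute on the tag

print(classes, element_id, missing, all_attrs)
```

Using `.get()` is the safer choice when scraping pages where an attribute may not be present on every tag.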
    

    Extract the URLs of all <a> tags in the document:

    for link in soup.find_all('a'):
        print(link.get('href'))
        # http://example.com/elsie
        # http://example.com/lacie
        # http://example.com/tillie
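
Real pages often contain relative links or anchors without an `href`. As a hedged sketch, the example below keeps only anchors that have an `href`, resolves them against a base URL with `urljoin`, and shows the equivalent CSS-selector lookup via `select`. The `html` string and `base_url` are assumptions made up for this illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Illustrative markup: one relative link, one absolute link, one anchor without href.
html = """
<p class="story">
<a href="/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
base_url = "http://example.com/"  # assumed address of the page being scraped

soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually carry an href attribute.
links = [
    (a.get_text(strip=True), urljoin(base_url, a["href"]))
    for a in soup.find_all("a", href=True)
]

# CSS selectors are an alternative to find_all for class-based lookups.
ids = [a["id"] for a in soup.select("a.sister")]

print(links)
print(ids)
```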
    

    Extract all text content from the document:

    print(soup.get_text())
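
By default `get_text()` concatenates every string exactly as it appears, including the whitespace between tags. The sketch below shows two common ways to get cleaner text: the `separator` and `strip` arguments, and the `stripped_strings` generator. The tiny `html` string is an assumption chosen so the results are easy to follow.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the article's html_doc.
html = "<p>Hello <b>world</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.get_text()                            # strings joined as-is
clean_text = soup.get_text(separator=" ", strip=True)  # each string stripped, then joined
pieces = list(soup.stripped_strings)                  # the individual stripped strings

print(repr(raw_text))
print(repr(clean_text))
print(pieces)
```

Note that stripping and re-joining can introduce a space before punctuation, as in `clean_text` here, so the right option depends on how the text will be used afterwards.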
    
