BeautifulSoup / Python 3 basic usage

Author: 疯帮主 | Published: 2018-11-10 18:08

    Getting started

    import bs4

    # This HTML is deliberately incomplete: some tags are never closed
    html = """
    <!DOCTYPE html>
    <html lang="zh-CN">
    <head>
        <meta charset="utf-8">
        <title>迅影网,迅雷电影下载,最新电影下载,高清电影下载
        <link rel="icon" href="/static/favicon.ico">
        <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
    </head>
    <body>
    <header>
    <div class="header-box">
        <div class="container">
            <span class="header-help">欢迎来到迅影网,一起分享电影给我们带来的快乐。</span>
            <div class="pull-right">
                <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
                <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面
            </div>
    
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
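    # Pretty-print the parsed tree. The parser has to guess where unclosed tags end,
    # so the repaired structure may not match what the page author intended
    print(soup.prettify())
    # Prints None: the unclosed <title> absorbed the <link> tags, so it has several children
    print(soup.title.string)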
    print(soup.span.string)
    

    Output:

    <!DOCTYPE html>
    <html lang="zh-CN">
     <head>
      <meta charset="utf-8"/>
      <title>
       迅影网,迅雷电影下载,最新电影下载,高清电影下载
       <link href="/static/favicon.ico" rel="icon"/>
       <link href="/static/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
      </title>
     </head>
     <body>
      <header>
       <div class="header-box">
        <div class="container">
         <span class="header-help">
          欢迎来到迅影网,一起分享电影给我们带来的快乐。
         </span>
         <div class="pull-right">
          <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">
           Ctrl+D 加入收藏夹
          </a>
          -
          <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">
           保存到桌面
          </a>
         </div>
        </div>
       </div>
      </header>
     </body>
    </html>
    None
    欢迎来到迅影网,一起分享电影给我们带来的快乐。
    

    Tag selectors

    Selecting elements

    html = """
    <!DOCTYPE html>
    <html lang="zh-CN">
    <head>
        <meta charset="utf-8">
        <title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
        <link rel="icon" href="/static/favicon.ico">
        <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
    </head>
    <body>
    <header>
    <div class="header-box">
        <div class="container">
            <span class="header-help">欢迎来到迅影网,一起分享电影给我们带来的快乐。</span>
            <div class="pull-right">
                <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
                <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
            </div>
    
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(type(soup.title))
    print(soup.title)
    print(soup.head)
    print(soup.link)
    

    Output:

    <class 'bs4.element.Tag'>
    <title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
    <head>
    <meta charset="utf-8"/>
    <title>迅影网,迅雷电影下载,最新电影下载,高清电影下载</title>
    <link href="/static/favicon.ico" rel="icon"/>
    <link href="/static/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
    </head>
    <link href="/static/favicon.ico" rel="icon"/>
    

    When several tags share the same name, only the first one is returned.
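A minimal sketch of the difference between attribute-style access and find_all, using shortened markup. It assumes the built-in "html.parser" backend instead of lxml so it runs without extra packages; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<head>
  <link rel="icon" href="/static/favicon.ico">
  <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
</head>
"""
# Assumption: html.parser is used here only to avoid the lxml dependency
soup = BeautifulSoup(html, "html.parser")

# soup.link stops at the first <link> in document order
first_href = soup.link["href"]

# find_all walks the whole tree and returns every <link>
all_hrefs = [link["href"] for link in soup.find_all("link")]

print(first_href)
print(all_hrefs)
```

Use the attribute form when only the first occurrence matters, and find_all when every occurrence is needed.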

    Getting the tag name

    html = """
        <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.link.name)
    

    Output:

    link
    

    Getting attributes

    html = """
        <link rel="stylesheet" href="/static/bootstrap/css/bootstrap.min.css">
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.link['rel'])
    print(soup.link.attrs['rel'])
    

    Output:

    ['stylesheet']
    ['stylesheet']
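The output above is a list because rel is a multi-valued attribute. The following sketch contrasts it with a single-valued attribute and with a missing one. It assumes the built-in "html.parser" backend, and the href value is a made-up placeholder.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened <link> tag with a placeholder href
soup = BeautifulSoup('<link rel="stylesheet" href="/static/site.css">', "html.parser")
link = soup.link

rel_value = link["rel"]        # multi-valued attribute: returned as a list
href_value = link["href"]      # single-valued attribute: returned as a string
missing = link.get("media")    # .get() returns None instead of raising KeyError

print(rel_value, href_value, missing)
```

Indexing with link["media"] would raise KeyError here, so .get() is the safer choice when an attribute may be absent.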
    

    Getting text content

    html = """
    <div>
    <b>在这</b>
    </div>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.div.string)
    
    html = """
    <div><b>在这</b></div>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.div.string)
    

    Output:

    None
    在这
    

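    With the line breaks, the div has three children (a newline, the b tag, another newline), so .string returns None. .string only works when a tag has exactly one child.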
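When whitespace gets in the way of .string, get_text and stripped_strings are the usual alternatives. A small sketch, assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = """
<div>
<b>在这</b>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

raw = soup.div.string                       # None: <div> has three children
text = soup.div.get_text(strip=True)        # joins all text, stripping whitespace
pieces = list(soup.div.stripped_strings)    # each non-empty string, stripped

print(raw, text, pieces)
```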

    Nested selection

    html = """
    <div>
    <b>在这</b>
    </div>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.div.b.string)
    

    Output:

    在这
    

    Getting child nodes

    Using contents

    html = """
    <div class="pull-right">
                <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
                <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
            </div>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.div.contents)
    

    Output:

    ['\n', <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>, ' -\n            ', <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>, '\n']
    

    Each tag, and each run of text between tags, is a separate element of the list.
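Since the whitespace strings are usually noise, one common approach is to keep only the Tag objects. This is a sketch with shortened markup and placeholder link text, assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div class="pull-right">
<a href="javascript:;">Ctrl+D</a> -
<a href="/tools/savewebsite">Save</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# .contents mixes Tag objects with the text nodes between them
all_children = soup.div.contents

# Keep only real tags, dropping the newline and " -" text nodes
tags_only = [child for child in all_children if isinstance(child, Tag)]
tag_texts = [tag.get_text() for tag in tags_only]

print(len(all_children), tag_texts)
```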

    Using children

    html = """
    <div class="pull-right">
                <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
                <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
            </div>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.div.children)
    for i,child in enumerate(soup.div.children):
        print(i, child)
    

    Output:

    <list_iterator object at 0x000001EC5AC591D0>
    0 
    
    1 <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>
    2  -
                
    3 <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>
    4 
    

    children returns an iterator rather than a list.
    enumerate yields each index together with the corresponding item.
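If only the direct child tags are wanted, find_all with recursive=False is another option. A sketch under the assumption of simplified markup and the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = "<div><a>one</a> - <a><span>two</span></a></div>"
soup = BeautifulSoup(html, "html.parser")

# recursive=False restricts the search to direct children,
# so the nested <span> is not returned
direct = soup.div.find_all(recursive=False)
direct_names = [tag.name for tag in direct]

# Without recursive=False the nested <span> is included as well
nested_names = [tag.name for tag in soup.div.find_all()]

print(direct_names, nested_names)
```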

    Using descendants to get all descendant nodes

    html = """
    <div class="pull-right">
                <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
                <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">
                <span>
                <b>保存到桌面<b>
                </span>
                </a>
            </div>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.div.descendants)
    for i,child in enumerate(soup.div.descendants):
        print(i, child)
    

    Output:

    <generator object Tag.descendants at 0x000001EC5AC66A20>
    0 
    
    1 <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a>
    2 Ctrl+D 加入收藏夹
    3  -
                
    4 <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">
    <span>
    <b>保存到桌面<b>
    </b></b></span>
    </a>
    5 
    
    6 <span>
    <b>保存到桌面<b>
    </b></b></span>
    7 
    
    8 <b>保存到桌面<b>
    </b></b>
    9 保存到桌面
    10 <b>
    </b>
    11 
    
    12 
    
    13 
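To make the difference between children and descendants concrete, the sketch below counts both on a small, well-formed fragment. It assumes the built-in "html.parser" backend; the markup is a simplified stand-in for the example above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><a>one</a> - <a><span><b>two</b></span></a></div>"
soup = BeautifulSoup(html, "html.parser")

# children: only the nodes directly under <div>
child_count = len(list(soup.div.children))

# descendants: every node below <div>, depth first, text nodes included
descendant_count = len(list(soup.div.descendants))

# Tag names in the order descendants visits them
tag_names = [node.name for node in soup.div.descendants if isinstance(node, Tag)]

print(child_count, descendant_count, tag_names)
```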
    

    Parent nodes

    Direct parent

    html = """
    <html>
    <body>
    <div class="pull-right">
                <a class="header-help" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')" href="javascript:;">Ctrl+D 加入收藏夹</a> -
                <a class="header-help" style="color:red;" href="http://www.xunyingwang.com/tools/savewebsite">保存到桌面</a>
            </div>
    </body>
    </html>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.div.parent)
    

    Output:

    <body>
    <div class="pull-right">
    <a class="header-help" href="javascript:;" onclick="notice('快捷键 Ctrl+D 可以快速添加到收藏夹。')">Ctrl+D 加入收藏夹</a> -
                <a class="header-help" href="http://www.xunyingwang.com/tools/savewebsite" style="color:red;">保存到桌面</a>
    </div>
    </body>
    

    Ancestor nodes

    html = """
    <html>
        <body>
            <div>
                <p>I am</p>
                <p>Here</p>
            </div>
        </body>
    </html>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.p.parents)
    for i, parent in enumerate(soup.p.parents):
        print(i, parent)
    

    Output:

    <generator object PageElement.parents at 0x000001EC5AD93F48>
    0 <div>
    <p>I am</p>
    <p>Here</p>
    </div>
    1 <body>
    <div>
    <p>I am</p>
    <p>Here</p>
    </div>
    </body>
    2 <html>
    <body>
    <div>
    <p>I am</p>
    <p>Here</p>
    </div>
    </body>
    </html>
    3 <html>
    <body>
    <div>
    <p>I am</p>
    <p>Here</p>
    </div>
    </body>
    </html>
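The last two entries above look identical because the final ancestor is the BeautifulSoup object itself, which prints the same as the html tag. Printing the names makes this visible. A sketch assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = "<html><body><div><p>I am</p><p>Here</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first <p>; the last ancestor is the document root
ancestor_names = [parent.name for parent in soup.p.parents]

# find_parent jumps straight to the nearest ancestor with a given name
nearest_body = soup.p.find_parent("body")

print(ancestor_names)
print(nearest_body.name)
```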
    

    Sibling nodes

    html = """
    <div>
        <p>I am here?</p>
        <p>Where are you now?</p>
        <P>See you late</p>
        <p>You are my sunshine</p>
        <p>How much I love you</p>
    </div>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    # Siblings that come after the first <p>
    print(list(enumerate(soup.p.next_siblings)))
    # Siblings that come before the first <p>
    print(list(enumerate(soup.p.previous_siblings)))
    

    Output:

    [(0, '\n'), (1, <p>Where are you now?</p>), (2, '\n'), (3, <p>See you late</p>), (4, '\n'), (5, <p>You are my sunshine</p>), (6, '\n'), (7, <p>How much I love you</p>), (8, '\n')]
    [(0, '\n')]
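The newline strings in the output above can be skipped by asking for sibling tags by name. A sketch with placeholder paragraph text, assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = """<div>
<p>first</p>
<p>second</p>
<p>third</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# find_next_sibling("p") skips the whitespace text nodes between tags
second = soup.p.find_next_sibling("p")

# find_next_siblings returns every later <p> sibling
later = [p.get_text() for p in soup.p.find_next_siblings("p")]

# The same idea works backwards
earlier = second.find_previous_sibling("p").get_text()

print(second.get_text(), later, earlier)
```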
    

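    Standard selectors

    find_all(name, attrs, recursive, text, **kwargs)
    (Beautiful Soup 4.4.0 and later also accept string as the name of the text argument.)

    name: the tag name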

    html = """
    <div>
        <p>I am here?</p>
        <p>Where are you now?</p>
        <P>See you late</p>
        <p>You are my sunshine</p>
        <p>How much I love you</p>
    </div>
    """
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.find_all('p'))
    print(type(soup.find_all('p')))
    print(soup.find_all('p')[0])
    print(type(soup.find_all('p')[0]))
    

    Output:

    [<p>I am here?</p>, <p>Where are you now?</p>, <p>See you late</p>, <p>You are my sunshine</p>, <p>How much I love you</p>]
    <class 'bs4.element.ResultSet'>
    <p>I am here?</p>
    <class 'bs4.element.Tag'>
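Beyond a single tag name, find_all also accepts a list of names and a limit. The sketch below uses made-up markup and assumes the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><span>two</span><p>three</p><b>four</b></div>"
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names, in document order
p_and_span = [tag.get_text() for tag in soup.find_all(["p", "span"])]

# limit stops the search after the given number of matches
first_p_only = soup.find_all("p", limit=1)

print(p_and_span, first_p_only)
```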
    

    attrs: attributes

    html = """
    <div class="item active">
            <a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
            <div class="carousel-caption">反贪风暴3 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
            <div class="carousel-caption">黄金兄弟 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
            <div class="carousel-caption">超人总动员2 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
            <div class="carousel-caption">江湖儿女 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
            <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
        </div></div>"""
    
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'href': 'http://www.xunyingwang.com/movie/430296.html'}))
    print(soup.find_all(attrs={"class": 'carousel-caption'}))
    print(soup.find_all(class_='carousel-caption'))
    

    Output:

    [<a href="http://www.xunyingwang.com/movie/430296.html" target="_blank"><img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/></a>]
    [<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>]
    [<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>]
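One detail worth illustrating: class is multi-valued, so a search for one class also matches tags that carry additional classes. A sketch with shortened markup and placeholder captions, assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = """
<div class="item active"><div class="carousel-caption">A</div></div>
<div class="item"><div class="carousel-caption">B</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_="item" matches both outer divs, including the one with "item active"
items = soup.find_all("div", class_="item")

# class_="active" matches only the first outer div
active = soup.find_all("div", class_="active")

# attrs={"class": ...} is equivalent to the class_ keyword
captions = [d.get_text() for d in soup.find_all("div", attrs={"class": "carousel-caption"})]

print(len(items), len(active), captions)
```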
    

    text: text content

    html = """
    <div class="item active">
            <a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
            <div class="carousel-caption">反贪风暴3 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
            <div class="carousel-caption">黄金兄弟 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
            <div class="carousel-caption">超人总动员2 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
            <div class="carousel-caption">江湖儿女 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
            <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
            <div class="carousel-caption">超人总动员2 迅雷下载</div>
        </div></div>"""
    
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='超人总动员2 迅雷下载'))
    

    Output:

    ['超人总动员2 迅雷下载', '超人总动员2 迅雷下载']
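The text search returns the matching strings, not the tags around them. The sketch below shows a partial match with a regular expression and how to get back to the enclosing tag. It assumes the built-in "html.parser" backend and uses the string keyword, which is the newer name for text.

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="carousel-caption">反贪风暴3 迅雷下载</div>
<div class="carousel-caption">超人总动员2 迅雷下载</div>
<div class="carousel-caption">江湖儿女 迅雷下载</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must match the whole text node exactly
exact = soup.find_all(string="江湖儿女 迅雷下载")

# A compiled pattern matches any text node that contains it
partial = soup.find_all(string=re.compile("总动员"))

# Each result is a text node; .parent is the tag that holds it
parent_names = [node.parent.name for node in partial]

print(exact, partial, parent_names)
```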
    

    The find method: same arguments as find_all, but returns only the first match

    html = """
    <div class="item active">
            <a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
            <div class="carousel-caption">反贪风暴3 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
            <div class="carousel-caption">黄金兄弟 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
            <div class="carousel-caption">超人总动员2 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
            <div class="carousel-caption">江湖儿女 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
            <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
            <div class="carousel-caption">超人总动员2 迅雷下载</div>
        </div></div>"""
    
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.find(text='超人总动员2 迅雷下载'))
    

    Output:

    超人总动员2 迅雷下载
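Unlike find_all, which returns an empty list when nothing matches, find returns None, so the result should be checked before use. A sketch with placeholder markup, assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = '<div class="item"><a href="/movie/1.html">A</a></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")          # first matching tag
missing = soup.find("img")     # no <img> in this fragment, so None

# Guard against None before indexing into attributes
alt = missing["alt"] if missing is not None else None

print(link["href"], missing, alt)
```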
    

    CSS selectors

    html = """
    <div class="item active">
            <a target="_blank" href="http://www.xunyingwang.com/movie/635355.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" alt="反贪风暴3 迅雷下载"></a>
            <div class="carousel-caption">反贪风暴3 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/626726.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" alt="黄金兄弟 迅雷下载"></a>
            <div class="carousel-caption">黄金兄弟 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
            <div class="carousel-caption">超人总动员2 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/639458.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" alt="江湖儿女 迅雷下载"></a>
            <div class="carousel-caption">江湖儿女 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/464481.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" alt="蚁人2:黄蜂女现身 迅雷下载"></a>
            <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>
        </div>    <div class="item">
            <a target="_blank" href="http://www.xunyingwang.com/movie/430296.html"><img width="100%" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" alt="超人总动员2 迅雷下载"></a>
            <div class="carousel-caption">超人总动员2 迅雷下载</div>
        </div></div>"""
    
    soup = bs4.BeautifulSoup(html, 'lxml')
    print(soup.select(".item .carousel-caption"))
    print(soup.select(".item a img"))
    print(soup.select(".item div")[2].get_text())
    print(soup.select(".item a img")[4]['alt'])
    

    Output:

    [<div class="carousel-caption">反贪风暴3 迅雷下载</div>, <div class="carousel-caption">黄金兄弟 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>, <div class="carousel-caption">江湖儿女 迅雷下载</div>, <div class="carousel-caption">蚁人2:黄蜂女现身 迅雷下载</div>, <div class="carousel-caption">超人总动员2 迅雷下载</div>]
    [<img alt="反贪风暴3 迅雷下载" src="http://img1.xmspc.com/uploads/images/qush1m40f973.jpg" width="100%"/>, <img alt="黄金兄弟 迅雷下载" src="http://img1.xmspc.com/uploads/images/k8z1ddvla5el.jpg" width="100%"/>, <img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/>, <img alt="江湖儿女 迅雷下载" src="http://img1.xmspc.com/uploads/images/molwiw5acl5z.jpg" width="100%"/>, <img alt="蚁人2:黄蜂女现身 迅雷下载" src="http://img1.xmspc.com/uploads/images/pdfz2ygc1n7l.jpg" width="100%"/>, <img alt="超人总动员2 迅雷下载" src="http://img1.xmspc.com/uploads/images/oq11ymjufgdg.jpg" width="100%"/>]
    超人总动员2 迅雷下载
    蚁人2:黄蜂女现身 迅雷下载
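A few more selector forms, sketched on shortened markup with placeholder values: select_one for the first match, the child combinator, and chained classes. This assumes the built-in "html.parser" backend and a Beautiful Soup version whose select is backed by the soupsieve package.

```python
from bs4 import BeautifulSoup

html = """
<div class="item active"><a href="/movie/1.html"><img alt="A"></a><div class="carousel-caption">A</div></div>
<div class="item"><a href="/movie/2.html"><img alt="B"></a><div class="carousel-caption">B</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list
first_caption = soup.select_one(".item .carousel-caption").get_text()

# "div.item > a" matches only <a> tags that are direct children of div.item
child_links = [a["href"] for a in soup.select("div.item > a")]

# Chained classes require the tag to carry both of them
active_alt = soup.select_one("div.item.active img")["alt"]

print(first_caption, child_links, active_alt)
```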
    

    Reference documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html
