Python Tools: Parsing HTML with lxml

Author: 42chaos | Published: 2017-04-10 09:44

    Parsing with lxml

    from lxml import etree
    text='''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    '''
    
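    # etree.HTML() uses a lenient HTML parser and returns the root element
    html = etree.HTML(text)
    # To read from a file instead, pass an HTML parser explicitly
    # (etree.parse() defaults to the strict XML parser):
    # html = etree.parse('test.html', etree.HTMLParser())
    # tostring() returns bytes by default; encoding='unicode' yields a str
    result = etree.tostring(html, encoding='unicode')
    print(result)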
    

    Output. Note that lxml has filled in the missing closing </body> and </html> tags:

    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    </body></html>
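
The tag completion seen above can be checked on a smaller input. The following is a minimal sketch, assuming a hypothetical one-line fragment (the names fragment and root are illustrative and not from the original article); it shows that the HTML parser wraps a bare, unclosed paragraph in html and body elements:

```python
from lxml import etree

# Hypothetical fragment: no <html>, no <body>, and the <p> is never closed.
fragment = '<p class="story">Once upon a time'

# The lenient HTML parser builds a complete tree around the fragment.
root = etree.HTML(fragment)

print(root.tag)                        # tag of the root element
print([child.tag for child in root])   # tags of the root's direct children
print(etree.tostring(root, encoding='unicode'))
```

Here etree.tostring() is called with encoding='unicode' so that it returns a str rather than bytes, which keeps the printed output readable in Python 3.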
    

    Select the a elements and their href attributes

    print(html.xpath('//a'))
    # [<Element a at 0x10bdc0cb0>, <Element a at 0x10bdc0c68>, <Element a at 0x10bdc0b90>]
    print(html.xpath('//a/@href'))
    # ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
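
The same XPath approach extends to other attributes, such as class, and to text content. Below is a minimal sketch, assuming a shortened copy of the sample markup (the names snippet, classes, names and tillie are illustrative only):

```python
from lxml import etree

# Shortened, hypothetical copy of the sample markup used in this article.
snippet = '''
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''

html = etree.HTML(snippet)

# @class selects the class attribute of every a element.
classes = html.xpath('//a/@class')
# text() selects the text nodes directly inside each a element.
names = html.xpath('//a/text()')
# A predicate in square brackets filters elements by attribute value.
tillie = html.xpath('//a[@id="link3"]/@href')

print(classes)
print(names)
print(tillie)

# Element objects returned by xpath() expose attributes through get().
for a in html.xpath('//a'):
    print(a.get('id'), a.get('href'), a.text)
```

xpath() returns a list in all of these cases, so a lookup that matches nothing yields an empty list rather than raising an error.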
    


Article link: https://www.haomeiwen.com/subject/dzyhattx.html