Python HTML Parsing: lxml

Author: tafanfly | Published: 2019-02-19 15:58

    lxml

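    lxml is a Python library for processing XML and HTML. It handles character-encoding issues automatically during parsing, and it natively supports XPath 1.0, XSLT 1.0, and custom element classes.
    Installation: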

    pip install lxml

    Using lxml

    Sample HTML

    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="utf-8">
    <title>Study</title>
    </head>
    <body>
    
    <h1>webpage</h1>
    <p>source link</p>
    <a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a> 
    <a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
    <a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a> 
    <a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
    </body>
    </html>
    
    (1) Reading HTML

    Below, test is a string holding the sample HTML above, and test.html is a file with the same content.

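    • Parse from a string
    from lxml import etree
    html = etree.HTML(test)
    
    • Parse from a file
    from lxml import etree
    # etree.parse uses the XML parser by default, so pass an HTMLParser for HTML files
    html = etree.parse('test.html', etree.HTMLParser())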
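
    The two ways of reading can be compared side by side. The following is a minimal sketch, assuming Python 3 and that the sample HTML is held in a variable named SAMPLE_HTML (a name chosen here for illustration); a temporary file stands in for test.html.

```python
import os
import tempfile

from lxml import etree

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
"""

# Parse directly from a string; etree.HTML returns the root <html> element.
root_from_string = etree.HTML(SAMPLE_HTML)

# Write the same content to a temporary file (a stand-in for test.html).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write(SAMPLE_HTML)
    path = f.name

try:
    # etree.parse defaults to the strict XML parser, which would reject the
    # unclosed <meta> tag, so an HTMLParser is passed explicitly.
    tree_from_file = etree.parse(path, etree.HTMLParser())
finally:
    os.remove(path)

# etree.parse returns an ElementTree; getroot() gives the <html> element.
root_from_file = tree_from_file.getroot()

print(root_from_string.tag, root_from_file.tag)
print(len(root_from_string.xpath('//a')), len(tree_from_file.xpath('//a')))
```

    Both objects accept the same xpath() calls; the main difference is that etree.HTML returns an element while etree.parse returns a tree wrapping it.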
    
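    (2) Getting tags

    To get every a tag, several XPath expressions work for this document, and each one returns the same 4 elements directly.

    • //a: selects all a tags anywhere in the document
    • /html/body/a: walks the node path in order down to the a tags
    • /descendant::a: selects a tags among the descendants of the document root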
    a_tags = html.xpath('//a')
    In [12]: print(a_tags)
    [<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
    a_tags_2 = html.xpath('/html/body/a')
    In [14]: print(a_tags_2)
    [<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
    a_tags_3 = html.xpath('/descendant::a')
    In [16]: print(a_tags_3)
    [<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
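
    Memory addresses are hard to compare by eye, so one way to check that the three expressions select the same nodes is to compare each element's location in the tree. This is a rough sketch, assuming Python 3 and a SAMPLE_HTML variable (an illustrative name) holding the sample page; getpath() is used only as a convenient way to label each element.

```python
from lxml import etree

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
"""

html = etree.HTML(SAMPLE_HTML)
tree = html.getroottree()

expressions = ['//a', '/html/body/a', '/descendant::a']

# For each expression, record the structural path of every matched element.
paths = {
    expr: [tree.getpath(element) for element in html.xpath(expr)]
    for expr in expressions
}

for expr, found in paths.items():
    print(expr, len(found), found)

# True when all three expressions matched the same nodes in the same order.
all_equal = paths['//a'] == paths['/html/body/a'] == paths['/descendant::a']
print(all_equal)
```

    The expressions agree here only because every a tag sits directly under body. On a page with links nested inside other tags, /html/body/a would return fewer elements than //a.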
    
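    (3) Getting tag attributes and text

    Building on the paths in (2), appending /@href returns the attribute values directly, and appending /text() returns the text.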

    a_attribute_2 = html.xpath('/html/body/a/@href')
    
    In [21]: print(a_attribute_2)
    ['http://www.runoob.com/html/html-tutorial.html', 'http://www.runoob.com/python/python-tutorial.html', 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'http://www.runoob.com/java/java-tutorial.html']
    
    a_text_2 = html.xpath('/html/body/a/text()')
    
    In [31]: print(a_text_2)
    ['HTML', 'Python', 'C++', 'Java']
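
    Since both lists come back in document order, they can be paired up into a lookup table. The sketch below assumes Python 3, a SAMPLE_HTML variable (an illustrative name) holding the sample page, and that every a tag has exactly one text node and one href; if that did not hold, the two lists could fall out of step.

```python
from lxml import etree

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
"""

html = etree.HTML(SAMPLE_HTML)

# Each query returns a flat list of strings, one per matching node.
hrefs = html.xpath('/html/body/a/@href')
texts = html.xpath('/html/body/a/text()')

# Pair link text with its URL; valid only while the lists line up one-to-one.
links = dict(zip(texts, hrefs))

for name, url in links.items():
    print(name, '->', url)
```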
    

    Alternatively, take the element list from (2) and loop over it one element at a time.

    for tag in a_tags_2:
        print(tag.attrib, tag.text)
    
    {'href': 'http://www.runoob.com/html/html-tutorial.html', 'target': '_blank'} HTML
    {'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'} Python
    {'href': 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'target': '_blank'} C++
    {'href': 'http://www.runoob.com/java/java-tutorial.html', 'target': '_blank'} Java
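
    When only some attributes are needed, reading them one at a time with get() may be more convenient than printing the whole attrib mapping. The following is a small sketch, assuming Python 3 and a SAMPLE_HTML variable (an illustrative name) holding the sample page; the rel attribute and the '_self' default are used purely to show what happens when an attribute is absent.

```python
from lxml import etree

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
"""

html = etree.HTML(SAMPLE_HTML)

records = []
for tag in html.xpath('/html/body/a'):
    href = tag.get('href')               # value of an attribute that exists
    target = tag.get('target', '_self')  # default applies only if missing
    rel = tag.get('rel')                 # absent in the sample, so None
    records.append((tag.text, href, target, rel))

for record in records:
    print(record)
```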
    
    (4) Filtering tags
    • By attribute
    python_tag = html.xpath('/html/body/a[@href="http://www.runoob.com/python/python-tutorial.html"]')
    
    In [42]: print(python_tag[0].attrib)
    {'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
    In [43]: print(python_tag[0].text)
    Python
    
    • By text
    python_tag = html.xpath('/html/body/a[text()="Python"]')
    
    In [47]: print(python_tag[0].attrib)
    {'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
    In [48]: print(python_tag[0].text)
    Python
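
    The attribute and text filters above hard-code the value inside the expression. When the value comes from a variable, lxml lets the expression refer to it as an XPath variable ($name) supplied as a keyword argument to xpath(), which avoids quoting problems. Below is a sketch, assuming Python 3 and a SAMPLE_HTML variable holding the sample page; the two helper functions are hypothetical names introduced for illustration, and contains() is a standard XPath 1.0 function used here for partial attribute matches.

```python
from lxml import etree

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
"""

html = etree.HTML(SAMPLE_HTML)


def find_links_by_text(root, label):
    """Hypothetical helper: return the a tags whose text equals label."""
    # $label is an XPath variable; lxml fills it from the keyword argument.
    return root.xpath('/html/body/a[text()=$label]', label=label)


def find_links_by_href_part(root, part):
    """Hypothetical helper: return the a tags whose href contains part."""
    return root.xpath('/html/body/a[contains(@href, $part)]', part=part)


python_tag = find_links_by_text(html, 'Python')[0]
cpp_tag = find_links_by_href_part(html, 'cplusplus')[0]

print(python_tag.get('href'))
print(cpp_tag.text)
```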
    
    • By position
    # XPath positions are 1-based, so 2 selects the second a tag
    python_tag = html.xpath('/html/body/a[position()=2]')
    # Equivalent shorthand:
    # python_tag = html.xpath('/html/body/a[2]')
    
    In [52]: print(python_tag[0].attrib)
    {'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
    In [53]: print(python_tag[0].text)
    Python
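
    Position filters are easy to get wrong because XPath counts from 1 while Python lists count from 0. The sketch below contrasts the two, assuming Python 3 and a SAMPLE_HTML variable (an illustrative name) holding the sample page; last() and position() are standard XPath 1.0 functions.

```python
from lxml import etree

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
"""

html = etree.HTML(SAMPLE_HTML)

# XPath: [2] is the second a tag among its siblings (1-based).
second_by_xpath = html.xpath('/html/body/a[2]')[0]

# Python: index 1 is the second item of the result list (0-based).
all_links = html.xpath('/html/body/a')
second_by_index = all_links[1]

# last() selects the final sibling; position() supports range comparisons.
last_link = html.xpath('/html/body/a[last()]')[0]
first_two = html.xpath('/html/body/a[position()<=2]')

print(second_by_xpath.text, second_by_index.text)
print(last_link.text)
print([element.text for element in first_two])
```

    Note that a[2] counts among siblings under the same parent, so on pages where links are spread across several containers it can match more than one element.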
    

    For more XPath expressions, see "Learning XPath in Python" (python xpath的学习).
    Reference: https://www.jianshu.com/p/2ae6d51522c3
