etree.HTML() and etree.tostrin in the lxml library


Author: 小董不太懂 | Published 2019-07-20 15:14

    1. Test HTML

    # Test file: test.html
    <html>
        <head>
            <meta charset="UTF-8">
        </head>
        <body>
            <div class='main-content'>
                <h1 id="title">This is a test!</h1>
                <p class="main-content ref">This is paragraph1</p>
                <div>
                    <p>测试语句1</p>
                </div>
            </div>
            <div>
                <p>This is paragraph2</p>
                <div>
                    <p class="ref">测试语句2</p>
                </div>
            </div>
        </body>
    </html>
    
    

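    2. etree.HTML()

    etree.HTML() parses a string of HTML and returns the root element of the resulting tree, which can then be queried with XPath. The parser also repairs malformed HTML automatically, for example by adding missing closing tags.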

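    from lxml import etree  # import the etree module from lxml
    
    # test.html contains non-ASCII text, so state the encoding explicitly
    with open('test.html', 'r', encoding='utf-8') as f:
        c = f.read()
    # etree.HTML() parses the string and returns the root <html> element,
    # which can then be queried with XPath
    tree = etree.HTML(c)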
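
To make the auto-repair concrete, here is a minimal sketch. The `broken` string is a made-up fragment (not the test.html above) with unclosed `<p>` and `<div>` tags and no `<html>` or `<body>` wrapper; `encoding="unicode"` is passed to `etree.tostring()` only so that the result prints as a str.

```python
from lxml import etree

# Hypothetical malformed input: <p> and <div> are never closed,
# and there is no <html>/<body> wrapper.
broken = "<div class='main-content'><p>This is paragraph1"

root = etree.HTML(broken)

# etree.HTML() returns the root element of the repaired tree
print(root.tag)  # -> html

# XPath queries work on the repaired tree
texts = root.xpath("//div[@class='main-content']/p/text()")
print(texts)  # -> ['This is paragraph1']

# Serialize the tree to inspect the closing tags the parser added
repaired = etree.tostring(root, encoding="unicode")
print(repaired)
```

The printed markup should contain the `</p>` and `</div>` tags that were missing from the input, wrapped in `<html>` and `<body>`.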
    
    

    3. etree.tostring()

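    etree.tostring() serializes an element, including the repaired tree built by etree.HTML(), back to markup. It returns bytes by default, so call decode() to turn the result into a str.

    The codec passed to decode() must match the encoding of those bytes. With no encoding argument, tostring() emits ASCII and writes non-ASCII characters as numeric character references, so decode('utf-8') is safe here. The page's own charset, which you can check in the browser's inspector, matters when reading the downloaded page rather than when decoding tostring() output.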
    from lxml import etree
    
    with open('real_case.html', 'r', encoding='utf-8') as f:
        c = f.read()
    tree = etree.HTML(c)
    # Note: the parser does not insert <tbody>, so this XPath assumes
    # the saved HTML contains an explicit <tbody>
    table_element = tree.xpath("//div[@class='table-box'][1]/table/tbody/tr")
    
    for row in table_element:
        cells = row.xpath('td')
        if not cells:
            # skip rows without <td> cells (for example header rows)
            continue
        # tostring() returns bytes; decode them to get a str
        s1 = etree.tostring(cells[0]).decode('utf-8')
        print(s1)
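
The bytes-versus-str point can be checked on a small fragment. This sketch assumes a one-line snippet modelled on test.html rather than the real page, and compares three ways of calling `etree.tostring()`: the default, `encoding="utf-8"`, and `encoding="unicode"`.

```python
from lxml import etree

# Hypothetical fragment modelled on test.html
root = etree.HTML("<div><p class='ref'>测试语句2</p></div>")
p = root.xpath("//p[@class='ref']")[0]

# Default: ASCII bytes; non-ASCII characters become numeric character references
default_bytes = etree.tostring(p)
print(type(default_bytes), default_bytes)

# Explicit UTF-8: still bytes, but the Chinese text is kept as UTF-8
utf8_bytes = etree.tostring(p, encoding="utf-8")
print(utf8_bytes.decode("utf-8"))

# encoding="unicode": returns a str directly, so no decode() is needed
text = etree.tostring(p, encoding="unicode")
print(text)
```

If the escaped form produced by the default call is not what you want, either of the last two calls keeps the original characters.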
    
    


        Article link: https://www.haomeiwen.com/subject/kuhulctx.html