Python HTML Parsing: lxml

Author: tafanfly | Published 2019-02-19 15:58


lxml

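lxml is a Python library for processing XML and HTML. It handles encoding issues automatically while parsing, and it natively supports XPath 1.0, XSLT 1.0, and custom element classes.
Installation: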

pip install lxml
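
A quick way to confirm that the install worked is to import the package and print its version. This is only a minimal sketch; the exact numbers printed depend on the environment.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release.
print("lxml version:", ".".join(str(n) for n in etree.LXML_VERSION))

# LIBXML_VERSION is the version of the underlying libxml2 library in use.
print("libxml2 version:", ".".join(str(n) for n in etree.LIBXML_VERSION))
```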

Using lxml

Sample HTML

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a> 
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a> 
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>

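(1) Reading HTML

In the examples below, test is a string holding the sample HTML above, and test.html is a file containing the same markup.

Parse directly from a string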

from lxml import etree
html = etree.HTML(test)

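Parse directly from a file

from lxml import etree
# The file name must be a string. etree.parse uses an XML parser by default,
# so pass an HTMLParser to accept HTML such as the unclosed <meta> tag.
html = etree.parse('test.html', etree.HTMLParser())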
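
To make the two loading styles concrete, here is a minimal sketch rather than the author's exact setup. It assumes the sample page is held in a string named test (shortened here to two links) and writes it to a temporary test.html so the file-based call has something to read. The temporary directory and the names root_from_string and root_from_file are illustrative assumptions.

```python
import os
import tempfile

from lxml import etree

# Shortened stand-in for the sample page above (assumption: two links only).
test = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>
<h1>webpage</h1>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
</body>
</html>"""

# From a string: etree.HTML returns the root <html> element.
root_from_string = etree.HTML(test)

# From a file: etree.parse returns an ElementTree. The HTMLParser lets it
# accept markup that is not well-formed XML (e.g. the unclosed <meta> tag).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(test)
    tree = etree.parse(path, etree.HTMLParser())

root_from_file = tree.getroot()

print(root_from_string.tag, root_from_file.tag)
print(len(root_from_string.xpath("//a")), len(tree.xpath("//a")))
```

Both the element returned by etree.HTML and the tree returned by etree.parse expose an xpath method, so the queries in the following sections work with either.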

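(2) Selecting tags

There are several ways to select all a tags in this HTML; each of the expressions below returns the same 4 elements.

//a: selects every a tag in the document
/html/body/a: follows the node path step by step down to the a tags
/descendant::a: selects the a tags among all descendants of the document root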

a_tags = html.xpath('//a')
In [12]: print(a_tags)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
a_tags_2 = html.xpath('/html/body/a')
In [14]: print(a_tags_2)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
a_tags_3 = html.xpath('/descendant::a')
In [16]: print(a_tags_3)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
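
The equivalence of the three expressions can be checked in one short script. This is a sketch under the assumption that test holds a trimmed copy of the sample page's body. The names paths, texts_per_path, and all_same are illustrative.

```python
from lxml import etree

# Trimmed copy of the sample page (assumption: head section omitted).
test = """<html><body>
<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""
html = etree.HTML(test)

paths = ["//a", "/html/body/a", "/descendant::a"]

# Run each expression and record the link texts it finds, in document order.
texts_per_path = [[a.text for a in html.xpath(p)] for p in paths]

for path, texts in zip(paths, texts_per_path):
    print(path, len(texts), texts)

# True when every expression selected the same links in the same order.
all_same = all(texts == texts_per_path[0] for texts in texts_per_path)
print(all_same)
```

The three forms only coincide here because every a tag sits directly under body. On a page with links nested deeper, /html/body/a would miss them while //a would still match.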

(3) Getting tag attributes and text

Appending /@href or /text() to any of the paths in (2) returns the attribute values or the text directly.

a_attribute_2 = html.xpath('/html/body/a/@href')

In [21]: print(a_attribute_2)
['http://www.runoob.com/html/html-tutorial.html', 'http://www.runoob.com/python/python-tutorial.html', 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'http://www.runoob.com/java/java-tutorial.html']

a_text_2 = html.xpath('/html/body/a/text()')

In [31]: print(a_text_2)
['HTML', 'Python', 'C++', 'Java']

Alternatively, take the element list from (2) and loop over it.

for tag in a_tags_2:
    print(tag.attrib, tag.text)

{'href': 'http://www.runoob.com/html/html-tutorial.html', 'target': '_blank'} HTML
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'} Python
{'href': 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'target': '_blank'} C++
{'href': 'http://www.runoob.com/java/java-tutorial.html', 'target': '_blank'} Java
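
The two approaches can be combined, for example to build a mapping from link text to URL. The following is a small sketch that assumes the same trimmed sample page. The name links is an illustrative choice, and get is the standard element method for reading a single attribute.

```python
from lxml import etree

# Trimmed copy of the sample page (assumption: head section omitted).
test = """<html><body>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""
html = etree.HTML(test)

# XPath can return attribute values and text nodes directly as strings.
hrefs = html.xpath("/html/body/a/@href")
texts = html.xpath("/html/body/a/text()")

# Or read both from each element: .text is the element's text,
# .get(name) returns one attribute value (None if it is absent).
links = {a.text: a.get("href") for a in html.xpath("/html/body/a")}

for text, href in links.items():
    print(text, "->", href)
```

Reading from the elements keeps each text paired with its own href, which is safer than zipping two separate result lists when some tags may lack the attribute.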

(4) Filtering tags

By attribute

python_tag = html.xpath('/html/body/a[@href="http://www.runoob.com/python/python-tutorial.html"]')

In [42]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [43]: print(python_tag[0].text)
Python

By text

python_tag = html.xpath('/html/body/a[text()="Python"]')

In [47]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [48]: print(python_tag[0].text)
Python

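By position

XPath positions start at 1, so position()=2 selects the second a tag.

python_tag = html.xpath('/html/body/a[position()=2]')
# Equivalent shorthand:
# python_tag = html.xpath('/html/body/a[2]')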

In [52]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [53]: print(python_tag[0].text)
Python
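
The three filters can be run side by side to confirm that they pick out the same element. The sketch below assumes the same trimmed sample page. It also tries two further XPath 1.0 functions, contains() and last(), which are not used in the examples above and are included only as an illustration.

```python
from lxml import etree

# Trimmed copy of the sample page (assumption: head section omitted).
test = """<html><body>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""
html = etree.HTML(test)

python_url = "http://www.runoob.com/python/python-tutorial.html"

# The three filters from this section: exact attribute, exact text, position.
by_attribute = html.xpath('/html/body/a[@href="%s"]' % python_url)
by_text = html.xpath('/html/body/a[text()="Python"]')
by_position = html.xpath('/html/body/a[2]')  # XPath positions start at 1

for result in (by_attribute, by_text, by_position):
    print(result[0].text, result[0].get("href"))

# Additional XPath 1.0 functions (illustrative extras):
# contains() does a substring match, last() selects the final sibling.
by_substring = html.xpath('/html/body/a[contains(@href, "python")]')
last_link = html.xpath('/html/body/a[last()]')
print(by_substring[0].text, last_link[0].text)
```

xpath always returns a list for element queries, so check that the list is non-empty before indexing with [0] when a match is not guaranteed.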

For more expressions, see the article "python xpath的学习" (Learning XPath in Python).
Reference: https://www.jianshu.com/p/2ae6d51522c3