Parsing HTML in Python with lxml

Author: tafanfly | Published: 2019-02-19 15:58

lxml

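lxml is a Python library for processing XML and HTML. It handles character-encoding issues automatically while parsing, and it has built-in support for XPath 1.0, XSLT 1.0, and custom element classes.
Installation: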

pip install lxml
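
To confirm the install worked, a quick check is to import the package and print its version. This is only a minimal sketch; the version numbers printed depend on what pip installed.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers; the exact values depend on the installed release.
lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
print("lxml version:", lxml_version)
```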

Using lxml

HTML example

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a> 
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a> 
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
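(1) Reading HTML

Below, test is a string holding the example above, and test.html is a file containing it.

  • Parse from a string
from lxml import etree
html = etree.HTML(test)
  • Parse from a file
from lxml import etree
# etree.parse uses the strict XML parser by default, so pass an HTMLParser for HTML files
html = etree.parse('test.html', etree.HTMLParser())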
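
The two ways of reading HTML can be tried side by side in one script. The sketch below is self-contained under a few assumptions: the markup is a shortened stand-in for the example page, and the file is a temporary one created only for the demonstration.

```python
import os
import tempfile

from lxml import etree

# A shortened stand-in for the example page (assumption: any HTML behaves the same way).
test = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Study</title></head>
<body>
<h1>webpage</h1>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
</body>
</html>"""

# etree.HTML parses a string and returns the root <html> element.
root_from_string = etree.HTML(test)

# Write the same markup to a temporary file so the example does not need test.html.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as handle:
    handle.write(test)
    path = handle.name

try:
    # etree.parse returns an ElementTree; the HTMLParser tolerates unclosed tags like <meta>.
    tree = etree.parse(path, etree.HTMLParser())
    root_from_file = tree.getroot()
finally:
    os.remove(path)

print(root_from_string.tag, root_from_file.tag)
print(root_from_file.xpath("//title/text()"))
```

Note that etree.HTML returns an element while etree.parse returns a tree object; both expose xpath(), so the queries in the following sections work with either.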
(2) Selecting tags

The a tags in this HTML can be selected in several equivalent ways; each expression below returns the same 4 elements.

  • //a: selects every a tag in the document, at any depth
  • /html/body/a: follows the path from the root, node by node, down to the a tags
  • /descendant::a: selects the a tags among the descendants of the document root
a_tags = html.xpath('//a')
In [12]: print(a_tags)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
a_tags_2 = html.xpath('/html/body/a')
In [14]: print(a_tags_2)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
a_tags_3 = html.xpath('/descendant::a')
In [16]: print(a_tags_3)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
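
Since the printed element addresses are hard to compare by eye, one way to check that the three expressions select the same tags is to compare their href values. A minimal sketch, assuming a trimmed copy of the example page without the head section:

```python
from lxml import etree

# Trimmed copy of the example page (the <head> is omitted for brevity).
test = """<html><body>
<h1>webpage</h1>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""
html = etree.HTML(test)

expressions = ["//a", "/html/body/a", "/descendant::a"]

# For each expression, collect the href of every matched element, in document order.
hrefs = {expr: [a.get("href") for a in html.xpath(expr)] for expr in expressions}

for expr, values in hrefs.items():
    print(expr, "->", len(values), "elements")

# True when all three expressions matched the same links in the same order.
same_links = hrefs["//a"] == hrefs["/html/body/a"] == hrefs["/descendant::a"]
print(same_links)
```

The three expressions agree here only because every a tag is a direct child of body; on a page with links nested deeper, /html/body/a would match fewer elements than //a.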
(3) Getting tag attributes and text

Appending /@href to any of the expressions in (2) returns the attribute values directly, and appending /text() returns the text of each tag.

a_attribute_2 = html.xpath('/html/body/a/@href')

In [21]: print(a_attribute_2)
['http://www.runoob.com/html/html-tutorial.html', 'http://www.runoob.com/python/python-tutorial.html', 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'http://www.runoob.com/java/java-tutorial.html']

a_text_2 = html.xpath('/html/body/a/text()')

In [31]: print(a_text_2)
['HTML', 'Python', 'C++', 'Java']

Alternatively, take the elements returned in (2) and loop over them one by one.

for tag in a_tags_2:
    print(tag.attrib, tag.text)

{'href': 'http://www.runoob.com/html/html-tutorial.html', 'target': '_blank'} HTML
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'} Python
{'href': 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'target': '_blank'} C++
{'href': 'http://www.runoob.com/java/java-tutorial.html', 'target': '_blank'} Java
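
When only one attribute is needed, the element's get() method is a common alternative to reading the whole attrib mapping. The sketch below assumes a two-link excerpt of the example page; the "untitled" default is just an illustrative value.

```python
from lxml import etree

# Two of the links from the example page, trimmed to keep the sketch short.
test = """<html><body>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
</body></html>"""
html = etree.HTML(test)

# Map each link's text to its href.
links = {a.text: a.get("href") for a in html.xpath("/html/body/a")}

# get() returns None (or a supplied default) when the attribute is absent,
# whereas indexing attrib with a missing key raises KeyError.
first = html.xpath("/html/body/a")[0]
title = first.get("title")
title_or_default = first.get("title", "untitled")

print(links)
print(title, title_or_default)
```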
(4) Filtering tags
  • By attribute
python_tag = html.xpath('/html/body/a[@href="http://www.runoob.com/python/python-tutorial.html"]')

In [42]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [43]: print(python_tag[0].text)
Python
  • By text
python_tag = html.xpath('/html/body/a[text()="Python"]')

In [47]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [48]: print(python_tag[0].text)
Python
  • By position (XPath positions start at 1, so 2 is the second a tag)
python_tag = html.xpath('/html/body/a[position()=2]')
# Shorthand for the same predicate:
# python_tag = html.xpath('/html/body/a[2]')

In [52]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [53]: print(python_tag[0].text)
Python
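
Because xpath() always returns a list, indexing with [0] raises IndexError when a filter matches nothing. A minimal sketch of guarding against that follows; first_or_none is a hypothetical helper written for this example, not part of lxml, and the markup is a trimmed copy of the example page.

```python
from lxml import etree

# Trimmed copy of the example page (the <head> is omitted for brevity).
test = """<html><body>
<h1>webpage</h1>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""
html = etree.HTML(test)


def first_or_none(expression):
    """Hypothetical helper: return the first match of an XPath expression, or None."""
    matches = html.xpath(expression)
    return matches[0] if matches else None


by_attribute = first_or_none('/html/body/a[@href="http://www.runoob.com/python/python-tutorial.html"]')
by_text = first_or_none('/html/body/a[text()="Python"]')
by_position = first_or_none('/html/body/a[2]')
no_match = first_or_none('/html/body/a[text()="Ruby"]')

# All three filters select the same link.
for element in (by_attribute, by_text, by_position):
    print(element.text, element.get("href"))

# A filter that matches nothing yields None instead of raising IndexError.
print(no_match)
```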

For more expressions, see the companion post on learning XPath in Python.
Reference: https://www.jianshu.com/p/2ae6d51522c3