[Python Learning] No.7: Web Scraping Notes

Author: LL_路上 | Published: 2019-04-14 20:33

1. The difference between text() and string() in XPath

<book><author>Tom John</author></book>

XML example:

<book>
  <author>Tom <em>John</em> cat</author>
  <pricing>
    <price>20</price>
    <discount>0.8</discount>
  </pricing>
</book>

text()
text() often appears at the end of an XPath expression. It returns only the text nodes that are direct children of the selected element.

let $x := book/author/text()
return $x

The result is Tom cat. John is missing because it belongs to the em child element and is not part of author's own direct text content.

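string()
The string() function collects the text of every node inside the selected element, descendants included, and concatenates it into a single string.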

let $x := book/author/string()
return $x

The result is "Tom John cat".
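The contrast between text() and string() can be reproduced in Python. The following is a minimal sketch using only the standard library's xml.etree.ElementTree, whose XPath subset supports neither text() nor string(), so both are emulated by hand. The helper names direct_text and string_value are made up for this illustration.

```python
import xml.etree.ElementTree as ET

XML = "<book><author>Tom <em>John</em> cat</author></book>"
author = ET.fromstring(XML).find("author")


def direct_text(elem):
    """Text nodes that are direct children of elem, similar to elem/text()."""
    parts = [elem.text or ""]
    # A child's .tail is the text that follows the child, which still belongs to elem.
    parts += [child.tail or "" for child in elem]
    return [p for p in parts if p]


def string_value(elem):
    """All descendant text concatenated, similar to string(elem)."""
    return "".join(elem.itertext())


print(direct_text(author))   # ['Tom ', ' cat']
print(string_value(author))  # Tom John cat
```

With a full XPath engine the same distinction shows up without helpers. The hand-written versions here only make explicit which text nodes each expression picks up.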

data()
In most cases data() and string() are interchangeable. Frequent use of data() is nevertheless discouraged, because it reportedly hurts XPath performance.

let $x := book/pricing/string()
return $x

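This returns 200.8: the text of price and discount joined into one string (ignoring the whitespace between the elements).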

let $x := book/pricing/data()
return $x

This returns 20 and 0.8 as separate values. Their type is xs:anyAtomicType rather than string, so numeric functions and operators can be applied to them.

let $x := book/pricing/price/data()
let $y := book/pricing/discount/data()
return $x*$y

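An example like the one above has to use data() rather than text() or string(), because XPath (in the 2.0/XQuery form used here) does not support arithmetic on string values.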
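The same concern applies when the values are pulled out in Python: extracted text arrives as str and has to be converted before any arithmetic. A small sketch with the standard library's xml.etree.ElementTree, reusing the pricing fragment from the example above (the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

XML = """<book>
  <pricing>
    <price>20</price>
    <discount>0.8</discount>
  </pricing>
</book>"""
pricing = ET.fromstring(XML).find("pricing")

# findtext() returns a str, so convert explicitly before multiplying.
price = float(pricing.findtext("price"))
discount = float(pricing.findtext("discount"))

final_price = price * discount
print(final_price)
```

Multiplying the raw strings instead would raise a TypeError, which is roughly the Python counterpart of the type restriction described above.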

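Summary
text() is not a function (it is a node test), and a small change in the XML structure can make its result differ from what you expect, so use it sparingly. data() is a special-purpose function that may cause performance problems, so avoid it unless you specifically need it. string() covers most needs.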
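To illustrate why direct text is fragile, the sketch below compares two variants of the same author element: one plain, one where part of the name has been wrapped in em. It uses the standard library's xml.etree.ElementTree and reads elem.text as a rough stand-in for what a scraper would get from the first author/text() node. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

plain = ET.fromstring(
    "<book><author>Tom John</author></book>").find("author")
marked = ET.fromstring(
    "<book><author>Tom <em>John</em></author></book>").find("author")

# Leading direct text only: changes as soon as inline markup is added.
first_text = [plain.text, marked.text]

# Full string value: stays the same for both variants.
full_text = ["".join(e.itertext()) for e in (plain, marked)]

print(first_text)  # ['Tom John', 'Tom ']
print(full_text)   # ['Tom John', 'Tom John']
```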

2. Commonly used XPath functions

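contains(): //div[contains(@id,'in')] selects div nodes whose id contains 'in'.
text(): the text of a node is not an attribute. For example, in <a class="baidu" href="http://www.baidu.com">baidu</a> the word baidu cannot be matched through @class or @href, so text() is used to match the node: //a[text()='baidu']
last(): already covered earlier.
starts-with(): //div[starts-with(@id,'in')] selects div nodes whose id attribute starts with 'in'.
not(): expresses negation. //input[@name='identity' and not(contains(@class,'a'))] matches input nodes whose name is identity and whose class value does not contain a. not() is usually combined with functions that return true or false, such as contains() and starts-with(). One special case is worth noting: to match input nodes that have an id attribute, write //input[@id]; to match input nodes that do not have an id attribute, write //input[not(@id)]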
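The predicates above can be tried out in Python. The standard library's xml.etree.ElementTree implements only a small XPath subset: [@id] is supported, while contains(), starts-with(), not() and text() are not. The sketch below therefore writes each of those predicates as an ordinary Python filter, with the XPath it mirrors shown in a comment. The sample markup and variable names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

HTML = """<root>
  <div id="index">a</div>
  <div id="main">b</div>
  <div id="footer">c</div>
  <a class="baidu" href="http://www.baidu.com">baidu</a>
  <input name="identity" class="b" id="x1"/>
  <input name="identity" class="a"/>
</root>"""
root = ET.fromstring(HTML)

# //div[contains(@id,'in')]  ("main" also contains "in")
contains_in = [d.get("id") for d in root.iter("div")
               if "in" in d.get("id", "")]

# //div[starts-with(@id,'in')]
starts_in = [d.get("id") for d in root.iter("div")
             if d.get("id", "").startswith("in")]

# //a[text()='baidu']
links = [a.get("href") for a in root.iter("a") if a.text == "baidu"]

# //input[@name='identity' and not(contains(@class,'a'))]
inputs = [i.get("class") for i in root.iter("input")
          if i.get("name") == "identity" and "a" not in i.get("class", "")]

# //input[@id] is supported natively; //input[not(@id)] needs a filter.
with_id = root.findall(".//input[@id]")
without_id = [i for i in root.iter("input") if "id" not in i.attrib]

print(contains_in, starts_in, links, inputs)
print(len(with_id), len(without_id))
```

A library with full XPath 1.0 support would accept the expressions in the comments directly. The filters here are only meant to spell out what each predicate tests.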
