<html>
<head>
<title>Title of the page</title>
</head>
<body>
<h1>H1 Tag</h1>
<h2>H2 Tag with <a href="#">link</a></h2>
<p>First Paragraph</p>
<p>Second Paragraph</p>
</body>
</html>
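Take the simple HTML page above as an example.
Run the following commands in order (the first in a terminal, the second inside the Scrapy shell it opens):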
scrapy shell
from scrapy.selector import Selector
Type the following into a text editor and copy it:
html_doc = '''
<html>
<head>
<title>Title of the page</title>
</head>
<body>
<h1>H1 Tag</h1>
<h2>H2 Tag with <a href="#">link</a></h2>
<p>First Paragraph</p>
<p>Second Paragraph</p>
</body>
</html>
'''
Then paste it into the shell with the %paste command. To parse the contents of html_doc with XPath, enter the following in order:
In [9]: sel = Selector(text=html_doc)
In [10]: sel.extract()
Out[10]: u'<html>\n <head>\n <title>Title of the page</title>\n </head>\n <body>\n <h1>H1 Tag</h1>\n <h2>H2 Tag with <a href="#">link</a></h2>\n <p>First Paragraph</p>\n <p>Second Paragraph</p>\n </body>\n</html>'
In [11]: sel.xpath('/html/head/title')
Out[11]: [<Selector xpath='/html/head/title' data=u'<title>Title of the page</title>'>]
In [12]: sel.xpath('/html/head/title').extract()
Out[12]: [u'<title>Title of the page</title>']
In [13]: sel.xpath('//title').extract()
Out[13]: [u'<title>Title of the page</title>']
In [14]: sel.xpath('//text').extract()
Out[14]: []
In [15]: sel.xpath('//text()').extract()
Out[15]:
[u'\n ',
u'\n ',
u'Title of the page',
u'\n ',
u'\n ',
u'\n ',
u'H1 Tag',
u'\n ',
u'H2 Tag with ',
u'link',
u'\n ',
u'First Paragraph',
u'\n ',
u'Second Paragraph',
u'\n ',
u'\n']
In [16]: sel.xpath('/html/body/p')
Out[16]:
[<Selector xpath='/html/body/p' data=u'<p>First Paragraph</p>'>,
<Selector xpath='/html/body/p' data=u'<p>Second Paragraph</p>'>]
In [17]: sel.xpath('/html/body/p').extract()
Out[17]: [u'<p>First Paragraph</p>', u'<p>Second Paragraph</p>']
In [18]: sel.xpath('//p').extract()
Out[18]: [u'<p>First Paragraph</p>', u'<p>Second Paragraph</p>']
In [19]: sel.xpath('//p[1]').extract()
Out[19]: [u'<p>First Paragraph</p>']
In [20]: sel.xpath('//p[2]').extract()
Out[20]: [u'<p>Second Paragraph</p>']
In [21]: sel.xpath('//p')[0].extract()
Out[21]: u'<p>First Paragraph</p>'
In [22]: sel.xpath('//p')[1].extract()
Out[22]: u'<p>Second Paragraph</p>'
In [23]: sel.xpath('//p/text()')[1].extract()
Out[23]: u'Second Paragraph'
Some XPath tools:
- Viewing an element's XPath in Chrome: https://udemy-images.s3.amazonaws.com/redactor/2017-02-12_18-00-40-6fd2add5705fd0f5dbaf66a16683647d/CopyXPath.png
- XPath Helper (Chrome Extension)
- FireBug (Firefox Extension)
- FirePath (Firefox Extension)
- XPath Tester: http://www.freeformatter.com/xpath-tester.html
Exercise:
Question 3:
In the following code, how can you extract the URL only?
<a href="http://www.udemy.com">Udemy Platform</a>
Answer: //a/@href