Scrapy Study Notes 2: Introduction to XPath


Author: 法号无涯 | Published: 2017-11-10 08:27
    <html>
      <head>
        <title>Title of the page</title>
      </head>
      <body>
        <h1>H1 Tag</h1>
        <h2>H2 Tag with <a href="#">link</a></h2>
        <p>First Paragraph</p>
        <p>Second Paragraph</p>
      </body>
    </html>
    

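    Take the simple HTML page above as an example.
    Run the following commands in order:
    scrapy shell
    from scrapy.selector import Selector
    Then type the following into a text editor and copy it: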

    html_doc = '''
    <html>
      <head>
        <title>Title of the page</title>
      </head>
      <body>
        <h1>H1 Tag</h1>
        <h2>H2 Tag with <a href="#">link</a></h2>
        <p>First Paragraph</p>
        <p>Second Paragraph</p>
      </body>
    </html>
    '''
    

    Paste it into the shell with the %paste command. To parse the contents of html_doc with XPath, enter the following in order:

    In [9]: sel = Selector(text=html_doc)
    
    In [10]: sel.extract()
    Out[10]: u'<html>\n  <head>\n    <title>Title of the page</title>\n  </head>\n  <body>\n    <h1>H1 Tag</h1>\n    <h2>H2 Tag with <a href="#">link</a></h2>\n    <p>First Paragraph</p>\n    <p>Second Paragraph</p>\n  </body>\n</html>'
    
    In [11]: sel.xpath('/html/head/title')
    Out[11]: [<Selector xpath='/html/head/title' data=u'<title>Title of the page</title>'>]
    
    In [12]: sel.xpath('/html/head/title').extract()
    Out[12]: [u'<title>Title of the page</title>']
    
    In [13]: sel.xpath('//title').extract()
    Out[13]: [u'<title>Title of the page</title>']
    
    In [14]: sel.xpath('//text').extract()
    Out[14]: []
    
    In [15]: sel.xpath('//text()').extract()
    Out[15]: 
    [u'\n  ',
     u'\n    ',
     u'Title of the page',
     u'\n  ',
     u'\n  ',
     u'\n    ',
     u'H1 Tag',
     u'\n    ',
     u'H2 Tag with ',
     u'link',
     u'\n    ',
     u'First Paragraph',
     u'\n    ',
     u'Second Paragraph',
     u'\n  ',
     u'\n']
    
    In [16]: sel.xpath('/html/body/p')
    Out[16]: 
    [<Selector xpath='/html/body/p' data=u'<p>First Paragraph</p>'>,
     <Selector xpath='/html/body/p' data=u'<p>Second Paragraph</p>'>]
    
    In [17]: sel.xpath('/html/body/p').extract()
    Out[17]: [u'<p>First Paragraph</p>', u'<p>Second Paragraph</p>']
    
    In [18]: sel.xpath('//p').extract()
    Out[18]: [u'<p>First Paragraph</p>', u'<p>Second Paragraph</p>']
    
    In [19]: sel.xpath('//p[1]').extract()
    Out[19]: [u'<p>First Paragraph</p>']
    
    In [20]: sel.xpath('//p[2]').extract()
    Out[20]: [u'<p>Second Paragraph</p>']
    
    In [21]: sel.xpath('//p')[0].extract()
    Out[21]: u'<p>First Paragraph</p>'
    
    In [22]: sel.xpath('//p')[1].extract()
    Out[22]: u'<p>Second Paragraph</p>'
    
    In [23]: sel.xpath('//p/text()')[1].extract()
    Out[23]: u'Second Paragraph'
    

    Some XPath tools:

    1. Inspecting an element's XPath in Chrome: https://udemy-images.s3.amazonaws.com/redactor/2017-02-12_18-00-40-6fd2add5705fd0f5dbaf66a16683647d/CopyXPath.png
    2. XPath Helper (Chrome Extension)
    3. FireBug (Firefox Extension)
    4. FirePath (Firefox Extension)
    5. XPath Tester: http://www.freeformatter.com/xpath-tester.html
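
Simple expressions can also be tried locally without any of these tools. The sketch below swaps in Python's standard-library `xml.etree.ElementTree` rather than Scrapy, and that module supports only a limited subset of XPath: paths must be relative to the root element, and `text()` and `@attr` steps are not available. It also assumes the page is well-formed XML, which the sample page is but real-world HTML often is not. The helper name `try_xpath` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample page from this article; it happens to be well-formed XML.
page = """<html>
  <head><title>Title of the page</title></head>
  <body>
    <h1>H1 Tag</h1>
    <h2>H2 Tag with <a href="#">link</a></h2>
    <p>First Paragraph</p>
    <p>Second Paragraph</p>
  </body>
</html>"""


def try_xpath(document, expression):
    """Return the leading text of every element matched by the expression."""
    root = ET.fromstring(document)  # root is the <html> element
    return [element.text for element in root.findall(expression)]


# './body/p' is relative to <html>, the counterpart of '/html/body/p'.
print(try_xpath(page, "./body/p"))

# './/p[2]' selects a <p> that is the second <p> child of its parent.
print(try_xpath(page, ".//p[2]"))

# './/a' finds the link at any depth.
print(try_xpath(page, ".//a"))
```

For anything beyond these basic paths, the Scrapy shell or one of the tools listed above is the better choice.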

    Exercise:

    Question 3:
    In the following code, how can you extract the URL only?

    <a href="http://www.udemy.com">Udemy Platform</a>
    Answer: //a/@href
