3 Scrapy Crawling (1)

Author: 法号无涯 | Published 2017-11-10 09:36

    This walkthrough uses the page http://quotes.toscrape.com/ as an example.
    Command:
    scrapy shell 'http://quotes.toscrape.com/'

    In [4]: response.xpath('//*[@class="quote"]')
    Out[4]: 
    [<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
     <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>]
    
    In [5]: quotes = response.xpath('//*[@class="quote"]')
    
    In [6]: quote = quotes[0]
    
    In [7]: quote
    Out[7]: <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>
    
    In [8]: quote.extract()
    Out[8]: u'<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>'
    

    Processing a single quote:

    In [9]: quote.xpath('.//*[@class="text"]')
    Out[9]: [<Selector xpath='.//*[@class="text"]' data=u'<span class="text" itemprop="text">\u201cThe '>]
    
    In [10]: quote.xpath('.//*[@class="text"]/text()').extract()
    Out[10]: [u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d']
    
    In [11]: quote.xpath('.//*[@class="text"]/text()').extract_first()
    Out[11]: u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
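
The two calls above differ only in shape: extract() always returns a list, while extract_first() returns the first match, or None when nothing matches (it also accepts a default argument). The sketch below imitates that difference in plain Python so it runs without Scrapy. `first_or_default` is a hypothetical helper written for this illustration, not part of Scrapy's API, and the quote text is shortened.

```python
def first_or_default(values, default=None):
    """Return the first item of a list, or `default` if the list is empty."""
    return values[0] if values else default


# Stand-in for the list that extract() returned in Out[10] (text shortened).
matches = ["\u201cThe world as we have created it is a process of our thinking.\u201d"]
# Stand-in for extract() on an XPath that matches nothing.
no_matches = []

first = first_or_default(matches)                     # a single string, not a list
missing = first_or_default(no_matches)                # None instead of an IndexError
fallback = first_or_default(no_matches, default="")   # caller-chosen default

print(first)
print(missing)
print(repr(fallback))
```

In a spider this is why extract_first() is usually the safer choice for single-valued fields: indexing extract()[0] raises IndexError on pages where the element is absent.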
    

    The queries above select by class; the itemprop attribute works as well:

    In [12]: text = quote.xpath('.//*[@itemprop="text"]/text()').extract_first()
    
    In [13]: text
    Out[13]: u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
    

    For the single quote selector, if the leading dot . is left out:

    In [16]: quote.xpath('//*[@itemprop="text"]/text()').extract()
    Out[16]: 
    [u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d',
     u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d',
     u'\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d',
     u'\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d',
     u"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d",
     u'\u201cTry not to become a man of success. Rather become a man of value.\u201d',
     u'\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d',
     u"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d",
     u"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d",
     u'\u201cA day without sunshine is like, you know, night.\u201d']
    

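    This looks a little magical at first, but it is standard XPath behavior: a path that starts with // is absolute and searches from the document root, no matter which selector it is called on. The leading dot makes the path relative, so .// searches only the descendants of quote.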
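
The difference between searching inside one quote and searching the whole page can be reproduced offline. The sketch below is only an approximation: it swaps Scrapy's selectors for the standard library's xml.etree.ElementTree and uses a made-up three-quote page. ElementTree supports only a subset of XPath and rejects absolute paths on an element, so the whole-page search is emulated by running the same relative path from the root.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for the quotes page (invented for this sketch).
PAGE = """<html><body>
  <div class="quote"><span class="text" itemprop="text">first quote</span></div>
  <div class="quote"><span class="text" itemprop="text">second quote</span></div>
  <div class="quote"><span class="text" itemprop="text">third quote</span></div>
</body></html>"""

root = ET.fromstring(PAGE)
quotes = root.findall('.//*[@class="quote"]')
quote = quotes[0]

# Relative path evaluated on one quote: only that quote's text is found.
inside_quote = [el.text for el in quote.findall('.//*[@itemprop="text"]')]

# The same path evaluated from the document root: every quote on the page.
# In Scrapy, quote.xpath('//*[@itemprop="text"]') gives this kind of result,
# because a path starting with // ignores the node it is called on.
whole_page = [el.text for el in root.findall('.//*[@itemprop="text"]')]

print(inside_quote)  # ['first quote']
print(whole_page)    # ['first quote', 'second quote', 'third quote']
```

When looping over quotes in a spider, keeping the leading dot in every per-quote query avoids assigning all ten texts to each item.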
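    Getting the content attribute of the meta tag uses slightly different syntax: select the attribute with /@content instead of calling text().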

    In [20]: quote.xpath('.//*[@itemprop="keywords"]/@content').extract()
    Out[20]: [u'change,deep-thoughts,thinking,world']
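
The content attribute comes back as a single comma-separated string inside a list. If a list of individual tags is wanted, it can be split in plain Python. This is a minimal sketch; `split_keywords` is a hypothetical helper name, not something Scrapy provides.

```python
def split_keywords(content):
    """Split a comma-separated keywords string into a list of clean tags."""
    return [tag.strip() for tag in content.split(",") if tag.strip()]


# The value shown in Out[20] above.
keywords = ["change,deep-thoughts,thinking,world"]
tags = split_keywords(keywords[0])

print(tags)  # ['change', 'deep-thoughts', 'thinking', 'world']
```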
    
