Scrapy crawler practice --- day one

Author: 夜雨寒山 | Published 2017-11-06 13:55

    First crawler project

    The source code for this project is available at: GitHub - scrapy/quotesbot: This is a sample Scrapy project for educational purposes

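    The site's page looks like this:

    [Screenshot: quotesbot page]

    From each quote on the page we can scrape three parts: the text, the author, and the tags. Let's start!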

    step one:

    Create a new project; let's call it quotesbot. In a terminal, from the directory where the project should live, run the following command:

    scrapy startproject quotesbot
    

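    Scrapy then generates the following directory structure:

    [Screenshot: generated directory structure]

    We will leave the contents of this structure aside for now.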
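
    If you want to inspect the generated layout yourself, a short script can print it. This is only a sketch using the standard library; `format_tree` is a hypothetical helper name, and the script assumes it is run from the directory where the project was created.

```python
from pathlib import Path


def format_tree(root, indent=""):
    """Return the directory layout under `root` as a list of indented lines."""
    lines = []
    # Sort by name so the listing is stable across runs.
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name):
        # Mark directories with a trailing slash.
        lines.append(f"{indent}{entry.name}{'/' if entry.is_dir() else ''}")
        if entry.is_dir():
            lines.extend(format_tree(entry, indent + "    "))
    return lines


if __name__ == "__main__":
    # Assumption: the project directory is ./quotesbot.
    project = Path("quotesbot")
    if project.is_dir():
        print("\n".join(format_tree(project)))
```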

    step two:

    Write the source code. Create a new file in the spiders directory; it can be named quotesbot.py.
    The source is as follows:

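    from scrapy import Spider
    
    
    class Quotesbot(Spider):
        # Name used to select this spider with `scrapy crawl`.
        name = 'quotesbot'
        start_urls = ['http://quotes.toscrape.com/']
    
        def parse(self, response):
            # Each quote on the page sits in its own <div class="quote">.
            quotes = response.xpath("//div[@class='quote']")
            for quote in quotes:
                yield {
                    # extract_first() returns the first match, or None if there is none.
                    'text': quote.xpath("./span[@class='text']/text()").extract_first(),
                    'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
                    # extract() returns every match as a list; a quote can have several tags.
                    'tags': quote.xpath(".//a[@class='tag']/text()").extract(),
                }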
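
    To see what the three XPath expressions select without running a crawl, here is a minimal sketch that uses Python's standard-library `xml.etree.ElementTree` instead of Scrapy's selectors. ElementTree supports only a subset of XPath (it has no `text()`), so `.text` stands in for it, `find` plays the role of `extract_first()`, and `findall` the role of `extract()`. The markup is a hypothetical, simplified stand-in for one quote block, not the real page source, and `parse_quotes` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on a single quote block.
SAMPLE = """
<html><body>
  <div class="quote">
    <span class="text">"A sample quote."</span>
    <span>by <small class="author">Some Author</small></span>
    <div class="tags">
      <a class="tag">first</a>
      <a class="tag">second</a>
    </div>
  </div>
</body></html>
"""


def parse_quotes(markup):
    """Apply the spider's three path expressions to well-formed markup."""
    root = ET.fromstring(markup.strip())
    items = []
    for quote in root.findall(".//div[@class='quote']"):
        items.append({
            # "./span" only looks at direct children of the quote div.
            "text": quote.find("./span[@class='text']").text,
            # ".//" searches all descendants of the quote div.
            "author": quote.find(".//small[@class='author']").text,
            # findall returns every match, so tags becomes a list.
            "tags": [a.text for a in quote.findall(".//a[@class='tag']")],
        })
    return items


items = parse_quotes(SAMPLE)
print(items)
```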
    

    step three:

    Change into the quotesbot directory and run the following command in the terminal:

    scrapy crawl quotesbot -o quotesbot.json
    

    The -o option saves the scraped data to the file named after it.
    Once the run finishes, the new file appears in the directory.


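    [Screenshot: project directory containing the new quotesbot.json]

    The file contents are shown below. Note that each "tags" value here is a single string, which is what extract_first() would return; the spider as listed above calls extract(), which would produce a list of all of a quote's tags instead: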

    [
    {"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": "change"},
    {"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": "abilities"},
    {"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": "inspirational"},
    {"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": "aliteracy"},
    {"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe", "tags": "be-yourself"},
    {"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": "adulthood"},
    {"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": "life"},
    {"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": "edison"},
    {"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": "misattributed-eleanor-roosevelt"},
    {"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": "humor"}
    ]
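
    The `\u201c` and `\u201d` sequences in the file are JSON escapes for curly quotation marks, not corrupted text. Here is a minimal sketch of reading the data back with the standard library. The record is copied from the output above and embedded as a string so the example runs without the file.

```python
import json

# One record copied from the output above. In practice you would
# open quotesbot.json and call json.load() on the file object.
raw = r'''[
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": "humor"}
]'''

items = json.loads(raw)
first = items[0]

# The escapes are decoded into real curly-quote characters.
print(first["author"])
print(first["text"])
```

    If you would rather have the characters written unescaped in the file, Scrapy's `FEED_EXPORT_ENCODING` setting can be set to `'utf-8'` in settings.py.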
    

    The output contains the data we expected. Good job!
