A First Crawler Project
The source code for this project is at: GitHub - scrapy/quotesbot: This is a sample Scrapy project for educational purposes
The site's page looks like this:
[Image: the quotesbot page]
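We can scrape three parts of each entry on the page: the quote text, the author, and the tags. Let's start!
Step one:
Create a new project; let's just call it quotesbot. In a terminal, from a directory of your choice, run the following command: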
scrapy startproject quotesbot
You should then see the following directory structure:
[Image: directory structure]
We'll skip the details of the directory structure for now.
Step two:
Write the source code. Create a new file in the spiders directory; you can name it quotesbot.py.
The source is as follows:
from scrapy import Spider


class Quotesbot(Spider):
    # Name used on the command line: scrapy crawl quotesbot
    name = 'quotesbot'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">
        quotes = response.xpath("//div[@class='quote']")
        for quote in quotes:
            yield {
                'text': quote.xpath("./span[@class='text']/text()").extract_first(),
                'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
                # extract() returns a list with every matching tag
                'tags': quote.xpath(".//a[@class='tag']/text()").extract(),
            }
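To see why the spider uses "./" for the text but ".//" for the author and tags, here is a minimal offline sketch. It does not use Scrapy: it relies on the standard library's xml.etree.ElementTree, which supports only a subset of XPath (so it reads .text instead of using /text()), and it runs against a hand-written fragment. SAMPLE and parse_quotes are hypothetical names, and the fragment only assumes the nesting that the spider's XPath expressions imply, not the real page markup.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment that mimics the nesting of one quote block
# (an assumption for illustration, not a copy of the live HTML).
SAMPLE = """
<body>
  <div class="quote">
    <span class="text">A sample quote.</span>
    <span>by <small class="author">Some Author</small></span>
    <div class="tags">
      <a class="tag">first</a>
      <a class="tag">second</a>
    </div>
  </div>
</body>
"""


def parse_quotes(markup):
    """Return one dict per quote block, shaped like the items the spider yields."""
    root = ET.fromstring(markup)
    items = []
    for quote in root.findall(".//div[@class='quote']"):
        # "./" only looks at direct children of the quote <div>.
        text = quote.find("./span[@class='text']")
        # ".//" searches all descendants; <small> is nested inside a <span>.
        author = quote.find(".//small[@class='author']")
        tags = quote.findall(".//a[@class='tag']")
        items.append({
            "text": text.text if text is not None else None,
            "author": author.text if author is not None else None,
            "tags": [a.text for a in tags],
        })
    return items


items = parse_quotes(SAMPLE)
print(items)
```

In this sketch, swapping ".//small" for "./small" would find nothing, because the author element is not a direct child of the quote block.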
Step three:
Enter the quotesbot directory and run the following command in a terminal:
scrapy crawl quotesbot -o quotesbot.json
-o saves the scraped data to the file named after it.
When the run finishes, the new file appears in the directory.
[Image: the newly generated file in the directory]
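The file's contents look like this. (In this listing each tags value is a single string; with extract() as written in the spider above, tags would instead be a list of every tag on the quote.)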
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": "change"},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": "abilities"},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": "inspirational"},
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": "aliteracy"},
{"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe", "tags": "be-yourself"},
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": "adulthood"},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": "life"},
{"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": "edison"},
{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": "misattributed-eleanor-roosevelt"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": "humor"}
]
As you can see, we got the data we expected. Good job!
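The \u201c and \u00e9 sequences in the file are ordinary JSON escapes for non-ASCII characters, and any JSON parser turns them back into the real characters. Below is a small sketch using Python's json module on two records copied from the output above; SAMPLE_FEED is a hypothetical name, and the records are embedded as a string so the snippet runs without the file.

```python
import json

# Two records copied from the output above, kept in their escaped form.
# The raw string (r'''...''') stops Python from decoding \u201c itself,
# so json.loads sees exactly what is stored in the file.
SAMPLE_FEED = r'''[
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": "adulthood"},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": "life"}
]'''

quotes = json.loads(SAMPLE_FEED)

# Escapes are decoded into the actual characters.
authors = [q["author"] for q in quotes]
first_char = quotes[0]["text"][0]

print(authors)
print(first_char)
```

To read the real file, json.load on quotesbot.json opened with encoding="utf-8" should behave the same way. If you would rather have the file itself contain the literal characters, Scrapy's FEED_EXPORT_ENCODING setting (for example "utf-8" in settings.py) is the usual place to look.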