Introduction to Scrapy

Author: ximengchj | Published: 2017-01-20 16:43

Scrapy at a glance: https://doc.scrapy.org/en/latest/intro/overview.html

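Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.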

Scrapy is not limited to extracting data from web pages: it can also extract data through APIs (such as Amazon Associates Web Services) or serve as a general-purpose web crawler.

Running a simple spider

Here is the code for a spider that scrapes famous quotes from http://quotes.toscrape.com:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page lives in a <div class="quote"> element.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the pagination link, if any, and parse the next page the same way.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```

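Save the code in a file named `quotes_spider.py` and run the spider with the `runspider` command:
`scrapy runspider quotes_spider.py -o quotes.json`
When the run finishes, you will have a `quotes.json` file containing the quotes in JSON format, each with its text and author, similar to this (reformatted here for readability):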

    [{
        "author": "Jane Austen",
        "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
    },
    {
        "author": "Groucho Marx",
        "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
    },
    {
        "author": "Steve Martin",
        "text": "\u201cA day without sunshine is like, you know, night.\u201d"
    },
    ...]
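
The `\u201c` and `\u201d` sequences in this output are JSON escapes for curly quotation marks. As a minimal sketch of reading the export back, assuming the file has the shape shown here, the standard `json` module decodes them automatically. The `sample` string is one entry copied from the output; a real script would read `quotes.json` from disk instead.

```python
import json

# One entry in the same shape as quotes.json (a raw string keeps the
# backslash escapes exactly as they appear in the file).
sample = r'''[{
"author": "Steve Martin",
"text": "\u201cA day without sunshine is like, you know, night.\u201d"
}]'''

# json.loads turns the \u201c / \u201d escapes into real curly quotes.
quotes = json.loads(sample)

if __name__ == "__main__":
    for quote in quotes:
        print(f'{quote["author"]}: {quote["text"]}')
```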

## What just happened?
When you ran the command `scrapy runspider quotes_spider.py`, Scrapy looked for a spider definition in the file and ran it through its crawler engine.

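The crawl starts by making requests to the URLs listed in the `start_urls` attribute (here, only the URL of the quotes in the humor category) and calls the default callback method `parse`, passing the response object as an argument. In the `parse` callback, we loop over the quote elements using a CSS selector, yield a Python dict with the extracted quote text and author, then look for a link to the next page and schedule another request that uses the same `parse` method as its callback.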
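
The callback flow can be illustrated with a toy, standard-library-only sketch. It is not Scrapy's engine: `FAKE_SITE`, `FakeRequest`, and `crawl` are hypothetical names invented for this example, and the "pages" are an in-memory dict rather than HTTP responses. It only shows the idea that a callback yields either items or follow-up requests, and that a loop keeps dispatching requests until none remain.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-memory "site": url -> (quotes on the page, next page or None)
FAKE_SITE = {
    "/page/1/": ([("Quote A", "Author A"), ("Quote B", "Author B")], "/page/2/"),
    "/page/2/": ([("Quote C", "Author C")], None),
}


@dataclass
class FakeRequest:
    """Stand-in for a request: a URL plus the callback that handles it."""
    url: str
    callback: Callable


def parse(url):
    quotes, next_page = FAKE_SITE[url]
    # Yield one dict per quote, like the spider's parse method does.
    for text, author in quotes:
        yield {"text": text, "author": author}
    # Schedule the next page with the same callback, if there is one.
    if next_page is not None:
        yield FakeRequest(next_page, callback=parse)


def crawl(start_urls):
    pending = deque(FakeRequest(url, parse) for url in start_urls)
    items = []
    while pending:
        request = pending.popleft()
        for result in request.callback(request.url):
            if isinstance(result, FakeRequest):
                pending.append(result)   # follow-up request: queue it
            else:
                items.append(result)     # scraped item: collect it
    return items


if __name__ == "__main__":
    for item in crawl(["/page/1/"]):
        print(item)
```

Unlike this sequential loop, the real engine handles many requests concurrently; the sketch only models how items and new requests come out of the same callback.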

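Here you can see one of Scrapy's main advantages: requests are scheduled and processed asynchronously. Scrapy does not need to wait for one request to finish and be processed before it sends another request or does other work in the meantime. It also means that other requests keep going even if some request fails or an error occurs while handling it.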
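
As a rough analogy only, the following sketch uses Python's standard `asyncio` module to show several simulated requests in flight at once, with one failure that does not stop the others. Scrapy's own asynchronous machinery is different (it is built on Twisted), and `fake_fetch`, `crawl`, `run_demo`, and the example.com URLs are all made up for illustration; no network access takes place.

```python
import asyncio


async def fake_fetch(url: str, delay: float, fail: bool = False) -> str:
    """Stand-in for a network request: sleeps instead of doing real I/O."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"request to {url} failed")
    return f"response from {url}"


async def crawl(jobs):
    # Start every request at once. return_exceptions=True means a failed
    # request is returned as an exception object instead of aborting the rest.
    tasks = [fake_fetch(url, delay, fail) for url, delay, fail in jobs]
    return await asyncio.gather(*tasks, return_exceptions=True)


def run_demo():
    jobs = [
        ("http://example.com/a", 0.03, False),
        ("http://example.com/b", 0.01, True),   # this one fails
        ("http://example.com/c", 0.02, False),
    ]
    return asyncio.run(crawl(jobs))


if __name__ == "__main__":
    for result in run_demo():
        print(repr(result))
```

The results come back in the order the jobs were submitted, and the failed request appears as a `RuntimeError` object alongside the two successful responses.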

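This lets you crawl very quickly (by sending multiple requests at the same time), and Scrapy also lets you make the crawl more polite through [a few settings](https://doc.scrapy.org/en/latest/topics/settings.html#topics-settings-ref). You can set a download delay between requests, limit the number of concurrent requests per domain or per IP, or use [an auto-throttling extension](https://doc.scrapy.org/en/latest/topics/autothrottle.html#topics-autothrottle) that works these out automatically.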
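
A configuration sketch of these politeness options might look like the following. The setting names are taken from the Scrapy settings and AutoThrottle documentation; the values are illustrative assumptions, not recommendations, and should be tuned for the site being crawled.

```python
# Politeness settings (illustrative values only)
DOWNLOAD_DELAY = 1.0                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # cap on simultaneous requests per domain

# AutoThrottle adjusts the delay automatically based on observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
```

In a Scrapy project these lines would normally go in `settings.py`. For a standalone spider run with `runspider`, the same keys can be placed in a `custom_settings` dict on the spider class.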

#### Note
This example uses a JSON file to export the results; you can also use XML or CSV, or write an item pipeline to store the items in a database.
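
As a minimal sketch of the database option, assuming SQLite as the store: an item pipeline is a plain Python class, and Scrapy calls its `open_spider`, `process_item`, and `close_spider` methods when they are defined. The class name `SQLiteQuotesPipeline`, the `quotes.db` file, and the table layout are hypothetical choices for this example.

```python
import sqlite3


class SQLiteQuotesPipeline:
    """Hypothetical pipeline that stores each scraped quote in a SQLite table."""

    def __init__(self, db_path="quotes.db"):
        # db_path is an assumption for this sketch; Scrapy itself creates
        # the pipeline without arguments, so the default is used.
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts: open the database.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.conn.execute(
            "INSERT INTO quotes (text, author) VALUES (?, ?)",
            (item["text"], item["author"]),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

The pipeline takes effect only after it is listed in the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.SQLiteQuotesPipeline": 300}`, where the module path is a placeholder for wherever the class actually lives.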
