Crawler Classroom (20) | Writing Spiders: Using Item Pipeline

Author: 小怪聊职场 | Published 2018-03-25 17:03

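The previous chapters covered extracting data and wrapping it in Items. This chapter explains how to process the data once it has been scraped.
In the Scrapy framework, the Item Pipeline is the component that processes data. As shown in Figure 20-1, once an Item has been collected by a Spider it is passed to the Item Pipeline, where several components process it one after another in a defined order.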

Figure 20-1

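Each Item Pipeline component is a Python class that implements a few simple methods. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and not processed any further.
Typical uses of an Item Pipeline include: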

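Cleaning HTML data.
Validating scraped data (checking that an item contains certain fields).
Checking for duplicates (and dropping them).
Saving the scraped results to a database or a file.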

I. Writing an Item Pipeline Class
Writing an Item Pipeline is straightforward. Each Item Pipeline component is a standalone Python class that must implement the process_item method:

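process_item(self, item, spider)
Scrapy calls this method on every Item Pipeline component for each item. It must either return an Item object (or an instance of any subclass) or raise a DropItem exception. A dropped item is not processed by any later pipeline component.
Parameters:
item (Item object): the scraped item
spider (Spider object): the spider that scraped the item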
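
As a rough illustration of this contract, the sketch below passes items through two components by hand. The class names, the `price` field, and the `run_pipelines` helper are all assumptions made for the example; in a real project Scrapy's engine does the chaining, and the fallback `DropItem` class exists only so the snippet runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in used only when Scrapy is not installed."""


class RequirePricePipeline:
    """Hypothetical component: drops items that have no price."""

    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("missing price")
        return item


class AddTaxPipeline:
    """Hypothetical component: adds a derived field and passes the item on."""

    def process_item(self, item, spider):
        item["price_with_tax"] = round(item["price"] * 1.1, 2)
        return item


def run_pipelines(item, pipelines, spider=None):
    """Simplified stand-in for the engine: stop at the first DropItem."""
    for pipeline in pipelines:
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None  # later components never see a dropped item
    return item


pipelines = [RequirePricePipeline(), AddTaxPipeline()]
kept = run_pipelines({"name": "book", "price": 10.0}, pipelines)
dropped = run_pipelines({"name": "pen"}, pipelines)
```

Here `kept` has passed through both components, while `dropped` never reaches `AddTaxPipeline` because the first component raised `DropItem`.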

A component may also implement the following two methods:

open_spider(self, spider)
Called when the spider is opened.
Parameters:
spider (Spider object): the spider that was opened
close_spider(self, spider)
Called when the spider is closed.
Parameters:
spider (Spider object): the spider that was closed
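
To make the call order of these hooks concrete, here is a minimal sketch that records each call and then simulates a crawl by hand. `LifecycleLoggingPipeline` and the placeholder spider object are hypothetical; during a real crawl Scrapy invokes the hooks itself and passes the actual Spider instance.

```python
class LifecycleLoggingPipeline:
    """Hypothetical pipeline that records the order in which hooks run."""

    def __init__(self):
        self.events = []

    def open_spider(self, spider):
        # A real pipeline would open files or database connections here
        self.events.append("open_spider")

    def process_item(self, item, spider):
        self.events.append("process_item")
        return item

    def close_spider(self, spider):
        # A real pipeline would release its resources here
        self.events.append("close_spider")


# Simulate by hand what happens during a crawl that yields two items
pipeline = LifecycleLoggingPipeline()
spider = object()  # placeholder for the Spider instance
pipeline.open_spider(spider)
for item in ({"title": "a"}, {"title": "b"}):
    pipeline.process_item(item, spider)
pipeline.close_spider(spider)
```

The recorded events show one `open_spider` call, one `process_item` call per item, and a final `close_spider` call.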
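1. Saving scraped results to a database or a file
When a Scrapy project is created, a pipelines.py file is generated automatically to hold user-defined Item Pipelines. The DataSubmitJsonFilePipeline below is implemented in the pipelines.py of the tutorial project: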

import json


# Write each item to a file as one JSON object per line, using the standard json module
class DataSubmitJsonFilePipeline(object):
    def __init__(self):
        # Text mode: json.dumps returns str, which a file opened with 'wb' rejects on Python 3
        self.file = open('jianshuArticle.json', 'w', encoding='utf-8')

    # Serialize the item and append it to the file
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()

2. Checking for duplicates (and dropping them)
Omitted.
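
Although the details are left out here, one possible shape for such a component is sketched below. It assumes that each item has a `url` field that identifies it uniquely; both that field and the class name `DuplicatesPipeline` are choices made for the illustration, and the fallback `DropItem` exists only so the snippet runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in used only when Scrapy is not installed."""


class DuplicatesPipeline:
    """Hypothetical component: drops items whose url was already seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item["url"]  # assumed unique key for an item
        if url in self.seen_urls:
            raise DropItem(f"duplicate item: {url}")
        self.seen_urls.add(url)
        return item


pipeline = DuplicatesPipeline()
first = pipeline.process_item({"url": "https://example.com/a"}, None)
try:
    pipeline.process_item({"url": "https://example.com/a"}, None)
    duplicate_dropped = False
except DropItem:
    duplicate_dropped = True
```

The set lives in memory for the duration of one crawl, so this sketch does not detect duplicates across separate runs.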
3. Validating scraped data (checking that an item contains certain fields)
Omitted.
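
Although the details are left out here, a validation component could look roughly like the following sketch. The required field names `title` and `url` and the class name `RequiredFieldsPipeline` are assumptions for the example, and the fallback `DropItem` exists only so the snippet runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in used only when Scrapy is not installed."""

# Hypothetical list of fields every item must carry
REQUIRED_FIELDS = ("title", "url")


class RequiredFieldsPipeline:
    """Hypothetical component: drops items with missing or empty fields."""

    def process_item(self, item, spider):
        missing = [name for name in REQUIRED_FIELDS if not item.get(name)]
        if missing:
            raise DropItem("missing fields: " + ", ".join(missing))
        return item


pipeline = RequiredFieldsPipeline()
valid = pipeline.process_item({"title": "t", "url": "https://example.com"}, None)
try:
    pipeline.process_item({"title": "no url"}, None)
    error_message = None
except DropItem as exc:
    error_message = str(exc)
```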
4. Cleaning HTML data
Omitted.
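
Although the details are left out here, a cleaning component might be sketched as follows. It assumes the raw HTML sits in a `content` field and uses a deliberately crude regular expression to strip tags; the field name, the `strip_html` helper, and the class name are all hypothetical, and a real project would more likely rely on a proper HTML parser.

```python
import html
import re

# Crude tag matcher; adequate for a sketch, not for arbitrary HTML
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Replace tags with spaces, decode entities, and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


class CleanHtmlPipeline:
    """Hypothetical component: cleans the assumed 'content' field."""

    def process_item(self, item, spider):
        if item.get("content"):
            item["content"] = strip_html(item["content"])
        return item


cleaned = CleanHtmlPipeline().process_item(
    {"content": "<p>Hello <b>Scrapy</b> &amp; friends</p>"}, None
)
```

Because every tag becomes a space, text split by inline tags in the middle of a word would be separated; that trade-off keeps the sketch short.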
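II. Enabling an Item Pipeline Component
In the Scrapy framework, Item Pipeline components are optional, and you can enable any one or several of them. To enable a component, add its class to the ITEM_PIPELINES setting in the settings.py configuration file: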

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # tutorial is the project name; DataSubmitJsonFilePipeline is the class in pipelines.py that processes items
    'tutorial.pipelines.DataSubmitJsonFilePipeline': 1,
}

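The integer assigned to each class determines the order in which the components run: an item passes through them from the lowest value to the highest. By convention these values are kept in the 0-1000 range.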
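
The effect of these integer values can be sketched with plain Python. In the example below, only DataSubmitJsonFilePipeline comes from this tutorial; the other two class paths and all three priority values are hypothetical, and the `sorted` call merely mimics the ordering rule rather than reproducing Scrapy's internal code.

```python
# Hypothetical settings.py fragment with three components
ITEM_PIPELINES = {
    "tutorial.pipelines.DataSubmitJsonFilePipeline": 800,
    "tutorial.pipelines.ValidateFieldsPipeline": 100,   # hypothetical class
    "tutorial.pipelines.DropDuplicatesPipeline": 300,   # hypothetical class
}

# Lower values run first, regardless of the order in which keys are written
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for position, path in enumerate(execution_order, start=1):
    print(position, path, ITEM_PIPELINES[path])
```

Leaving gaps between the values, as here, makes it easy to slot in another component later without renumbering the existing ones.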
III. How settings.py, pipelines.py, and Item Pipeline Classes Work Together
Figures 20-2 and 20-3 show how settings.py, pipelines.py, and the Item Pipeline classes work together in a project: