Scrapy (2): Pipelines

Author: 万事万物 | Published: 2021-06-10 06:52


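Preface:

The previous chapter (https://www.jianshu.com/p/418fe7490d4c) was an introductory tutorial (really more my own study notes than a tutorial). It covered installing Scrapy, creating a Scrapy project, and generating a spider to run a simple crawl. The scraped data was not put to any use, though: it was simply printed to the console.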

    def parse(self, response):
        div_list=response.xpath("//div[@class='col-md-8']/div[@class='quote']")
        for div in div_list:
            item={}
            # Get the quote text
            item["text"]=div.xpath("./span[@class='text']/text()").extract_first()
            # Get the author name that follows "by"
            item["by_text"]=div.xpath(".//small[@class='author']/text()").extract_first()
            # Get the href of the <a> tag that follows "by"
            item['by_href']=div.xpath("./span/a/@href").extract_first()
            # Get all the tags
            tags_list=div.xpath("./div[@class='tags']/a")
            tags_item_list=[]
            for tags in tags_list:
                tags_item={}
                tags_item["href"]=tags.xpath('./@href').extract_first()
                tags_item["text"]=tags.xpath('./text()').extract_first()
                tags_item_list.append(tags_item)
            # Add the tag information to the item
            item["tags"]=tags_item_list
            print(item)
            # Print a line of dashes as a separator to make the output easier to read
            print('-'*20)

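Since Scrapy is a framework, it has its own conventions. The parse method is where data is extracted, so business logic does not belong there (even if that logic is only a print call). That work should be handed to a pipeline.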

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class TutorialPipeline:
    def process_item(self, item, spider):
        return item

The pipeline is commented out by default. To enable it, find ITEM_PIPELINES in settings.py and uncomment it.

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}

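ITEM_PIPELINES is a dictionary, so it can hold more than one entry; as you go further you can add your own pipeline classes to it.
'tutorial.pipelines.TutorialPipeline' is the import path of the pipeline class, and 300 is its order value. When several pipelines are enabled, this value decides the order in which they run: items pass through the pipelines from the lowest value to the highest, so a smaller value runs earlier. By convention the values fall in the 0-1000 range.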
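
To make the ordering rule concrete, here is a minimal sketch in plain Python (it does not import Scrapy). The second entry, `tutorial.pipelines.SaveToFilePipeline`, is a hypothetical name added only for illustration, and `execution_order` is my own helper rather than anything Scrapy provides.

```python
# ITEM_PIPELINES as it might look with two pipelines enabled.
# 'SaveToFilePipeline' is a hypothetical class name used only for illustration.
ITEM_PIPELINES = {
    "tutorial.pipelines.SaveToFilePipeline": 400,
    "tutorial.pipelines.TutorialPipeline": 300,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted by order value, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

Even though SaveToFilePipeline is written first in the dictionary, TutorialPipeline (300) comes before it (400), because only the numeric value matters.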

Now for a question: how does data get to the pipeline? Yield each item from the spider callback, and Scrapy passes it on to TutorialPipeline.
quotes.py

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        div_list=response.xpath("//div[@class='col-md-8']/div[@class='quote']")
        for div in div_list:
            item={}
            # Get the quote text
            item["text"]=div.xpath("./span[@class='text']/text()").extract_first()
            # Get the author name that follows "by"
            item["by_text"]=div.xpath(".//small[@class='author']/text()").extract_first()
            # Get the href of the <a> tag that follows "by"
            item['by_href']=div.xpath("./span/a/@href").extract_first()
            # Get all the tags
            tags_list=div.xpath("./div[@class='tags']/a")
            tags_item_list=[]
            for tags in tags_list:
                tags_item={}
                tags_item["href"]=tags.xpath('./@href').extract_first()
                tags_item["text"]=tags.xpath('./text()').extract_first()
                tags_item_list.append(tags_item)
            # Add the tag information to the item
            item["tags"]=tags_item_list
            # Hand the item to the pipeline
            yield item

pipelines.py

from itemadapter import ItemAdapter


class TutorialPipeline:
    def process_item(self, item, spider):
        # Still using print to output the data here
        print(item)
        print("-"*20)
        return item

Output

$ scrapy crawl quotes --nolog
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/change/page/1/', 'text': 'change'}, {'href': '/tag/deep-thoughts/page/1/', 'text': 'deep-thoughts'}, {'href': '/tag/thinking/page/1/', 'text': 'thinking'}, {'href': '/tag/world/page/1/', 'text': 'world'}]}
--------------------
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'by_text': 'J.K. Rowling', 'by_href': '/author/J-K-Rowling', 'tags': [{'href': '/tag/abilities/page/1/', 'text': 'abilities'}, {'href': '/tag/choices/page/1/', 'text': 'choices'}]}
--------------------
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}, {'href': '/tag/life/page/1/', 'text': 'life'}, {'href': '/tag/live/page/1/', 'text': 'live'}, {'href': '/tag/miracle/page/1/', 'text': 'miracle'}, {'href': '/tag/miracles/page/1/', 'text': 'miracles'}]}
--------------------
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'by_text': 'Jane Austen', 'by_href': '/author/Jane-Austen', 'tags': [{'href': '/tag/aliteracy/page/1/', 'text': 'aliteracy'}, {'href': '/tag/books/page/1/', 'text': 'books'}, {'href': '/tag/classic/page/1/', 'text': 'classic'}, {'href': '/tag/humor/page/1/', 'text': 'humor'}]}
--------------------
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'by_text': 'Marilyn Monroe', 'by_href': '/author/Marilyn-Monroe', 'tags': [{'href': '/tag/be-yourself/page/1/', 'text': 'be-yourself'}, {'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}]}
--------------------
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/adulthood/page/1/', 'text': 'adulthood'}, {'href': '/tag/success/page/1/', 'text': 'success'}, {'href': '/tag/value/page/1/', 'text': 'value'}]}
--------------------
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'by_text': 'André Gide', 'by_href': '/author/Andre-Gide', 'tags': [{'href': '/tag/life/page/1/', 'text': 'life'}, {'href': '/tag/love/page/1/', 'text': 'love'}]}
--------------------
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'by_text': 'Thomas A. Edison', 'by_href': '/author/Thomas-A-Edison', 'tags': [{'href': '/tag/edison/page/1/', 'text': 'edison'}, {'href': '/tag/failure/page/1/', 'text': 'failure'}, {'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}, {'href': '/tag/paraphrased/page/1/', 'text': 'paraphrased'}]}
--------------------
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'by_text': 'Eleanor Roosevelt', 'by_href': '/author/Eleanor-Roosevelt', 'tags': [{'href': '/tag/misattributed-eleanor-roosevelt/page/1/', 'text': 'misattributed-eleanor-roosevelt'}]}
--------------------
{'text': '“A day without sunshine is like, you know, night.”', 'by_text': 'Steve Martin', 'by_href': '/author/Steve-Martin', 'tags': [{'href': '/tag/humor/page/1/', 'text': 'humor'}, {'href': '/tag/obvious/page/1/', 'text': 'obvious'}, {'href': '/tag/simile/page/1/', 'text': 'simile'}]}
--------------------

The final output is the same, but the parsing and the processing are now decoupled, which makes the project easier to extend.
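
Once processing lives in a pipeline, replacing print with real storage is a local change. The following is a rough sketch of a pipeline that writes each item to a file as one JSON object per line. The class name `JsonLinesPipeline` and the file name `quotes.jl` are my own assumptions, and the code assumes the items are plain dicts, as in the spider here. `open_spider` and `close_spider` are the hooks Scrapy calls when a spider starts and finishes.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write each item to a file, one JSON object per line."""

    def __init__(self, path="quotes.jl"):
        # 'quotes.jl' is an assumed output file name.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file.
        self.file.close()

    def process_item(self, item, spider):
        # Assumes the item is a plain dict, as yielded by the spider.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so that any later pipeline still receives it.
        return item
```

To try it, the class would go into pipelines.py and its import path would be added to ITEM_PIPELINES with an order value of your choice.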

Back to the main question: why use yield? Could we collect the items in a list and hand them over all at once?
Let's try it.

    def parse(self, response):
        div_list=response.xpath("//div[@class='col-md-8']/div[@class='quote']")
        item_list=[]
        for div in div_list:
            item={}
            # Get the quote text
            item["text"]=div.xpath("./span[@class='text']/text()").extract_first()
            # Get the author name that follows "by"
            item["by_text"]=div.xpath(".//small[@class='author']/text()").extract_first()
            # Get the href of the <a> tag that follows "by"
            item['by_href']=div.xpath("./span/a/@href").extract_first()
            # Get all the tags
            tags_list=div.xpath("./div[@class='tags']/a")
            tags_item_list=[]
            for tags in tags_list:
                tags_item={}
                tags_item["href"]=tags.xpath('./@href').extract_first()
                tags_item["text"]=tags.xpath('./text()').extract_first()
                tags_item_list.append(tags_item)
            # Add the tag information to the item
            item["tags"]=tags_item_list
            item_list.append(item)
        # Yield the whole list as a single object
        yield item_list

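Running it produces the error below. In the Scrapy version used here, each object a spider callback produces must be a Request object, an item (such as a dict), or None, but here it got a list. The reason is that yield item_list hands over the whole list as a single object, and a list is not an item. (A callback may also return an iterable of items, so return item_list would be accepted, but every item would then be held in memory first.)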

 ERROR: Spider must return request, item, or None, got 'list' in <GET http://quotes.toscrape.com/>
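
The error depends on what each produced object is. The sketch below is a simplified stand-in for that check, written in plain Python; `classify` and `handle_callback_result` are hypothetical helpers of my own, not Scrapy's real functions, and only dicts are treated as items here.

```python
def classify(obj):
    """Mimic the check applied to each object a callback produces (dicts only)."""
    if obj is None:
        return "ignored"
    if isinstance(obj, dict):
        return "item"
    return f"error: got {type(obj).__name__!r}"


def handle_callback_result(result):
    """Iterate over a callback's result and classify every element."""
    return [classify(obj) for obj in result]


items = [{"text": "a"}, {"text": "b"}]


def yields_whole_list():
    yield items        # one object, of type list


def yields_each_item():
    yield from items   # two dict objects, one at a time


whole = handle_callback_result(yields_whole_list())
each = handle_callback_result(yields_each_item())
returned = handle_callback_result(items)  # a returned list is iterated element by element

print(whole)
print(each)
print(returned)
```

Only the first variant produces an error entry, because the generator's single element is the list itself.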

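yield turns parse into a generator, which works like a pipe: each item is handed over as soon as it is produced, so the receiving side can start processing while the spider is still parsing. This keeps processing efficient and reduces memory use.

Suppose a list were used instead. What would the drawbacks be?

All scraped data has to be stored in the list, so a large enough crawl could exhaust memory.
Even with enough memory, the pipeline sits idle while the list is being filled (it has to wait for the data to arrive), which lowers efficiency.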
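
The difference in timing can be seen without Scrapy at all. This is a small plain-Python sketch; `parse_with_yield`, `parse_with_list`, and `consume` are illustrative names of my own, and the real engine schedules work asynchronously, so treat this only as a model of the idea.

```python
events = []


def parse_with_yield(n):
    """Produce items one at a time, like a callback that uses yield."""
    for i in range(n):
        events.append(f"parsed {i}")
        yield {"id": i}


def parse_with_list(n):
    """Build every item first and hand them over together."""
    result = []
    for i in range(n):
        events.append(f"parsed {i}")
        result.append({"id": i})
    return result


def consume(items):
    """Stand-in for the pipeline: handle each item as it arrives."""
    for item in items:
        events.append(f"processed {item['id']}")


consume(parse_with_yield(2))
lazy_order = list(events)

events.clear()
consume(parse_with_list(2))
eager_order = list(events)

print(lazy_order)   # parsing and processing alternate
print(eager_order)  # all parsing happens before any processing
```

With the generator, only one item exists at a time between parsing and processing; with the list, all of them are held until parsing is complete.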

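Using pipelines

As the dictionary form of ITEM_PIPELINES suggests, more than one pipeline can be defined.
Why multiple pipelines are needed:

There may be several spiders, with different pipelines handling the items from each.
The items from a single spider may need several different operations, such as being stored in different databases.
Note:
The smaller a pipeline's order value, the earlier it runs.
The process_item method of a pipeline must keep that exact name.

Inside process_item, spider.name gives the name of the spider that produced the item.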
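
The points above can be put together in a short sketch. Both pipeline classes, the `run_chain` helper, and the second spider name `authors` are hypothetical; the spiders are replaced by stub objects that only carry a name attribute, so the code runs without Scrapy.

```python
from types import SimpleNamespace


class QuotesOnlyPipeline:
    """Hypothetical pipeline that only changes items from the 'quotes' spider."""

    def process_item(self, item, spider):
        if spider.name == "quotes":
            # Example transformation: strip the curly quotation marks.
            item["text"] = item["text"].strip("“”")
        # Always return the item so the next pipeline receives it.
        return item


class TagCountPipeline:
    """Hypothetical second pipeline: record how many tags the item has."""

    def process_item(self, item, spider):
        item["tag_count"] = len(item.get("tags", []))
        return item


def run_chain(item, spider, pipelines):
    """Stand-in for the framework: pass the item through the pipelines in order."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


chain = [QuotesOnlyPipeline(), TagCountPipeline()]
quotes_spider = SimpleNamespace(name="quotes")   # stub with only a name attribute
other_spider = SimpleNamespace(name="authors")   # hypothetical second spider

from_quotes = run_chain({"text": "“Hi”", "tags": [{"text": "humor"}]}, quotes_spider, chain)
from_other = run_chain({"text": "“Hi”", "tags": []}, other_spider, chain)

print(from_quotes)
print(from_other)
```

Each process_item returns the item, which is what lets the second pipeline see the result of the first.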

Summary

These are my own study notes. If anything is wrong or poorly explained, please leave a comment below and I will correct it promptly.
