Using FilesPipeline and ImagesPipeline

Author: 喵帕斯0_0 | Published 2018-05-23 00:12

    Besides scraping text, we often also need to download files, videos, images, archives, and so on. Scrapy ships with FilesPipeline and ImagesPipeline, dedicated to downloading plain files and images respectively. Both are very easy to use; let's start with FilesPipeline.

    FilesPipeline

    The FilesPipeline workflow is as follows:

    1. The spider scrapes the urls of the files to download and puts them in the item's file_urls field;
    2. the spider yields the item, which is passed along the pipeline chain;
    3. when FilesPipeline processes the item, it checks for a file_urls field and, if present, hands the urls to the Scrapy scheduler and downloader;
    4. once the downloads finish, the results are written to another item field, files; for each file it records the current local path (relative to the FILES_STORE setting), a checksum, and the original url.

    From the workflow above, using FilesPipeline requires three things:

    1. the Item must define the two fields file_urls and files;
    2. FilesPipeline must be enabled in the settings;
    3. the download directory FILES_STORE must be configured.

    The following example downloads the Python code under https://twistedmatrix.com/documents/current/core/examples/:

    # items.py
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class ExamplesItem(scrapy.Item):
        file_urls = scrapy.Field()  # urls of the files to download
        files = scrapy.Field()      # filled in with download results once the files are fetched
    
    #example.py
    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import ExamplesItem
    
    class ExamplesSpider(scrapy.Spider):
        name = 'examples'
        allowed_domains = ['twistedmatrix.com']
        start_urls = ['https://twistedmatrix.com/documents/current/core/examples/']
    
        def parse(self, response):
            urls  = response.css('a.reference.download.internal::attr(href)').extract()
            for url in urls:
                yield ExamplesItem(file_urls = [response.urljoin(url)])
    
    # settings.py
    # ...
    ITEM_PIPELINES = {
        'scrapy.pipelines.files.FilesPipeline': 1,
    }
    FILES_STORE = '/root/TwistedExamples/file_store'
    # ...
    
    

    Run scrapy crawl examples and the downloaded files appear under FILES_STORE/full. At this point each file is named after the SHA1 hash of its url (a quick sketch of this naming follows the figure below); how to choose your own names is covered later. Next, let's look at ImagesPipeline.

    [Figure: FilesPipeline.png]
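
    As an aside, the stored name can be predicted by hand: it is the SHA1 hash of the request url plus the original extension, which is exactly what FilesPipeline.file_path does (its source is quoted later in this article). A minimal sketch, using an illustrative url from the examples page:

    import hashlib
    import os

    url = 'https://twistedmatrix.com/documents/current/core/examples/echoserv.py'  # illustrative url
    media_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    media_ext = os.path.splitext(url)[1]
    print('full/%s%s' % (media_guid, media_ext))  # e.g. full/<40 hex chars>.py
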
    ImagesPipeline

    ImagesPipeline works almost the same way as FilesPipeline; only the field names and settings differ, as shown below:

                             FilesPipeline                            ImagesPipeline
    Package                  scrapy.pipelines.files.FilesPipeline     scrapy.pipelines.images.ImagesPipeline
    Item fields              file_urls / files                        image_urls / images
    Storage path setting     FILES_STORE                              IMAGES_STORE

    In addition, ImagesPipeline offers two extra features (see the settings sketch after this list):

    1. generating thumbnails, configured via IMAGES_THUMBS = {'size_name': (width, height), };
    2. filtering out images that are too small, configured via IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH.
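
    A minimal settings.py sketch showing both features; the thumbnail names and the size values here are only illustrative:

    # settings.py
    IMAGES_THUMBS = {
        'small': (50, 50),      # thumbnails stored under <IMAGES_STORE>/thumbs/small/
        'big': (270, 270),      # thumbnails stored under <IMAGES_STORE>/thumbs/big/
    }
    IMAGES_MIN_HEIGHT = 110     # skip images shorter than 110 px
    IMAGES_MIN_WIDTH = 110      # skip images narrower than 110 px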

    The following example crawls the photos under http://image.so.com/z?ch=beauty to see ImagesPipeline in action.
    Inspecting the requests made by that page shows that the image addresses come from the API http://image.so.com/zj?ch=beauty&sn=0&listtype=new&temp=1, where sn is the offset into the image list; each call returns 30 image entries by default as a JSON string like this:

    
        "end": false,
        "count": 30,
        "lastid": 30,
        "list": [{
            "id": "b0cd2c3beced890b801b845a7d2de081",
            "imageid": "f90d2737a6d14cbcb2f1f2d5192356dc",
            "group_title": "清纯美女户外迷人写真笑颜迷人",
            "tag": "萌女",
            "grpseq": 1,
            "cover_imgurl": "http:\/\/i1.umei.cc\/uploads\/tu\/201608\/80\/0dexb2tjurx.jpg",
            "cover_height": 960,
            "cover_width": 640,
            "total_count": 8,
            "index": 1,
            "qhimg_url": "http:\/\/p0.so.qhmsg.com\/t017d478b5ab2f639ff.jpg",
            "qhimg_thumb_url": "http:\/\/p0.so.qhmsg.com\/sdr\/238__\/t017d478b5ab2f639ff.jpg",
            "qhimg_width": 238,
            "qhimg_height": 357,
            "dsptime": ""
        },
        ... (entries omitted)
        , {
            "id": "37f6474ea039f34b5936eb70d77c057c",
            "imageid": "3125c84c138f1d31096f620c29b94512",
            "group_title": "美女萝莉铁路制服写真清纯动人",
            "tag": "萌女",
            "grpseq": 1,
            "cover_imgurl": "http:\/\/i1.umei.cc\/uploads\/tu\/201701\/798\/kuojthsyf1j.jpg",
            "cover_height": 587,
            "cover_width": 880,
            "total_count": 8,
            "index": 30,
            "qhimg_url": "http:\/\/p2.so.qhimgs1.com\/t0108dc82794264fe32.jpg",
            "qhimg_thumb_url": "http:\/\/p2.so.qhimgs1.com\/sdr\/238__\/t0108dc82794264fe32.jpg",
            "qhimg_width": 238,
            "qhimg_height": 159,
            "dsptime": ""
        }]
    }
    

    We can take each image's link from the qhimg_url field of the response. The code is as follows:

    #items.py
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class BeautyItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()
    
    #beauty.py
    # -*- coding: utf-8 -*-
    import scrapy
    import json
    from ..items import BeautyItem
    
    class BeautypicSpider(scrapy.Spider):
        name = 'beautypic'
        allowed_domains = ['image.so.com']
        url_pattern = 'http://image.so.com/zj?ch=beauty&sn={offset}&listtype=new&temp=1'
    #    start_urls = ['http://image.so.com/']
        def start_requests(self):
            step = 30
            for page in range(0,3):
                url = self.url_pattern.format(offset = page*step)
                yield scrapy.Request(url, callback = self.parse)
    
        def parse(self, response):
            ret = json.loads(response.body)
            for row in ret['list']:
                yield BeautyItem(image_urls=[row['qhimg_url']], name = row['group_title'])
    
    #settings.py
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    ITEM_PIPELINES = {
         'scrapy.pipelines.images.ImagesPipeline':5,
    }
    IMAGES_STORE = '/root/beauty/store_file'
    

    The downloaded image files look like this:


    [Figure: ImagesPipeline.png]
    Changing the default file names

    As seen with both FilesPipeline and ImagesPipeline, the downloaded file names are rather cryptic: they are SHA1 hashes of the urls, which mainly keeps files with the same name from overwriting each other. Sometimes, though, we want the files named the way we expect. For downloaded files, reading the FilesPipeline source shows that the name is decided by FilesPipeline.file_path; the relevant part is:

    class FilesPipeline(MediaPipeline):
       ...
       def file_path(self, request, response=None, info=None):
            ## start of deprecation warning block (can be removed in the future)
            def _warn():
                from scrapy.exceptions import ScrapyDeprecationWarning
                import warnings
                warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
                              'file_path(request, response=None, info=None) instead',
                              category=ScrapyDeprecationWarning, stacklevel=1)
    
            # check if called from file_key with url as first argument
            if not isinstance(request, Request):
                _warn()
                url = request
            else:
                url = request.url
    
            # detect if file_key() method has been overridden
            if not hasattr(self.file_key, '_base'):
                _warn()
                return self.file_key(url)
            ## end of deprecation warning block
    
            media_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
            media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
            return 'full/%s%s' % (media_guid, media_ext)
        ...
    
    

    We can therefore subclass FilesPipeline and override its file_path() method to redefine the file name. The new custom SelfDefineFilePipline looks like this:

    #pipelines.py
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    from scrapy.pipelines.files import FilesPipeline
    from urllib.parse import urlparse
    import os
    class MatplotlibExamplesPipeline(object):
        def process_item(self, item, spider):
            return item
    
    
    class SelfDefineFilePipline(FilesPipeline):
        """
        继承FilesPipeline,更改其存储文件的方式
        """
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
    
        def file_path(self, request, response=None, info=None):
            parse_result = urlparse(request.url)
            path = parse_result.path
            basename = os.path.basename(path)
            return basename
    

    Enable SelfDefineFilePipline in settings.py, as sketched below, and run the spider again; the downloaded results are shown in the figure that follows.
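
    A minimal settings.py sketch for enabling it; the package name myproject below is a placeholder for your own project's package:

    # settings.py
    ITEM_PIPELINES = {
        # 'myproject' is a placeholder: replace with your project's package name
        'myproject.pipelines.SelfDefineFilePipline': 1,
    }
    FILES_STORE = '/root/TwistedExamples/file_store'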

    [Figure: SelfDefineFilePipline.png]

    This is only one approach, shown mainly to illustrate the idea; there are many ways to change the file name, and the right one depends on the scenario. In the image-download example above, for instance, the url does not carry the image's name, so overriding file_path() alone cannot produce the name we want, because item['name'] is never passed in. Looking through the source, the images are downloaded by the Request objects created in get_media_requests(), and that method does receive the item; so we can pass item['name'] through the Request's meta parameter and read it back inside file_path(). Reading the source like this is also a good way to learn a framework. A sketch of this meta-passing approach follows.
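
    A minimal sketch of that idea, building on the BeautyItem example above; the class name SelfDefineImagePipeline and the naming scheme are illustrative additions, not code from the original project:

    # pipelines.py
    import os
    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline

    class SelfDefineImagePipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # attach the item's name to every download request via meta
            for url in item['image_urls']:
                yield Request(url, meta={'name': item['name']})

        def file_path(self, request, response=None, info=None):
            # read the name back from meta and keep the url's file extension
            name = request.meta.get('name', 'unnamed')
            ext = os.path.splitext(request.url)[1] or '.jpg'
            return 'full/%s%s' % (name, ext)

    Note that items sharing the same name would overwrite each other; in practice you might append an index or a short hash of the url.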

    Summary

    This article covered how to use Scrapy's built-in FilesPipeline and ImagesPipeline to download files and images, and then how to subclass them and override their methods to redefine how downloaded files are named. The next article will look at LinkExtractor for quickly extracting links and at Exporters for writing results to files.
