Scrapy: Crawling Images and Storing Them in Grouped Folders


Author: 爱搞事的喵 | Published 2018-12-28 19:26

1. Create the project: scrapy startproject tutorial

2. Create the spider: scrapy genspider imageSpider lab.scrapyd.cn (genspider takes both a spider name and a domain)

3. Define the Item to be scraped

# Item for the image download pipeline
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()

4. Spider code. Note: the image links extracted here come back as a list, so the pipeline needs to handle that.

import scrapy
from tutorial.items import ImageItem

class ImagespiderSpider(scrapy.Spider):
    name = 'imageSpider'
    allowed_domains = ['lab.scrapyd.cn']
    start_urls = ['http://lab.scrapyd.cn/archives/55.html',
                  'http://lab.scrapyd.cn/archives/57.html',
                  ]

    def parse(self, response):
        item = ImageItem()
        image_urls = response.css(".post img::attr(src)").extract()
        item['image_urls'] = image_urls
        item['image_name'] = response.css(".post-title a::text").extract_first()
        yield item
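One caveat with the extraction above: the src attributes a page returns may be site-relative, and ImagesPipeline needs absolute URLs to download from. Scrapy's response.urljoin handles this; a minimal standard-library sketch of the same idea, using a hypothetical page URL and src values:

```python
from urllib.parse import urljoin

# Hypothetical page URL and extracted src values, for illustration only.
page_url = "http://lab.scrapyd.cn/archives/57.html"
srcs = [
    "/usr/uploads/2018/cat.jpg",                       # site-relative
    "http://lab.scrapyd.cn/usr/uploads/2018/dog.jpg",  # already absolute
]

# Resolve every src against the page URL, mirroring response.urljoin().
image_urls = [urljoin(page_url, src) for src in srcs]
```

In the spider itself this would be `item['image_urls'] = [response.urljoin(u) for u in image_urls]`.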

5. Write the pipeline that groups the downloaded images into folders

import re
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class ImagePipline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item["image_urls"]:
            yield Request(image_url, meta={'name': item['image_name']})

    # Override file_path to choose our own file names; otherwise
    # images are saved under an opaque hash of the URL.
    def file_path(self, request, response=None, info=None):
        # File name: the last segment of the image URL
        img_name = request.url.split('/')[-1]
        # Folder name: the post title passed along in request.meta
        name = request.meta['name']
        # Strip characters that are illegal or unwanted in a folder name
        name = re.sub(r'[?\\*|“<>:/()0123456789]', '', name)
        # Store each post's images in its own folder
        filename = u'{0}/{1}'.format(name, img_name)
        return filename

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        # The comprehension above is equivalent to:
        # for ok, x in results:
        #     if ok:
        #         print(x['path'])
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_urls'] = image_paths
        return item
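Both overrides can be exercised without running Scrapy. A small sketch (the title, URL, and results values here are made up) showing how the regex turns a post title into a folder name, and how the (success, info) tuples that item_completed receives are filtered:

```python
import re

def sanitize(name):
    # Same pattern as the pipeline: drop Windows-illegal characters,
    # the fullwidth quote, parentheses, and digits from the folder name.
    return re.sub(r'[?\\*|“<>:/()0123456789]', '', name)

# Hypothetical post title and image URL.
title = "Python(3): tips"
img_name = "http://lab.scrapyd.cn/usr/uploads/2018/cat.jpg".split('/')[-1]
filename = u'{0}/{1}'.format(sanitize(title), img_name)
# filename is now "Python tips/cat.jpg"

# item_completed receives a list of (success, info_or_failure) tuples;
# failed downloads carry a Failure instead of the info dict.
results = [
    (True, {'url': 'http://lab.scrapyd.cn/usr/uploads/2018/cat.jpg',
            'path': filename, 'checksum': 'abc123'}),
    (False, Exception("download failed")),
]
image_paths = [x['path'] for ok, x in results if ok]
# image_paths == ["Python tips/cat.jpg"]
```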

6. Remember to enable the pipeline in settings.py

ITEM_PIPELINES = {
    'tutorial.pipelines.ImagePipline': 300,
}
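ImagesPipeline also refuses to run unless a storage root is configured (and it needs Pillow installed). The folders produced by file_path above are created under this directory; the path here is just an example:

```python
# settings.py — required by ImagesPipeline; downloads land under this folder.
IMAGES_STORE = 'images'
```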

Original link: https://www.haomeiwen.com/subject/qwvelqtx.html