Scrapy Crawler Practice Project [002] - Scraping 360 Photography Images

Author: akiraakito0514 | Published 2018-08-26 15:34

    Scraping 360's photography images

    Reference: Python 3 Web Crawler Development in Practice (《Python3网络爬虫开发实战》), p. 497, by Cui Qingcai (崔庆才)

    Goal: use Scrapy to crawl 360's photography images, save the metadata to MongoDB, and download the images locally

    Target URL: http://image.so.com/z?ch=photography

    Analysis / key points:

    1. Difficulty:
      a. Beginner level. The static page contains no image data; images are fetched via AJAX and rendered dynamically, with results returned as JSON;

    2. Image downloading: use the built-in ImagesPipeline, with a few methods overridden;

    3. Storage in MongoDB;
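    To get a feel for what the AJAX endpoint returns, here is a minimal sketch of parsing such a JSON response. The sample payload below is made up for illustration (only a few of the real fields are shown; the field names come from the item definition later in this post):

    ```python
    import json

    # Made-up sample mimicking the structure of image.so.com's /zj JSON response
    sample = '''
    {
      "total": 1200,
      "list": [
        {"id": "abc123", "group_title": "Sample title", "qhimg_url": "http://p0.so.qhimgs1.com/t01.jpg"},
        {"id": "def456", "group_title": "Another title", "qhimg_url": "http://p0.so.qhimgs1.com/t02.jpg"}
      ]
    }
    '''

    results = json.loads(sample)
    # The image records live under the 'list' key
    images = results.get('list', [])
    urls = [img['qhimg_url'] for img in images]
    print(urls)
    ```

    The real spider below does the same thing with `response.text` instead of a hard-coded string.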

    Steps:

    1. Create the Scrapy project and the images spider
    Terminal: > scrapy startproject images360
    Terminal: > scrapy genspider images image.so.com
    
    2. Configure settings.py
    # MongoDB configuration
    MONGO_URI = 'localhost'
    MONGO_DB = 'images360'
    
    # Default directory for downloaded images (used by ImagesPipeline)
    IMAGES_STORE = './images'
    
    # Ignore robots.txt (heh heh heh...)
    ROBOTSTXT_OBEY = False
    
    # Default request headers
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
    }
    
    # Enable the pipelines (ImagePipeline needs the highest priority, i.e. the lowest number)
    ITEM_PIPELINES = {
        'images360.pipelines.ImagePipeline': 300,
        'images360.pipelines.MongoPipeline': 301,
    }
    
    3. Write items.py
    from scrapy import Item, Field
    
    # Declare a Field for every piece of image information the API returns
    class ImageItem(Item):
        cover_height = Field()
        cover_imgurl = Field()
        cover_width = Field()
        dsptime = Field()
        group_title = Field()
        grpseq = Field()
        id = Field()
        imageid = Field()
        index = Field()
        label = Field()
        qhimg_height = Field()
        qhimg_thumb_url = Field()
        qhimg_url = Field()
        qhimg_width = Field()
        tag = Field()
        total_count = Field()
    
    4. Write pipelines.py
      a) ImagePipeline: adapted from the Scrapy official docs section
      "Downloading and processing files and images"
    # Image download pipeline
    from scrapy import Request
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline
    
    class ImagePipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            '''
            Re-request the image URL so the scheduler queues the download
            '''
            yield Request(url=item['qhimg_url'])
    
        def file_path(self, request, response=None, info=None):
            '''
            Override file_path to name the file after the last URL segment
            '''
            url = request.url
            file_name = url.split('/')[-1]
            return file_name
    
        def item_completed(self, results, item, info):
            '''
            Drop items whose image failed to download, so they are not saved to the database
            '''
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem('Image download failed')
            return item
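    The file_path override above simply keeps the last segment of the image URL as the file name. The same logic in isolation (the URL here is a made-up example):

    ```python
    # Mirror of the file_path logic: the file name is the last URL path segment
    url = 'http://p0.so.qhimgs1.com/t0123456789abcdef.jpg'
    file_name = url.split('/')[-1]
    print(file_name)  # t0123456789abcdef.jpg
    ```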
    

    b) MongoPipeline: adapted from the Scrapy official docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=mongo (code omitted)
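    For completeness, the omitted MongoPipeline can be sketched along the lines of the official docs example linked above. MONGO_URI and MONGO_DB match the settings.py values from step 2; the collection name 'images' is an assumption:

    ```python
    # Sketch of a MongoDB storage pipeline, following the Scrapy docs example.
    try:
        import pymongo
    except ImportError:
        pymongo = None  # pymongo is only needed at crawl time; the sketch still shows the structure

    class MongoPipeline(object):
        collection_name = 'images'  # assumed collection name

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # Read MONGO_URI / MONGO_DB from settings.py
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DB'),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Each item becomes one MongoDB document
            self.db[self.collection_name].insert_one(dict(item))
            return item
    ```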

    5. Write spiders > images.py
    Note:
    a) Override start_requests(self);
    b) Build the request URLs dynamically; assign the Fields dynamically and yield the corresponding ImageItem

    # Assign fields dynamically for each image and yield an ImageItem
    for image in images:
        item = ImageItem()
        for field in item.fields:
            if field in image.keys():
                item[field] = image.get(field)
        yield item
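    The field-filtering idea above can be shown with plain dicts, no Scrapy required: only keys that the item declares get copied over, and everything else in the raw JSON record is ignored. The field names and values here are illustrative:

    ```python
    # Stand-in for ImageItem.fields: the declared field names
    declared_fields = {'id', 'group_title', 'qhimg_url', 'tag'}

    # A raw JSON record may carry extra keys the item does not declare
    image = {'id': 'abc123', 'qhimg_url': 'http://p0.so.qhimgs1.com/t01.jpg', 'extra_key': 'ignored'}

    # Copy only the declared fields that are actually present
    item = {field: image[field] for field in declared_fields if field in image}
    print(item)
    ```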
    

    c) Full code:

    import json
    from scrapy import Spider, Request
    from images360.items import ImageItem
    
    class ImagesSpider(Spider):
        name = 'images'
        # allowed_domains = ['image.so.com']
        # start_urls = ['http://image.so.com/z?ch=photography']
    
        url = 'http://image.so.com/zj?ch=photography&sn={sn}&listtype=new&temp=1'
    
        # Override start_requests
        def start_requests(self):
            # Request the first 1200 images in batches of 30 (sn = 30, 60, ..., 1200)
            for sn in range(1, 41):
                yield Request(url=self.url.format(sn=sn * 30), callback=self.parse)
    
        def parse(self, response):
            results = json.loads(response.text)
            # Skip responses that carry no 'list' key
            if 'list' not in results.keys():
                return
            images = results.get('list')
    
            # Assign fields dynamically for each image and yield an ImageItem
            for image in images:
                item = ImageItem()
                for field in item.fields:
                    if field in image.keys():
                        item[field] = image.get(field)
                yield item
    

    6. Results

    [Screenshots: temp-1.png, temp-2.png]

    Summary

    1. A beginner-level project that builds further familiarity with the Scrapy workflow;
    2. Practice fetching and parsing a page's AJAX JSON responses;
    3. A first look at ImagesPipeline and how to override it as needed.

    Source: https://www.haomeiwen.com/subject/pieoiftx.html