1 安装Scrapy

Scrapy是一个为了爬取网站数据，提取结构性数据而编写的应用框架。可以应用在包括数据挖掘，信息处理或存储历史数据等一系列的程序中。

本文编写一个简单的Python 爬虫用于抓取http://desk.zol.com.cn/的部分壁纸。

开发环境是mac OS ,python 版本是2.7.

step1 需要先安装python 的虚拟环境。virtualenv可以搭建虚拟且独立的python环境，可以使每个项目环境与其他项目独立开来，保持环境的干净，解决包冲突问题。

  pip install virtualenv

创建一个虚拟且独立空间。env 是虚拟环境的名称

virtualenv env

启动虚拟环境（就是运行目录env/bin 下的activate 文件）

. env/bin/activate

step2 安装Scrapy。

pip install Scrapy

安装Python 图形处理库,下载图片时需要使用到这个库。

pip install Pillow

step3 创建项目 ,download 是项目名称。

scrapy startproject download

2 编写爬虫

定义抓取的Item。第一步是定义我们需要爬取的数据结构。

items.py

import scrapy   
class DownloadItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field() //图片的网址
    images = scrapy.Field() //图片信息 scrapy 自动获取的
    images_page_url = scrapy.Field()
    images_catalog = scrapy.Field() //图片的存放目录

编写网络爬虫。在spiders目录下新建一个文件dmoz_spider.py，用于编写爬虫逻辑。我使用Chrome浏览器的开发者工具对网站的结构进行分析，使用scrapy 选择器提取响应的信息。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
    
import scrapy
from download.items import DownloadItem
    
class DmozSpider(scrapy.Spider):
    name = "download"
    allowed_domains = ["zol.com.cn"]
    start_urls = [
        "http://desk.zol.com.cn/"
    ]
    
    def parse(self, response):
        #连接到图片内容页面
        for sel in response.xpath('//a[@class="pic"]'):
            shortUrl = sel.xpath('@href').extract()[0]
            url = response.urljoin(shortUrl)
            yield scrapy.Request(url, callback=self.parse_article) #连接到内容页的调用，在回调函数处理。
            
        #连接到图片内容页面
        for sel in response.xpath('//a[@class="title"]'):
            shortUrl = sel.xpath('@href').extract()[0]
            url = response.urljoin(shortUrl)
            yield scrapy.Request(url, callback=self.parse_article)
    
    #真正下载图片的处理函数
    def parse_article(self, response):
        item = DownloadItem()
        bigImgUrl = response.xpath('//img[@id="bigImg"]/@src').extract() #获取图片的URL
        
        item['image_urls'] = bigImgUrl
        item['images_page_url'] = response.url
        url = response.url
        catalog = url.split('_')[-3]
        catalog = catalog.split('/')[-1]
        item['images_catalog'] = catalog #获取图片的目录 用于存放图片
        yield item
    
        nextPageUrl = response.xpath('//a[@id="pageNext"]/@href').extract()[0] #下一页
        
        if nextPageUrl.index('.html') >= 0:
            url = response.urljoin(nextPageUrl);
            yield scrapy.Request(url, callback=self.parse_article)

3 保存图片

图片的保存需要用到Scrapy 的图片处理Pipe。在setting.py中设置。先使用scrapy.pipelines.images.ImagesPipeline保存图片，再使用自己编写的 download.pipelines.DownloadPipeline对图片分类处理。

在setting.py 中设置。

 BOT_NAME = 'download'
 
 SPIDER_MODULES = ['download.spiders']
 NEWSPIDER_MODULE = 'download.spiders'
 
 #同时使用图片和文件管道
 ITEM_PIPELINES = {
                   'scrapy.pipelines.images.ImagesPipeline': 1,
                   'download.pipelines.DownloadPipeline':2,
                   }
 IMAGES_STORE = '/Users/superzhan/Documents/project/python/Scrapy/download/' # 图片存储路径

爬虫抓去到的数据需要通过pipelines 来分类保存。在pipelines.py 中对下载到的图片进行分类保存。

# -*- coding: utf-8 -*-
 
 import os
 import shutil
 
 class DownloadPipeline(object):
 
     #move file
     def process_item(self, item, spider):
         curPath = '/Users/superzhan/Documents/Project/python/Scrapy/download/'
         
         #分类后的图片目录
         targetPath ='/Users/superzhan/Documents/Project/python/Scrapy/download/Img/'
 
          #创建分类目录
         catalog = item['images_catalog']
         targetCatalog = os.path.join(targetPath,catalog)
         if False == os.path.exists(targetCatalog):
             os.mkdir(targetCatalog)
         
         images_path= item['images'][0]['path']
         full_image_path = os.path.join(curPath,images_path)
         target_image_path = os.path.join(targetCatalog,full_image_path.split('/')[-1])
         
         #分类
         shutil.move(full_image_path,target_image_path)
 
         return item