1. Create the project
scrapy startproject tutorial
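This creates the standard Scrapy project skeleton, roughly as follows (newer Scrapy versions also add a middlewares.py):

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/
        __init__.py
        items.py          # Item definitions (step 3)
        pipelines.py      # pipelines (step 5)
        settings.py       # project settings (step 6)
        spiders/          # spiders live here (steps 2 and 4)
            __init__.py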
2. Create the spider (genspider takes both a spider name and a domain; run it inside the project directory)
scrapy genspider imageSpider lab.scrapyd.cn
3. Define the Item to scrape
import scrapy

# Item consumed by the image download pipeline
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()
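Note: image_urls and images are the field names that Scrapy's built-in ImagesPipeline reads and fills in by default (they can be changed with the IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD settings); image_name is a custom field added here so the pipeline can group files into folders.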
4. Spider code. Tip: the scraped image links come back as a list, so they need a bit of handling in the pipeline.
import scrapy
from tutorial.items import ImageItem

class ImagespiderSpider(scrapy.Spider):
    name = 'imageSpider'
    allowed_domains = ['lab.scrapyd.cn']
    start_urls = [
        'http://lab.scrapyd.cn/archives/55.html/',
        'http://lab.scrapyd.cn/archives/57.html',
    ]

    def parse(self, response):
        item = ImageItem()
        # all <img> src attributes in the post body; this is a list
        image_urls = response.css(".post img::attr(src)").extract()
        item['image_urls'] = image_urls
        # post title, used later as the folder name
        item['image_name'] = response.css(".post-title a::text").extract_first()
        yield item
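Before running the full crawl, you can sanity-check the two selectors in the Scrapy shell; the output shown is indicative, the real values depend on the live page:

scrapy shell http://lab.scrapyd.cn/archives/57.html
>>> response.css(".post img::attr(src)").extract()        # list of image URLs
>>> response.css(".post-title a::text").extract_first()   # post title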
5. Write the pipeline that groups the downloaded images
import re

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class ImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            # pass the post title along so file_path can use it as the folder name
            yield Request(image_url, meta={'name': item['image_name']})

    # Renaming: override file_path to control the file names yourself;
    # otherwise images are saved as full/<SHA1 hash of the URL>.jpg
    # (note: Scrapy 2.4+ also passes a keyword-only item argument here)
    def file_path(self, request, response=None, info=None):
        # file name of the image itself
        img_name = request.url.split('/')[-1]
        # folder name, taken from the post title
        name = request.meta['name']
        # strip characters that are illegal or unwanted in a folder name
        name = re.sub(r'[?\\*|"“”<>:/()0123456789]', '', name)
        # store each post's images in its own folder
        filename = u'{0}/{1}'.format(name, img_name)
        return filename

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        # the list comprehension above is equivalent to:
        # for ok, x in results:
        #     if ok:
        #         print(x['path'])
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_urls'] = image_paths
        return item
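As a concrete example of what file_path produces: for a hypothetical post titled "古诗词(1)" containing an image http://lab.scrapyd.cn/usr/uploads/abc.jpg, the cleaning regex reduces the title to "古诗词", and the file is stored as 古诗词/abc.jpg under the IMAGES_STORE directory.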
6. Remember to enable the pipeline in settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.ImagePipeline': 300,
}
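ImagesPipeline will not download anything unless IMAGES_STORE is also set (and it requires the Pillow library to be installed). A minimal example, with an assumed local path:

IMAGES_STORE = './images'  # root folder for downloaded images; any writable path works

Then run the spider from the project directory:

scrapy crawl imageSpider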