Goal: scrape the images from the picture site http://hunter-its.com
1. Create the beauty project:

    scrapy startproject beauty
2. cd into the project directory and generate a spider from the basic template:

    cd beauty
    scrapy genspider hunter hunter-its.com
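genspider creates beauty/spiders/hunter.py from the basic template; the generated skeleton looks roughly like this (start_urls gets adjusted in step 4):

    # -*- coding: utf-8 -*-
    import scrapy


    class HunterSpider(scrapy.Spider):
        name = 'hunter'
        allowed_domains = ['hunter-its.com']
        start_urls = ['http://hunter-its.com/']

        def parse(self, response):
            pass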
3. Open the project in PyCharm and write the item first.

Open items.py and define the name and address fields:
    import scrapy


    class BeautyItem(scrapy.Item):
        name = scrapy.Field()     # the image title, taken from the img alt attribute
        address = scrapy.Field()  # the image URL, taken from the img src attribute
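A quick sanity check of how the item behaves: it works like a dict, except that only the declared fields may be assigned (the session below is illustrative):

    >>> item = BeautyItem()
    >>> item['name'] = 'sample'
    >>> item['name']
    'sample'
    >>> item['other'] = 1
    KeyError: 'BeautyItem does not support field: other'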
4. Write the spider file.

Import the BeautyItem class defined earlier, along with Request:
    from beauty.items import BeautyItem
    from scrapy.http import Request
Use XPath to grab all of the image nodes:
    pics = response.xpath('//div[@class="pic"]/ul/li')
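Before wiring the selector into the spider, it can be verified interactively with Scrapy's shell (the output shown is illustrative):

    scrapy shell 'http://hunter-its.com/m/1.html'
    >>> response.xpath('//div[@class="pic"]/ul/li')
    [<Selector xpath='//div[@class="pic"]/ul/li' data='<li>...'>, ...]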
Loop over the li nodes and pull each image's title and address from them:
    for pic in pics:
        item = BeautyItem()
        name = pic.xpath('./a/img/@alt').extract()[0]
        address = pic.xpath('./a/img/@src').extract()[0]
        item['name'] = name
        item['address'] = address
        yield item
Finally, yield requests for the remaining pages so the same parse callback handles them as well (each page re-yields these URLs, but Scrapy's duplicate filter drops the repeats):
    for i in range(2, 8):
        url = 'http://hunter-its.com/m/' + str(i) + '.html'
        print(url)
        yield Request(url, callback=self.parse)
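If the page count ever changes, a more robust variant is to follow the site's own next-page link instead of hard-coding the range. A minimal sketch, assuming such a link exists and carries class="next" (the selector is a guess; adjust it to the real markup):

    # hypothetical next-page follow; the XPath below is assumed, not taken from the site
    next_page = response.xpath('//a[@class="next"]/@href').extract_first()
    if next_page:
        yield Request(response.urljoin(next_page), callback=self.parse)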
Full code:
    # -*- coding: utf-8 -*-
    import scrapy
    from beauty.items import BeautyItem
    from scrapy.http import Request


    class HunterSpider(scrapy.Spider):
        name = 'hunter'
        allowed_domains = ['hunter-its.com']
        start_urls = ['http://hunter-its.com/m/1.html']

        def parse(self, response):
            # grab all of the image nodes
            pics = response.xpath('//div[@class="pic"]/ul/li')
            for pic in pics:
                item = BeautyItem()
                name = pic.xpath('./a/img/@alt').extract()[0]
                address = pic.xpath('./a/img/@src').extract()[0]
                item['name'] = name
                item['address'] = address
                yield item
            # queue the remaining pages; the dupefilter drops repeated URLs
            for i in range(2, 8):
                url = 'http://hunter-its.com/m/' + str(i) + '.html'
                print(url)
                yield Request(url, callback=self.parse)
5. Write the data-processing pipeline in pipelines.py, importing the requests module:
    import requests


    class BeautyPipeline(object):
        def process_item(self, item, spider):
            # impersonate a browser
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
            # fetch the image with a GET request
            r = requests.get(url=item['address'], headers=headers, timeout=4)
            print(item['address'])
            # write the image into a local directory (which must already exist)
            with open(r'/Users/vincentwen/Downloads/hunter/' + item['name'] + '.jpg', 'wb') as f:
                f.write(r.content)
            # return the item so any later pipelines still receive it
            return item
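An alternative worth knowing: Scrapy ships an ImagesPipeline that downloads through the crawler's own non-blocking downloader instead of a synchronous requests call. A minimal sketch, assuming Pillow is installed (ImagesPipeline requires it) and that we feed it our item's address field rather than the default image_urls field:

    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline


    class BeautyImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # hand each image URL to Scrapy's downloader queue
            yield Request(item['address'])

    # settings.py
    # ITEM_PIPELINES = {'beauty.pipelines.BeautyImagesPipeline': 100}
    # IMAGES_STORE = '/Users/vincentwen/Downloads/hunter'

By default the files land under IMAGES_STORE/full/, named by a hash of the URL, so override file_path() if you want to keep the alt-based names.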
6. Enable the pipeline in settings.py by adding it to ITEM_PIPELINES (the value is a priority from 0 to 1000; lower numbers run earlier):
    ITEM_PIPELINES = {
        'beauty.pipelines.BeautyPipeline': 100,
    }
7. Run the spider:
    scrapy crawl hunter
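To also keep a record of the scraped name/address pairs, the feed exporter can write them out in the same run:

    scrapy crawl hunter -o items.json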