This post shows how to use Scrapy to download the images on a web page. The page being scraped is shown below:
First, create the sina_trip project:
scrapy startproject sina_trip
In settings.py, add the following code:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_URLS_FIELD = 'url'
IMAGES_STORE = r'.'
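For reference, here is an annotated sketch of the same three settings. The two IMAGES_MIN_* lines are an optional addition that is not part of the original project, and the value 50 is only an assumed example threshold. Note that ImagesPipeline needs the Pillow library to be installed.

```python
# settings.py (sketch)

# Enable Scrapy's built-in image pipeline; the number is its order
# among the enabled pipelines (lower runs first).
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

# Name of the item field that holds the list of image URLs.
IMAGES_URLS_FIELD = 'url'

# Where downloaded images are stored. A relative path such as '.' is
# resolved against the directory in which "scrapy crawl" is run.
IMAGES_STORE = r'.'

# Assumption (not in the original project): skip tiny images such as
# icons and tracking pixels. 50 is an arbitrary example value.
IMAGES_MIN_WIDTH = 50
IMAGES_MIN_HEIGHT = 50
```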
The code in items.py is as follows:
import scrapy

class SinaTripItem(scrapy.Item):
    url = scrapy.Field()  # list of image URLs, read by ImagesPipeline
Then create a new file, sina_trip_spider.py, in the spiders folder, with the following code:
import scrapy
from sina_trip.items import SinaTripItem

class sinaTripSpider(scrapy.Spider):
    name = "sinaTripSpider"  # name of the spider, used by "scrapy crawl"
    start_urls = ["http://travel.sina.com.cn/"]  # start URL

    def parse(self, response):  # called with the downloaded start page
        # extract the src attribute of every <img> tag
        for src in response.xpath("//img/@src").extract():
            item = SinaTripItem()  # a fresh item for each image
            # urljoin turns protocol-relative ("//...") and relative
            # paths into absolute URLs, and leaves absolute URLs as-is
            item['url'] = [response.urljoin(src)]
            yield item
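The src attribute of an img tag can take several forms: protocol-relative ("//host/a.jpg"), absolute, relative to the page, or an inline "data:" URI. The following is a minimal sketch of how such values can be normalized with the standard library alone. The helper name normalize_image_urls and the sample src values are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin, urlparse


def normalize_image_urls(page_url, srcs):
    """Turn raw <img src> values into absolute http(s) URLs."""
    urls = []
    for src in srcs:
        # Resolve the src against the URL of the page it was found on.
        absolute = urljoin(page_url, src.strip())
        # Keep only downloadable URLs; this drops inline data: URIs.
        if urlparse(absolute).scheme in ("http", "https"):
            urls.append(absolute)
    return urls


page = "http://travel.sina.com.cn/"
# Hypothetical src values covering the four cases above.
samples = [
    "//n.sinaimg.cn/a.jpg",
    "http://example.com/b.png",
    "/c.gif",
    "data:image/png;base64,AAAA",
]
print(normalize_image_urls(page, samples))
```

Inside a Scrapy spider, response.urljoin(src) performs the same resolution against the URL of the response.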
Enter the following command in the terminal:
scrapy crawl sinaTripSpider
The output is as follows:
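After the run finishes, a new folder named full appears in the spiders folder (the directory the command was run from, since IMAGES_STORE is '.'). This is where the downloaded images are saved:
[Figure: the new full folder]
The images inside full look like this:
[Figure: images in the full folder]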
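The files in full do not keep their original names. To my knowledge, the default ImagesPipeline names each file after the SHA-1 hash of its URL and saves it as a JPEG. The sketch below reproduces that naming scheme with the standard library so that a file on disk can be matched to its source URL. The function name default_image_path is hypothetical, the sample URL is made up, and the exact scheme may differ between Scrapy versions.

```python
import hashlib


def default_image_path(image_url):
    """Mimic the default naming scheme: full/<SHA-1 of the URL>.jpg."""
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(digest)


# Hypothetical image URL, used only to show the shape of the result.
path = default_image_path("http://n.sinaimg.cn/a.jpg")
print(path)
```

The printed path is relative to IMAGES_STORE.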
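Bingo, our image spider works too~~
The code for this post is on GitHub, and everyone is welcome to take a look: https://github.com/jclian91/scrapy-for-sina_trip-
That's all for this post. Criticism and discussion are welcome~~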