Recently I have been learning Scrapy, a Python crawler framework. In this post I use Scrapy to re-implement a crawler for the music site Luoo (落网) that I wrote earlier.
For a detailed analysis of the target site, see my earlier post: python爬虫-爬取高逼格音乐网站《落网》.
First, open a command prompt (DOS window) and create a Scrapy project in a suitable directory, as shown in the figure below:
As shown above, a new Scrapy project has been created. Next, create a new spider file under the spiders directory; the resulting structure is as follows:
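The exact commands are not visible in the original screenshots; here is a minimal sketch, assuming the project is named luowang (matching the LuowangItem and LuowangPipeline class names used below) and that luo_spider.py is created by hand:

scrapy startproject luowang    # generate the project skeleton

After adding the spider file, the layout looks roughly like this:

luowang/
    scrapy.cfg                 # deploy configuration
    luowang/
        items.py               # item definitions
        pipelines.py           # item pipelines
        settings.py            # project settings
        spiders/
            luo_spider.py      # the spider, created by hand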
Next, let's look at the concrete implementation.
Implementation
items.py (scrapy)
The items module defines the crawl targets, i.e. the structured data to be extracted from the unstructured source.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class LuowangItem(scrapy.Item):
    index = scrapy.Field()             # issue (volume) number
    songName = scrapy.Field()          # song title
    songDownloadURL = scrapy.Field()   # song download URL
luo_spider.py (scrapy)
The spiders module is the main body of the crawler; this is where the actual crawling logic is implemented.
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from ..items import LuowangItem
import re


class LuoSpider(Spider):
    name = 'luo'                                      # spider name
    allowed_domains = ['luoo.net']
    start_urls = ['http://www.luoo.net/music/folk']   # initial URL to crawl

    def parse(self, response):
        selector = Selector(response)
        vol_list = selector.xpath('//div[@class="vol-list"]/div')
        pattern = re.compile('[0-9]+')
        for vol in vol_list:
            item = LuowangItem()
            index_url = vol.xpath('a/@href').extract()[0]
            # issue (volume) number, with leading zeros stripped
            index = re.search(pattern, vol.xpath('div/a/text()').extract()[0]).group().lstrip('0')
            item['index'] = index
            # follow each issue page and collect the songs it contains
            yield Request(index_url, meta={'item': item}, callback=self.get_songInfos)
        # follow the link to the next page of issues, if any
        next_url = selector.xpath('//div[@class="paginator"]/a[@class="next"]/@href').extract()
        if next_url:
            next_url = next_url[0]
            yield Request(next_url, callback=self.parse)

    def get_songInfos(self, response):
        item = response.meta['item']    # item carrying the issue number
        radio = item['index']
        songinfos = response.xpath('//*[@id="luooPlayerPlaylist"]/ul/li')
        for songinfo in songinfos:
            # entries look like "01. song title": split into track number and title
            songName = songinfo.xpath('div/a/text()').extract()[0].split('.')[1].lstrip()
            number = songinfo.xpath('div/a/text()').extract()[0].split('.')[0].lstrip()
            # build the download URL from the issue number and track number
            songDownloadURL = 'http://mp3-cdn2.luoo.net/low/luoo/radio' + str(radio) + '/' + str(number) + '.mp3'
            item['songName'] = songName
            item['songDownloadURL'] = songDownloadURL
            yield item
pipelines.py (scrapy)
The pipeline processes each scraped item; here it downloads the song file to disk.
# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import urllib2


class LuowangPipeline(object):
    def process_item(self, item, spider):
        songName = item['songName']
        songDownloadURL = item['songDownloadURL']
        try:
            data = urllib2.urlopen(songDownloadURL).read()
        except urllib2.URLError:
            # the download link is broken; skip this song and move on to the next one
            print("###### link not found, skipping to the next song ########")
            return item
        # save the song to a local directory (adjust the path for your machine)
        with open('D:\\test\\song\\%s.mp3' % songName, 'wb') as f:
            f.write(data)
        return item
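Note that the pipeline only runs if it is registered in the project's settings.py, as the generated comment above reminds us. A minimal sketch, assuming the project package is named luowang (adjust the dotted path to your own project):

# settings.py (excerpt)
# the value (0-1000) controls the order in which pipelines run
ITEM_PIPELINES = {
    'luowang.pipelines.LuowangPipeline': 300,
}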
That completes the crawler for Luoo. Open a command prompt, change into the project directory (the one containing scrapy.cfg), and run
scrapy crawl luo
The result is as follows:
If this post helped you, please give it a like. Thank you.