Scraping NetEase Cloud Music, Updated: Crawling Multiple Playlists

Author: 这个太难了 | Published 2018-08-10 14:44

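The earlier version only scraped the songs in a single playlist. This update adds support for crawling multiple playlists. Clicking through several playlists shows that NetEase Cloud Music also identifies each playlist by an id value. The starting address (after clicking the playlist tab) is url = https://music.163.com/discover/playlist. Note: the "#" that the browser shows in the address must be left out, because everything after a "#" is a URL fragment and is never sent to the server. The playlist listing page looks like this:
[Screenshot: playlist listing page]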

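In the browser's Network panel, the ids of the playlists on the first page turn up in the response whose Name is playlist. Searching that response for a playlist id shows that each id appears in three places. Which one should I parse?
[Screenshot: the three places where a playlist id appears]
Take a look at a playlist URL:
url = https://music.163.com/playlist?id=2264645756
Each playlist is located by its id, so the id is all we need to build the URL for each playlist. In the browser that URL appears as "https://music.163.com/#/playlist?id=<playlist id>"; the request itself uses the same address without the "#". Since I parse with bs4, the third occurrence is the obvious choice: I can pull out the content in the red box directly (an href such as /playlist?id=2264645756) and build the playlist URL as "https://music.163.com" + that value.
The parsing code for this step: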
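def getplayList(html):  # parse each playlist's href (which carries the id) and its name
    playlists = []
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.select('.dec a')  # "links" avoids shadowing the built-in id()
    for link in links:
        playlist = []
        playlist.append(link['href'])   # the playlist href, e.g. /playlist?id=2264645756
        playlist.append(link['title'])  # the playlist name
        playlists.append(playlist)      # collect the [href, name] pairs in one list
    return playlists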
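
The href values collected by getplayList still carry the "/playlist?id=" prefix. If the bare id is ever needed on its own, it can be pulled out with the standard library instead of by string splitting. This is only a minimal sketch; the helper names playlist_id and playlist_url are made up for this example and are not part of the original script:

```python
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = 'https://music.163.com'


def playlist_id(href):
    """Return the value of the id query parameter in a playlist href."""
    query = parse_qs(urlparse(href).query)
    return query['id'][0]


def playlist_url(href):
    """Join a relative playlist href onto the site root (no '#' in the result)."""
    return urljoin(BASE_URL, href)


href = '/playlist?id=2264645756'
print(playlist_id(href))
print(playlist_url(href))
```

Plain concatenation, as used in the script, gives the same URL; urljoin just copes with a missing or doubled slash.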

playlists has the following form (only part of it is shown). I don't use the playlist names yet, because I haven't decided what to do with them.

[['/playlist?id=2264645756', '想 一 个 人 在 黄 昏 后'], ['/playlist?id=2355333774', '你不是我的诗\xa0正如我不是你的梦'], ['/playlist?id=363692915', '「 Indie 」 那天午后打了个盹儿'], ['/playlist?id=2347578332', '西音东渐:日式西洋古典美学'], ['/playlist?id=2331853291', '爱是紫色的折叠梦境,曼妙又绮丽'], ['/playlist?id=2349865512', '【情话说唱】我的歌里写的是你'], ['/playlist?id=2353471182', '攒了一大堆好听的歌想和你一起听'], ['/playlist?id=2352321741', 'Bass Institute|Bass House'], ['/playlist?id=2298138241', '听了几个故事,正好讲给你玩'], ['/playlist?id=2311431519', '你的名字我的心事〈情歌说唱〉'], ['/playlist?id=2299157419', '活的像风 没有归宿 却也够酷'], ['/playlist?id=2302705693', '『古风』我自问酒不问仙 半世逍遥半世癫']] 
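
One possible use for the playlist names, offered only as a suggestion: save each playlist's songs in a folder named after it. Titles like the ones above contain characters such as "|" and a non-breaking space (\xa0), so they would need cleaning before they can serve as folder names. The helper safe_folder_name below is hypothetical and not part of the original script:

```python
import re


def safe_folder_name(title):
    """Turn a playlist title into a name that is safe to use as a folder."""
    # Non-breaking spaces (\xa0) show up in some scraped titles
    title = title.replace('\xa0', ' ')
    # Replace the characters that Windows forbids in file and folder names
    title = re.sub(r'[\\/:*?"<>|]', '_', title)
    # Windows also rejects names that end in a space or a dot
    return title.strip().rstrip(' .')


playlists = [
    ['/playlist?id=2352321741', 'Bass Institute|Bass House'],
    ['/playlist?id=2355333774', '你不是我的诗\xa0正如我不是你的梦'],
]
for href, title in playlists:
    print(href, '->', safe_folder_name(title))
```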

With that, the playlist ids are available and the playlist URLs can be built as follows (only the code for this step is shown):

start_url_list = []
playurl = 'https://music.163.com/discover/playlist'  # URL of the playlist listing (first page)
headers = {
    'User-Agent': 'Mozilla/5.0'
}
playhtml = getHtml(playurl, headers=headers)  # fetch the playlist listing page
playidlist = getplayList(playhtml)  # parse out each playlist's href and name
for u in playidlist:  # build each playlist URL, used later to download its songs
    start_url_list.append('https://music.163.com' + u[0])

start_url_list then looks like this:

['https://music.163.com/playlist?id=2264645756', 'https://music.163.com/playlist?id=2355333774', 'https://music.163.com/playlist?id=363692915', 'https://music.163.com/playlist?id=2347578332', 'https://music.163.com/playlist?id=2331853291', 'https://music.163.com/playlist?id=2349865512', 'https://music.163.com/playlist?id=2353471182', 'https://music.163.com/playlist?id=2352321741', 'https://music.163.com/playlist?id=2298138241', 'https://music.163.com/playlist?id=2311431519', 'https://music.163.com/playlist?id=2299157419', 'https://music.163.com/playlist?id=2302705693', 'https://music.163.com/playlist?id=2300523945', 'https://music.163.com/playlist?id=2301094981', 'https://music.163.com/playlist?id=2301267346', 'https://music.163.com/playlist?id=2297457355', 'https://music.163.com/playlist?id=2290267281', 'https://music.163.com/playlist?id=2291115145', 'https://music.163.com/playlist?id=2301310816', 'https://music.163.com/playlist?id=2290797610', 'https://music.163.com/playlist?id=2286380125', 'https://music.163.com/playlist?id=2283281232', 'https://music.163.com/playlist?id=2278767768', 'https://music.163.com/playlist?id=2277307819', 'https://music.163.com/playlist?id=2343741251', 'https://music.163.com/playlist?id=2274985772', 'https://music.163.com/playlist?id=2335662972', 'https://music.163.com/playlist?id=2274346473', 'https://music.163.com/playlist?id=2336165805', 'https://music.163.com/playlist?id=2274803562', 'https://music.163.com/playlist?id=2339316534', 'https://music.163.com/playlist?id=2272295927', 'https://music.163.com/playlist?id=2336073422', 'https://music.163.com/playlist?id=2286925070', 'https://music.163.com/playlist?id=2341435171']
Process finished with exit code 0

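Each item is the URL of one playlist. The next step is to visit each playlist URL and scrape the songs in it; as before, each song's id has to be parsed out of the playlist page first. From there on the steps are the same as in the first article, 爬取网易云部分音乐 ("Scraping Some NetEase Cloud Music Songs").
Full code: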

import os

import requests
from bs4 import BeautifulSoup


def getHtml(url, headers):
    """Fetch a page and return its text, or '' if the request fails."""
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        print('Fetch failed')
        return ''


def htmlParser(html):
    """Return the song ids listed on a playlist page."""
    try:
        id_list = []
        soup = BeautifulSoup(html, 'html.parser')
        for a in soup.select('.f-hide li a'):
            id_list.append(a['href'].split('=')[-1])
        return id_list
    except Exception:
        print('Failed to get the song ids')
        return []  # an empty list keeps the caller's loops working


def get_name_singer(html):
    """Return [song name, singer] parsed from a song page."""
    name_sig_list = []
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.select('.f-ff2')
    singer = soup.select('p.des.s-fc4 span a')
    name_sig_list.append(name[0].text)
    name_sig_list.append(singer[0].text)
    return name_sig_list


def getMusic(lst, nslst):
    """Download every song id in lst and save it as 'singer,song.mp3'."""
    os.makedirs('music', exist_ok=True)  # open() below fails if the folder is missing
    for i, song_id in enumerate(lst):
        url = 'http://music.163.com/song/media/outer/url?id=' + song_id + '.mp3'
        try:
            r = requests.get(url)
            filename = 'music/' + nslst[i][1].strip() + ',' + nslst[i][0].strip() + '.mp3'
            with open(filename, 'wb') as f:  # the with block closes the file itself
                f.write(r.content)
            print('Song {} downloaded successfully'.format(i + 1))
        except (requests.RequestException, OSError, IndexError):
            print('Song {} failed to download'.format(i + 1))


def getplayList(html):
    """Return [href, name] pairs for the playlists on the listing page."""
    playlists = []
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.select('.dec a'):
        playlists.append([link['href'], link['title']])
    return playlists


def main():
    start_url_list = []
    # start_url = 'https://music.163.com/playlist?id=2153101541'
    playurl = 'https://music.163.com/discover/playlist'  # URL of the playlist listing (first page)
    headers = {
        'User-Agent': 'Mozilla/5.0'
    }
    playhtml = getHtml(playurl, headers=headers)  # fetch the playlist listing page
    playidlist = getplayList(playhtml)  # parse out each playlist's href and name
    for u in playidlist:  # build each playlist URL, used to download its songs
        start_url_list.append('https://music.163.com' + u[0])
    print(start_url_list)
    for start_url in start_url_list:
        html = getHtml(start_url, headers=headers)
        idlist = htmlParser(html)
        # Start a fresh list for every playlist; a list shared across playlists
        # would no longer line up with idlist from the second playlist onward
        name_singer_list = []
        for song_id in idlist:
            song_html = getHtml('https://music.163.com/song?id=' + song_id, headers)
            name_singer_list.append(get_name_singer(song_html))
        # print(name_singer_list)
        getMusic(idlist, name_singer_list)


if __name__ == '__main__':
    main()

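Output:

F:\New_Anaconda\python.exe E:/Spider_Folder/网易云音乐下载.py
Song 1 downloaded successfully
Song 2 downloaded successfully
Song 3 downloaded successfully
Song 4 downloaded successfully
Song 5 downloaded successfully
Song 6 downloaded successfully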

Process finished with exit code -1
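I was on mobile data and the connection was painfully slow, so I only downloaded a few songs. Screenshot of the saved files:
[Screenshot: saved files]

Since the run did not finish, there may still be errors further along; I will fix them as they turn up.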
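
Because a run like this can be cut short, it may help to skip songs that are already on disk when the script is started again. Below is a rough sketch of that idea, assuming the 'singer,song.mp3' naming used by getMusic. The function pending_songs is hypothetical and not part of the original script, and the ids and names in the usage lines are placeholders rather than real data:

```python
import os


def pending_songs(id_list, name_singer_list, folder='music'):
    """Return (song_id, path) pairs for songs not yet saved in folder."""
    pending = []
    for song_id, (name, singer) in zip(id_list, name_singer_list):
        # Same 'singer,song.mp3' pattern that getMusic writes
        path = os.path.join(folder, singer.strip() + ',' + name.strip() + '.mp3')
        # Treat missing or empty files as not downloaded yet
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            pending.append((song_id, path))
    return pending


# Placeholder data, only to show the call
ids = ['1', '2']
names = [['Song A', 'Singer A'], ['Song B', 'Singer B']]
for song_id, path in pending_songs(ids, names):
    print(song_id, path)
```

Only the pairs returned here would then need to be passed on to the download step.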
