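Target URL: https://music.163.com/#/discover/artist
A quick look at the page:
The "recommended artists" block duplicates the artists listed further down, so we simply ignore it.
The categories below it (Chinese, Western, and so on) correspond to the id parameter in the URL, and the initial letters A, B, C, D... correspond to the initial parameter, given as the letter's ASCII code (65 for A through 90 for Z).
That makes it easy to enumerate every URL: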
# Category ids taken from the artist page.
idList = [1001, 1002, 1003, 2001, 2002, 2003, 6001, 6002, 6003, 7001, 7002, 7003, 4001, 4002, 4003]
# ASCII codes for the initials 'A' (65) through 'Z' (90).
initialList = [65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
for i in idList:
    for j in initialList:
        url = 'http://music.163.com/discover/artist/cat?id=' + str(i) + '&initial=' + str(j)
        print(url)
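The same enumeration can be sketched a little more compactly. This is only an illustration, not the code from the repository: it derives the initial codes from string.ascii_uppercase instead of typing them out and builds the query string with urllib.parse.urlencode. The names BASE_URL, CATEGORY_IDS, INITIALS and build_urls are made up for this example.

```python
import string
from urllib.parse import urlencode

# Same endpoint as in the loop above.
BASE_URL = 'http://music.163.com/discover/artist/cat'

# Category ids copied from idList above.
CATEGORY_IDS = [1001, 1002, 1003, 2001, 2002, 2003, 6001, 6002, 6003,
                7001, 7002, 7003, 4001, 4002, 4003]

# ord('A') == 65 ... ord('Z') == 90, i.e. the same values as initialList.
INITIALS = [ord(letter) for letter in string.ascii_uppercase]


def build_urls(category_ids=CATEGORY_IDS, initials=INITIALS):
    """Return one listing URL per (category id, initial) pair."""
    return [
        BASE_URL + '?' + urlencode({'id': cat_id, 'initial': initial})
        for cat_id in category_ids
        for initial in initials
    ]


if __name__ == '__main__':
    urls = build_urls()
    print(len(urls))   # 15 categories x 26 initials
    print(urls[0])
```

Returning a list rather than printing makes it straightforward to hand each URL to the scraping function later.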
Next we process each URL on its own and collect every artist's name and the matching id.
The artist names are easy to locate: each one is an a link with class='nm nm-icn f-thide s-fc0', so:
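import requests
from bs4 import BeautifulSoup

# Output file: one "id----name" line per artist.
f = open('163.txt', 'w+', encoding='utf-8')

def get_artists(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    # verify=False skips TLS certificate verification.
    r = requests.get(url, headers=headers, verify=False)
    soup = BeautifulSoup(r.text, 'lxml')
    # Every artist name is an <a> tag carrying this exact class string.
    for artist in soup.find_all('a', attrs={'class': 'nm nm-icn f-thide s-fc0'}):
        # artist.string is None when the tag has nested markup; the write below then fails and is logged.
        artist_name = artist.string
        artist_id = artist['href'].replace('/artist?id=', '').strip()
        try:
            f.write(artist_id + '----' + artist_name + '\n')
        except Exception as msg:
            print(msg)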
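The id is pulled out of the href with a plain string replace, which relies on the link always having the exact form /artist?id=... As a hedged alternative, the query string can be parsed instead. The helper name extract_artist_id and the sample id 12345 are made up for this sketch; only the /artist?id= link shape comes from the code above.

```python
from urllib.parse import urlparse, parse_qs


def extract_artist_id(href):
    """Return the value of the `id` query parameter in href, or None if it is absent."""
    # strip() mirrors the original code, which trims whitespace around the value.
    query = parse_qs(urlparse(href.strip()).query)
    values = query.get('id')
    return values[0] if values else None


if __name__ == '__main__':
    print(extract_artist_id(' /artist?id=12345 '))
    print(extract_artist_id('/artist'))
```

Getting None back for a link without an id lets the caller skip that entry instead of writing a malformed line.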
Result: each line of 163.txt holds one artist in the form id----name.
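To use the saved data later, the file can be read back into (id, name) pairs. The sketch below assumes every line has the id----name form written by get_artists; parse_artist_line and load_artists are hypothetical helper names, and the sample values in the demo are invented.

```python
def parse_artist_line(line):
    """Split one 'id----name' line into (artist_id, artist_name)."""
    # partition() splits at the first '----' only, so a name that itself
    # contains '----' stays intact.
    artist_id, _, artist_name = line.rstrip('\n').partition('----')
    return artist_id, artist_name


def load_artists(path='163.txt'):
    """Read the output file and return a list of (artist_id, artist_name) tuples."""
    with open(path, encoding='utf-8') as fh:
        return [parse_artist_line(line) for line in fh if line.strip()]


if __name__ == '__main__':
    print(parse_artist_line('12345----Some Artist\n'))
```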
Full code: https://github.com/Liangjianghao/everyDay_spider.git cloud_music