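Approach and some nasty pitfalls
This project took me three days of back-and-forth before I understood all of it, and then at the very end I discovered... well, more on that later.
The page I want to scrape is the page listing the music I like. Judging by how the page loads, it is clearly loaded asynchronously, so finding the XHR below is not hard: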
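(Figure: 网易云detial.jpg)
Analyzing this request: the csrf_token and the cookie can simply be copied and pasted; to this day I do not know how the token is computed. The csrf_token value also appears in the cookie, although in my tests a mismatch between the two caused no problems. In addition, the params field in the Form Data is actually computed from this token as well, which I explain in detail later.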
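Having found this XHR, we need to work out how the two values params and encSecKey are computed. Under the JS tab, searching the files one by one from top to bottom, you quickly find that both are computed in core.js (I remember seeing a way to locate the responsible JS file from the Initiator of the XHR, but I could not find it again). The next step is to debug this JS file; see the debugging section below for how. The process goes roughly like this: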
=> Working backwards
e9f.data=k9b.cE1x({params:bQC7v.encText,encSecKey:bQC7v.encSecKey})
=> bQC7v is computed by the following call
var bQC7v=window.asrsea(JSON.stringify(j9a), bui2x(["流泪","强"]), bui2x(Oa3x.md), bui2x(["爱心","女孩","惊恐","大笑"]));
=> Here bui2x simply converts a list of Chinese words into a string
var bui2x=function(cqS5X){
    var m0x=[];
    k9b.bd0x(cqS5X,function(cqQ4U){
        m0x.push(Oa3x.emj[cqQ4U])
    });
    return m0x.join("")
};
Here is how bui2x works in detail, taking the second argument as an example:
=> bui2x(["流泪","强"])
Look up the dictionary Oa3x.emj: "流泪"->"01000", "强"->"1"
So bui2x(["流泪","强"]) -> "010001"
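The lookup-and-join that bui2x performs can be sketched in a few lines of Python. The dictionary below is a hypothetical stand-in for Oa3x.emj that contains only the two entries quoted above; the real table in core.js has many more.

```python
# Hypothetical excerpt of Oa3x.emj: only the two entries quoted in the text.
EMJ = {"流泪": "01000", "强": "1"}


def bui2x(words, table=EMJ):
    """Look up each word in the table and concatenate the results."""
    return "".join(table[word] for word in words)


print(bui2x(["流泪", "强"]))  # 010001
```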
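Using this method we can compute the last three arguments of window.asrsea, but what is the first one? There is actually a simpler way: print all four arguments to the console while debugging the JS. The analysis is still necessary, though, because it tells us that the last three arguments are fixed values that can be pasted straight into our code. That gives us: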
JSON.stringify(j9a)
=> "{"csrf_token":"e4515cd84baa025c38432652ec58839c"}"
bui2x(["流泪", "强"])
=> "010001"
bui2x(Oa3x.md)
=> "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
bui2x(["爱心", "女孩", "惊恐", "大笑"])
=> "0CoJUm6Qyw8W8jud"
Next, look at the window.asrsea function itself. The corresponding code is:
!function() {
    // Returns a random string of length a (16 in this case), made up of letters and digits
    function a(a) {
        var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
        for (d = 0; a > d; d += 1)
            e = Math.random() * b.length, // [0,1) * len(b)
            e = Math.floor(e), // so e ends up as the index of one character in b
            c += b.charAt(e);
        return c
    }
    // AES encryption; the IV is "0102030405060708" and the mode is CBC
    function b(a, b) {
        var c = CryptoJS.enc.Utf8.parse(b)
          , d = CryptoJS.enc.Utf8.parse("0102030405060708")
          , e = CryptoJS.enc.Utf8.parse(a)
          , f = CryptoJS.AES.encrypt(e, c, { // e is the text, c is the key
            iv: d,
            mode: CryptoJS.mode.CBC
        });
        return f.toString()
    }
    // RSA encryption
    function c(a, b, c) {
        var d, e;
        return setMaxDigits(131),
        d = new RSAKeyPair(b,"",c),
        e = encryptedString(d, a)
    }
    function d(d, e, f, g) {
        var h = {}
          , i = a(16);
        return h.encText = b(d, g), // first AES pass
        h.encText = b(h.encText, i), // second AES pass
        h.encSecKey = c(i, e, f), // RSA encryption
        h
    }
    // Not relevant here
    function e(a, b, d, e) {
        var f = {};
        return f.encText = c(a + e, b, d),
        f
    }
    window.asrsea = d, // d's return value h is the bQC7v above; it carries the two parameters
    window.ecnonasr = e
}();
With the analysis this far, all that remains is to port function d to Python; then we can compute the values we need, build the request, and start scraping.
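As a rough illustration of what such a port could look like, here is a minimal sketch of the two simpler pieces, a (the random string) and c (the RSA step). The names random_string and rsa_encrypt are made up for this example. The RSA part rests on an assumption that is not verified against core.js here: that encryptedString behaves like unpadded RSA over the reversed input, with the result zero-filled to 256 hex digits, which is how community ports commonly model it. The AES step b is left out because it needs a third-party library.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # same 62 characters as in a()


def random_string(length=16):
    """Port of a(): pick `length` random characters from ALPHABET."""
    return "".join(random.choice(ALPHABET) for _ in range(length))


def rsa_encrypt(text, exponent_hex, modulus_hex):
    """Sketch of c(), ASSUMING unpadded RSA over the reversed text."""
    # Assumption: the JS library treats the first character as the least
    # significant byte, which is equivalent to reversing the string.
    message = int.from_bytes(text[::-1].encode("utf-8"), "big")
    cipher = pow(message, int(exponent_hex, 16), int(modulus_hex, 16))
    # Assumption: the result is hex, left-padded with zeros to 256 digits.
    return format(cipher, "x").zfill(256)


# Toy numbers only, to show the call shape: 97**3 % 1001 == 762 == 0x2fa
toy = rsa_encrypt("a", "3", "3e9")
print(len(toy), toy[-3:])
```

With the real values, the call would take the 16-character random string, the exponent "010001", and the long modulus string printed by bui2x(Oa3x.md). Since the random string is the only input that varies, reusing the same one always produces the same encSecKey.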
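I have to complain about this JS file here: it uses a huge number of baffling variable names and countless anonymous functions. The latter can at least be explained as a way to make the code more efficient, but the variable naming I really cannot understand. I wonder whether this is one of their anti-scraping measures?
And yet...
And yet...
And yet...
After I had finally figured all this out, and then spent a lot of time installing the pycrypto library on Windows and writing the AES encryption code with it, I discovered that none of this is actually needed! When building the request you need neither the csrf_token nor the cookie, only the form data. In that form data, encSecKey is effectively a constant for us (the values look different each time only because of the 16-character random string, and when simulating the request we can reuse the same random string), while encText is the result of AES-encrypting the csrf_token payload twice. But since the csrf_token is not needed at all, I simply copied and pasted one pair of these values, and it just worked. I was worried that the two values might expire, so I ran the code again a day later, and it still worked.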
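In my tests, a plain POST request carrying this form data is enough for the server to return the JSON you want. As for each song's hot comments, a quick look shows that they also come back as JSON, from this request:
http://music.163.com/weapi/v1/resource/comments/R_SO_4_512365425
The first part is fixed and the digits at the end are the song ID, which can be found in the previous JSON response. The JSON returned by this request contains the hot comments as well as the regular comments.
Well, this is damn awkward!
The code is at the end.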
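To make the request shape concrete, here is a minimal sketch that builds (but does not send) such a POST using only the standard library. The helper names comments_url and build_post are invented for this example, and the params and encSecKey values are placeholders for the strings copied from the browser. Whether the server still accepts such a request is not something this sketch can guarantee.

```python
from urllib.parse import urlencode
from urllib.request import Request

COMMENTS_BASE = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_"


def comments_url(song_id):
    """Fixed prefix followed by the numeric song ID."""
    return "{0}{1:d}".format(COMMENTS_BASE, song_id)


def build_post(url, formdata, user_agent="Mozilla/5.0"):
    """Build a form-encoded POST request; nothing is sent here."""
    body = urlencode(formdata).encode("ascii")
    return Request(url, data=body, headers={"User-Agent": user_agent})


# Placeholders: paste the values copied from the browser's Form Data.
formdata = {"params": "<copied params>", "encSecKey": "<copied encSecKey>"}
request = build_post(comments_url(512365425), formdata)
print(request.get_method(), request.full_url)
```

Passing the request object to urllib.request.urlopen would actually send it; the response body would then be the JSON described above.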
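Some skills that may come in handy
I did a lot of unnecessary work, but learning a bit more never hurts, so I am recording it here.
Debugging JS with Chrome
Some people recommend Fiddler for debugging JS, because it can replace a JS file with a local copy that you can then modify and debug. Fiddler also supports remote debugging, for example capturing the traffic of a mobile app, which is very useful; so far I have not found a substitute for that.
For the first use case, though, I personally prefer Chrome's debugger. If you have written C code: replacing JS files with Fiddler is like adding print statements to locate a problem, whereas debugging in Chrome is like using gdb.
(Figure: 网易云debug.jpg)
The debugging interface is shown in the figure. Some notes: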
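- The file lives under the Sources panel; you can right-click core.js from above and choose "Open in Sources panel". Code can be pretty-printed by clicking the {} button at position 4; in this screenshot the code has already been formatted (the file name carries a "formatted" suffix), so the button is gone.
- This one is important: be sure to unblackbox the script, otherwise breakpoints set in this file will be skipped.
- A breakpoint can be set on any line by clicking its line number on the left; if a line contains additional function calls, those can be set as breakpoints too.
- If the code has not been formatted, a {} button appears here; clicking it pretty-prints the code automatically.
- The stepping buttons: step over, step into, step out, and so on.
- The variables to watch; these can also be printed directly in the Console tab.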
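Installing pycrypto 2.6.1 on Python 3.6
First, the command to install the library is:
pip install pycrypto
On Windows 10 with Python 3.6, however, this runs into trouble. The problem I hit came from inttypes.h. I did not take a screenshot, but it looked roughly like this: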
running install
running build
running build_py
running build_ext
warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath.
building ‘Crypto.Random.OSRNG.winrandom‘ extension
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Isrc/ -Isrc/inc-msvc/ -IC:\Python36\include
-IC:\Python36\include winrand.c
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(26): error C2061: syntax error: identifier ‘intmax_t‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(27): error C2061: syntax error: identifier ‘rem‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(27): error C2059: syntax error: ‘;‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(28): error C2059: syntax error: ‘}‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(30): error C2061: syntax error: identifier ‘imaxdiv_t‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(30): error C2059: syntax error: ‘;‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(40): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2146: syntax error: missing ‘)‘ before identifier ‘_Number‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2061: syntax error: identifier ‘_Number‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2059: syntax error: ‘;‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(42): error C2059: syntax error: ‘)‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(45): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2146: syntax error: missing ‘)‘ before identifier ‘_Numerator‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2061: syntax error: identifier ‘_Numerator‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2059: syntax error: ‘;‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2059: syntax error: ‘,‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(48): error C2059: syntax error: ‘)‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(50): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(56): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(63): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(69): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(76): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(82): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(89): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(95): error C2143: syntax error: missing ‘{‘ before ‘__cdecl‘
error: command ‘C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe‘ failed with exit status 2
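Installing Microsoft Visual Studio Community fixes this. During installation you only need to select the C++ compiler and nothing else, which comes to roughly 5 to 6 GB. If you already develop with Visual Studio, that is of course a different story.
After that the environment variables need to be configured, as shown in the figure; I used PowerShell:
(Figure: 网易云安装pycrypto.jpg)
This article is a very useful reference: Python踩坑之路-Python-3.6 安装pycrypto 2.6.1各种疑难杂症及解决方案.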
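AES encryption with pycrypto
The documentation is here: Python Cryptography Toolkit (pycrypto). For CBC-mode AES encryption like the one discussed in this article, the code looks like this: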
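from Crypto.Cipher import AES
from Crypto import Random

def pad(plain):
    # Pad with "0" characters up to a multiple of the 16-byte AES block size
    padlen = 16 - len(plain) % 16
    return plain + "0" * padlen

# Demo key (16 bytes, AES-128). Note that in core.js this string is the IV, not the key.
key = b'0102030405060708'
iv = Random.new().read(AES.block_size)  # random 16-byte IV
cipher = AES.new(key, AES.MODE_CBC, iv)
plain = "The message that needs to be encrypted"
# Pass bytes explicitly instead of relying on implicit str conversion
plain = pad(plain).encode('utf-8')
msg = cipher.encrypt(plain)
print(msg)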
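One detail matters if this is used to mimic function b from core.js: CryptoJS pads with PKCS#7 by default, not with "0" characters, so a faithful port needs matching padding. A minimal pure-Python sketch of PKCS#7 padding follows; the function names are made up for this example, and the padded bytes would then be passed to cipher.encrypt.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data, block_size=BLOCK_SIZE):
    """Append N bytes of value N so the length is a multiple of block_size."""
    pad_len = block_size - len(data) % block_size
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(data):
    """Strip the padding added by pkcs7_pad."""
    pad_len = data[-1]
    return data[:-pad_len]


padded = pkcs7_pad(b"The message that needs to be encrypted")
print(len(padded), padded[-1])
```

Unlike "0" padding, this scheme is unambiguous to remove, because the last byte always states how many bytes were added, even when the input is already a multiple of the block size.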
Displaying results with the prettytable library
See the code in DisplayResult.py.
(Figure: 网易云效果.jpg)
Code
Sqlite3api.py
For configure.py, see my earlier post: 爬取糗事百科的内容和图片并展示.
import sqlite3
import os
import Configure

conn = None

def sqlite3_init():
    # Open the database named in Configure.DB_NAME
    global conn
    try:
        conn = sqlite3.connect(Configure.DB_NAME)
    except Exception as e:
        print('sqlite3 init fail.')
        print(e)

def sqlite3_execute(sql, args=None):
    # Run one statement; return the fetched rows, or None if there are none
    global conn
    data = None
    try:
        cur = conn.cursor()
        if args:
            cur.execute(sql, args)
        else:
            cur.execute(sql)
        data = cur.fetchall()
    except Exception as e:
        print(e, "[SQL]:" + sql.strip())
        conn.rollback()
    conn.commit()
    if data:
        return data
    return None

def sqlite3_close():
    global conn
    conn.close()

def unitest():
    # Simple self-test: create a table, insert four rows, count them, drop the table
    sqlite3_init()
    sqlite3_execute("CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)")
    sqlite3_execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
    sqlite3_execute("INSERT INTO stocks VALUES ('2006-03-28', 'BUY', 'IBM', 1000, 45.00)")
    sqlite3_execute("INSERT INTO stocks VALUES ('2006-04-05', 'BUY', 'MSFT', 1000, 72.00)")
    sqlite3_execute("INSERT INTO stocks VALUES ('2006-04-06', 'SELL', 'IBM', 500, 53.00)")
    assert 4 == sqlite3_execute("SELECT count(*) FROM stocks")[0][0]
    sqlite3_execute("DROP TABLE stocks")
    sqlite3_close()

if __name__ == '__main__':
    unitest()
NeteaseMusic.py
For configure.py, see my earlier post: 爬取糗事百科的内容和图片并展示.
import requests
import re
import json
import os
import time
from random import choice
from prettytable import PrettyTable
import Configure
from Sqlite3api import *
import DisplayResult as dr
url = "http://music.163.com/weapi/v3/playlist/detail"#?csrf_token=c471024dc44337a5c6a627ba90b47c6e"
header = {'user-agent': choice(Configure.FakeUserAgents)}
#payload = {'csrf_token':'c471024dc44337a5c6a627ba90b47c6e'}
formdata = {
'params':'jBULGYRTVRVfR0k15G9wjIo43oTqF27KLw89P8drW6x1Igeb1tOuXVzTtI6VJi8n77nyaQN/es61paePKs0HDe3ILYLzH41xTWAzBGdnoS9k9onAr3lq9VpBXgW6n64cP64IXfB4rc5w+hmXP3TKnL7ZpKG5z/IbEVmZq5HbfAF7k6jBtXj3VATfWi4/9ZZxVY3qzm1F2/oEXKAtrlxUwghyRvOgkMonxMGtxZyFpJM=',
'encSecKey':'51ffbd951e86680aaa5c274158822c79f677cf8c41afe815c7df29fcd811172f2587613d5cb9e6367f3032c980cf1d21f1a00fb357ebf62777a9d9babd8f73344155a60e366421725f0bab5da4f134e0a367c8e073e45711e71177243f584728e275563e0c42a1d66db731caf995e5cdb98a044abc871133f809e0eac0f204ae'
}
#cookies = {}
#cookiestr = '__f_=1524688672786; _ntes_nnid=db485c95c6901acef88e90d844e9720e,1524688669236; \
# _ntes_nuid=db485c95c6901acef88e90d844e9720e; \
# _iuqxldmzr_=32; \
# __remember_me=true; \
# __utmc=94650624; \
# __utma=94650624.1055904552.1525039592.1525220980.1525224618.9; \
# __utmz=94650624.1525224618.9.4.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); \
# WM_TID=RRHt%2BF%2BgzZ6%2BbXWgT5BAA7qi%2BFPdLpNy; \
# JSESSIONID-WYYY=bdQAAit65dBMPxeMu767ice6vEaq%2B%2BEYd24EDrz5p67FuZVNQFESJFY\
# gygv%2BQjbIQyJrM4%5C6S3m3PkGF3hiB83n9Cki%2FpSMPRtdSxUQa5qZcBEBJPtvW2GtmpSDf\
# QCdBU1suIorWHVpgdG4fllxpSxtrM6DBZuPBq9XVz749WwEtX2%2Bs%3A1525227547173; \
# MUSIC_U=ba49f310c620d1f4e1ca78948278d9d8cadcc61448045996eb8693a4835a76a6c19\
# a62ba5e2f5e641936fd7b8f93d102a70b41177f9edcea;\
# __csrf=c1641a0a560d3d124a89f0edfd12517c; \
# __utmb=94650624.26.10.1525224618'
#for cookie in cookiestr.split(';'):
# name,value=cookie.strip().split('=',1)
# cookies[name]=value
# index
#res=requests.post(url, headers=header, params=payload, data=formdata, cookies=cookies)
def getSongList():
    # Fetch the playlist and store every track in the info table
    content = None
    try:
        response = requests.post(url, headers=header, data=formdata)
        if response.status_code == requests.codes.ok:
            content = response.text
    except requests.RequestException as e:
        print(e)
        return
    if content is None:
        # Non-200 response: nothing to parse
        return
    lists = json.loads(content)
    tracks = lists.get('playlist').get('tracks')
    for track in tracks:
        #print (track)
        args = []
        args.append(track.get('id'))
        args.append(track.get('name'))
        # Duration in seconds: size in bytes * 8 bits / bitrate in bits per second
        args.append(int(track.get('l').get('size')) * 8 // int(track.get('l').get('br')))
        args.append(track.get('ar')[0].get('name'))
        args.append(track.get('al').get('name'))
        args.append('http://music.163.com/#/song?id={0:d}'.format(args[0]))
        args.append(0)
        sqlite3_execute("INSERT INTO info VALUES (?,?,?,?,?,?,?)", tuple(args))
        getSongDetail(args[0])
        #time.sleep(0.25)

def getSongDetail(id):
    # Fetch the comment count and the hot comments for one song
    url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_{0:d}'.format(id)
    content = None
    try:
        response = requests.post(url, headers=header, data=formdata)
        if response.status_code == requests.codes.ok:
            content = response.text
    except requests.RequestException as e:
        print(e)
        return
    if content is None:
        return
    data = json.loads(content)
    CommentsCount = data.get('total')
    sqlite3_execute("UPDATE info SET comment = ? WHERE id = ?", (CommentsCount, id,))
    hotComments = data.get('hotComments')
    for hotComment in hotComments or []:
        user = hotComment.get('user').get('nickname')
        content = hotComment.get('content').strip().replace(' ','')
        sqlite3_execute("INSERT INTO comment VALUES (?,?,?)", (id, user, content,))

if __name__ == '__main__':
    sqlite3_init()
    # Scrape the data; normally this only needs to be done once
    sqlite3_execute("CREATE TABLE info (id int, name text, duration int, singer text, album text, songurl text, comment int)")
    sqlite3_execute("CREATE TABLE comment (id int, user text, comment text)")
    getSongList()
    # Display the data
    #dr.DisplayResults()
    #sqlite3_execute("DROP TABLE info")
    #sqlite3_execute("DROP TABLE comment")
    sqlite3_close()
DisplayResult.py
from prettytable import PrettyTable
from prettytable import ALL
from prettytable import FRAME
from prettytable import NONE
import os
import re
from Sqlite3api import *
def DisplayResults():
    table = PrettyTable()
    table.field_names = ["ID", "Title", "Duration", "Artist", "Album", "Comments"]
    table.sortby = "Comments"
    table.reversesort = True
    page = 12
    offset = 12
    data = sqlite3_execute("SELECT id, name, duration, singer, album, comment FROM info limit {0:d}".format(page))
    for item in data or []:
        l = list(item)
        # Format the duration in seconds as m:ss
        l[2] = "{0:d}:{1:02d}".format(l[2]//60, l[2] - l[2]//60*60)
        table.add_row(l)
    print(table)
    # Command patterns; the index in this list selects the branch below
    patterns = []
    patterns.append(re.compile(r'show (\d+)', re.S))
    patterns.append(re.compile(r'showrange (\d+),(\d+)', re.S))
    patterns.append(re.compile(r'set page (\d+)', re.S))
    patterns.append(re.compile(r'comment (\d+)', re.S))
    patterns.append(re.compile(r'quit', re.S))
    patterns.append(re.compile(r'help', re.S))
    patterns.append(re.compile(r'next', re.S))
    while True:
        inputstr = input("Enter command: ")
        data = None
        idx = None
        for pattern in patterns:
            data = pattern.findall(inputstr)
            if data:
                idx = patterns.index(pattern)
                break
        # show ID
        if 0 == idx:
            table = PrettyTable()
            table.field_names = ["ID", "Title", "Duration", "Artist", "Album", "Comments"]
            data = sqlite3_execute("SELECT id, name, duration, singer, album, comment FROM info WHERE id={0:d}".format(int(data[0])))
            if data:
                for item in data:
                    l = list(item)
                    l[2] = "{0:d}:{1:02d}".format(l[2]//60, l[2] - l[2]//60*60)
                    table.add_row(l)
            print(table)
        # showrange #1,#2
        elif 1 == idx:
            table = PrettyTable()
            table.field_names = ["ID", "Title", "Duration", "Artist", "Album", "Comments"]
            table.sortby = "Comments"
            table.reversesort = True
            data = sqlite3_execute("SELECT id, name, duration, singer, album, comment FROM info LIMIT {0:d} OFFSET {1:d}".format(int(data[0][1]) - int(data[0][0]), int(data[0][0])))
            if data:
                for item in data:
                    l = list(item)
                    l[2] = "{0:d}:{1:02d}".format(l[2]//60, l[2] - l[2]//60*60)
                    table.add_row(l)
            print(table)
        # set page #
        elif 2 == idx:
            page = int(data[0])
        # comment #
        elif 3 == idx:
            table = PrettyTable()
            table.field_names = ["Comment", "User"]
            table.align["Comment"] = "l"
            table.hrules = ALL
            data = sqlite3_execute("SELECT comment, user FROM comment WHERE id={0:d}".format(int(data[0])))
            if data:
                for item in data:
                    table.add_row(list(item))
            print(table)
        # quit
        elif 4 == idx:
            # Return instead of os._exit(0) so the caller can still close the database
            return
        # help
        elif 5 == idx:
            print(" Commands:")
            print(" next --- print the next page of songs")
            print(" show _id_ --- print the song with the given ID")
            print(" showrange _begin_,_end_ --- print the songs from begin to end")
            print(" set page _num_ --- set the number of songs per page")
            print(" comment _id_ --- print the hot comments of the song with the given ID")
            print(" help --- print this list")
            print(" quit --- exit the program")
        # next
        elif 6 == idx:
            table = PrettyTable()
            table.field_names = ["ID", "Title", "Duration", "Artist", "Album", "Comments"]
            table.sortby = "Comments"
            table.reversesort = True
            data = sqlite3_execute("SELECT id, name, duration, singer, album, comment FROM info LIMIT {0:d} OFFSET {1:d}".format(page, offset))
            offset += page
            # data is None once the last page has been passed
            for item in data or []:
                l = list(item)
                l[2] = "{0:d}:{1:02d}".format(l[2]//60, l[2] - l[2]//60*60)
                table.add_row(l)
            print(table)
        else:
            print("Invalid command")

if __name__ == '__main__':
    sqlite3_init()
    DisplayResults()
    sqlite3_close()