Open the target site; the goal is to extract the data shown in the screenshot.

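Analysis shows that only the review count is in the HTML document; all the other data can be obtained by simulating the page's AJAX requests.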

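Building the mapping dictionary
Inspect the review data, open the CSS file it references, find the font path, and download the font file, which is in woff format.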


Open the woff font file with FontCreator and build the mapping table.
import re

from fontTools.ttLib import TTFont

# Convert the woff file to XML
font = TTFont('D:/谷歌下载/4a.woff')
font.saveXML('D:/谷歌下载/4a.xml')
character = list('1234567890店中美家馆小车大市公酒行国品发电金心业商司超生装园场食有新限天面工服海华水房饰城乐汽香部利子老艺花专东肉菜学福饭人百餐茶务通味所山区门药银农龙停尚安广鑫一容动南具源兴鲜记时机烤文康信果阳理锅宝达地儿衣特产西批坊州牛佳化五米修爱北养卖建材三会鸡室红站德王光名丽油院堂烧江社合星货型村自科快便日民营和活童明器烟育宾精屋经居庄石顺林尔县手厅销用好客火雅盛体旅之鞋辣作粉包楼校鱼平彩上吧保永万物教吃设医正造丰健点汤网庆技斯洗料配汇木缘加麻联卫川泰色世方寓风幼羊烫来高厂兰阿贝皮全女拉成云维贸道术运都口博河瑞宏京际路祥青镇厨培力惠连马鸿钢训影甲助窗布富牌头四多妆吉苑沙恒隆春干饼氏里二管诚制售嘉长轩杂副清计黄讯太鸭号街交与叉附近层旁对巷栋环省桥湖段乡厦府铺内侧元购前幢滨处向座下県凤港开关景泉塘放昌线湾政步宁解白田町溪十八古双胜本单同九迎第台玉锦底后七斜期武岭松角纪朝峰六振珠局岗洲横边济井办汉代临弄团外塔杨铁浦字年岛陵原梅进荣友虹央桂沿事津凯莲丁秀柳集紫旗张谷的是不了很还个也这我就在以可到错没去过感次要比觉看得说常真们但最喜哈么别位能较境非为欢然他挺着价那意种想出员两推做排实分间甜度起满给热完格荐喝等其再几只现朋候样直而买于般豆量选奶打每评少算又因情找些份置适什蛋师气你姐棒试总定啊足级整带虾如态且尝主话强当更板知己无酸让入啦式笑赞片酱差像提队走嫩才刚午接重串回晚微周值费性桌拍跟块调糕')
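with open('D:/谷歌下载/4a.xml', encoding='utf-8') as f:
    content = f.read()
# Glyph names from the GlyphOrder table; the first two entries are skipped
# so that the remaining names line up with `character`
dataset = re.findall(r'<GlyphID id="\d+" name="(.*?)"/>', content)[2:]
# Mapping dictionary from glyph name to character
mapdict = dict(zip(dataset, character))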
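With the dictionary built, obfuscated text from the page can be decoded by turning each character into its glyph name and looking it up. The following is a minimal sketch, assuming the glyph names in the font follow the `uniXXXX` convention (lowercase hex code point) and that the page text has already been unescaped into real characters. `glyph_name`, `decode_text` and the two-entry sample mapping are hypothetical and not part of the original code.

```python
def glyph_name(ch):
    """Glyph name for a character, assuming the 'uni' + 4-digit lowercase hex convention."""
    return 'uni%04x' % ord(ch)


def decode_text(text, mapdict):
    """Replace every character whose glyph name is in mapdict; keep the rest unchanged."""
    return ''.join(mapdict.get(glyph_name(ch), ch) for ch in text)


# Hypothetical two-entry mapping; in practice this is the mapdict built above.
sample_map = {'unie123': '1', 'unif456': '店'}
print(decode_text('\ue123\uf456 ok', sample_map))
```

Characters that are not in the mapping pass through untouched, so ordinary text mixed in with the obfuscated glyphs is preserved.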
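Analyzing the AJAX request parameters
Apart from the token parameter, every parameter can be extracted from the HTML document. The rest of this section focuses on how token is generated.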

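Locate the function that generates it, step through it with breakpoints, extract the main JS functions, and run them with pyexecjs to obtain the token parameter.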
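Dynamic fonts
When the font follows one fixed rule set, it is enough to write the character-to-digit mapping table once. In what ways can a dynamic font vary?
- The simpler case: the glyph outline data stays the same while the characters (code points) assigned to it change. Build a mapping table as before, but use a hash of the glyph data as the key and the digit as the value.
- The harder case: the glyph outline data changes as well. A KNN (k-nearest neighbors) approach could be used here; I have not tried it in practice yet.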
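The first case can be sketched as follows. This is a minimal sketch, assuming the outline of each glyph is available as a list of (x, y) points (with fontTools these could come from `font['glyf'][name].coordinates` for a TrueType font). `glyph_fingerprint`, `build_reference`, `decode_font` and the toy outlines are hypothetical names and data used only for illustration.

```python
import hashlib


def glyph_fingerprint(points):
    """Hash a glyph outline given as a sequence of (x, y) points."""
    raw = ';'.join('%d,%d' % (x, y) for x, y in points)
    return hashlib.md5(raw.encode('ascii')).hexdigest()


def build_reference(known_outlines):
    """Map outline fingerprint -> digit, built once from a font labeled by hand."""
    return {glyph_fingerprint(pts): digit for digit, pts in known_outlines.items()}


def decode_font(outlines_by_name, reference):
    """Map each glyph name of a newly downloaded font to its digit (None if unknown)."""
    return {name: reference.get(glyph_fingerprint(pts))
            for name, pts in outlines_by_name.items()}


# Toy outlines standing in for real glyph data.
known = {'1': [(0, 0), (10, 0), (10, 50)], '7': [(0, 50), (30, 50), (10, 0)]}
reference = build_reference(known)

# A "new" font: same outlines, different glyph names.
new_font = {'unie001': [(0, 50), (30, 50), (10, 0)],
            'unie002': [(0, 0), (10, 0), (10, 50)]}
print(decode_font(new_font, reference))
```

Hashing only works while the outlines are exactly identical. Once the coordinates themselves are perturbed, the fingerprints no longer match and a distance-based comparison such as the KNN idea is needed instead.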
Dynamic fonts are delivered as base64-encoded font data, which can be processed like this:
# -*- coding: utf-8 -*-
import base64
from fontTools.ttLib import TTFont
from io import BytesIO
font_face = 'AAEAAAAKAIAAAwAgT1MvMv4RZcIAAAEoAAAAYGNtYXDp6MdpAAABpAAAAYpnbHlmBtQkKgAAA0gAAAQCaGVhZBaqEMUAAACsAAAANmhoZWEGzQE2AAAA5AAAACRobXR4ArwAAAAAAYgAAAAabG9jYQRpBWwAAAMwAAAAGG1heHABGABFAAABCAAAACBuYW1lUGhGMAAAB0wAAAJzcG9zdC/iblsAAAnAAAAAiAABAAAAAQAAe/5KEF8PPPUACQPoAAAAANnIUd8AAAAA2297PAAU/+wCQQLZAAAACAACAAAAAAAAAAEAAAQk/qwAfgJYAAAALwIpAAEAAAAAAAAAAAAAAAAAAAACAAEAAAALADkAAwAAAAAAAgAAAAoACgAAAP8AAAAAAAAABAIqAZAABQAIAtED0wAAAMQC0QPTAAACoABEAWkAAAIABQMAAAAAAAAAAAAAEAAAAAAAAAAAAAAAUGZFZABAooX1NAQk/qwAfgQkAVQAAAABAAAAAAAAAAAAAAAgAAAAZAAAAlgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAMAAAAcAAEAAAAAAIQAAwABAAAAHAAEAGgAAAAWABAAAwAGooWjJqWRpxKzGbl45HnkifGU9TT//wAAooWjJqWRpxKzGbl45HnkifGU9TT//119XOJadljyTOxGiRuRG30OdQrPAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwAhAD4AfACtAMAA7gFEAYcByAIBAAEAFP/sADIAFAACAAA3MxUUHhQoAAABAG8AAAFpAsoACQAAAQYGBxU2NxEzEQEpI2YxZUNSAsooPg5SHkT9mgLKAAIAGAAAAkECygAKAA4AAAEBFSEVMzUzNSMRBzMRIQGA/pgBZ050dFED/t4Cyv4mTqKiQwHla/6GAAACADP/8gImAtgAHAAoAAABIgYVFBcWMzY2NzMXFAcGIyInIxYzNjc2NTQnJgcyFxYUBiMiJjU0NgEkaYg9P2c/YhsEAS8xU34WUR3HeEhDP0J/Si0sXkVLV1kC2ItqZ0FEAT42GnhQU3q/AW9sqqZaYEUzMJVgXUtOYgAAAQA+AAACGgLYAB0AAAEiBgczNjc2FzIWFRQHBgcGBwYVITUhNjc2NzY0JgE2bIUBUgIqJ0pGTjccVHEqRwHc/okUiG8lRoAC2I94XzAzAUdCRTsdPE4xT2JJSFxMJ0u7cgABAEIAAAIXAsoABgAAExUhATMBNUIBgf72VwEHAspL/YECh0MAAAIAMv/yAiYC2QAMABkAAAEmBwYQFxYgNzYQJyYHMhcWFAcGIicmNDc2ASyBQDk5QAEAQjk5Qn9gKh8fKsAqHh4qAtgBcmD+vGBxcWABRGByR2dI+kpmZkr6SGcAAwAq//ICLgLYAB8ALAA4AAABIgcGFRQXFhcVBgcGFRQWMjc2NTQnJic1Njc2NTQnJgcyFxYUBwYiJyY0NzYTMhcWFAcGIiY0NzYBLHA/OhscOTgnKob3RUIqJzg3Hhs6P3BLLCUhKKcnISUqTVYwKysvrlosLgLYOjVOOScqFAIOMDNFX3Q7OV9FMzAOAhQqJzlONTpDJyFrIicnImshJ/7DKyeAJypRgCcrAAE
AM//yAiYC2AArAAABIgcGBzM2Nhc2FxYUBiMjFTM2FhQHBiMmJyYnIxYXFhcyNjU0JyYnNjU0JgEzZj9CClEHUkhGJyVLSDc6S1MrLk1DLTUDUwlMQGZviSQhP3V9Atg6OmhITgEBJCF4QUABSH8qLAEkLFR4PzMBfWE/KioTKHhaaAAAAgAz//ICJgLZABsAKAAAASIHBhUUFxYXNjY0JiMiBwYHIyc0NzYXNhczJgM2FxYUBwYHIicmNDYBNnlHQz5CgmmIfWY/MDIbBAEvMFR8GFEeyUktLC0tSEotLF0C2HBrq6NcYAEBi9KDHx44GnpOVAEBe8D+tgEwLZgzMAEzMJZfAAEAM//yAiUCygAkAAATAzM2NzYzNhYVFAYjIicmJyMWFxYzMjc2NTQmIyYHBgcjNyE1aSZOFiooM05aYktCKzAGUQdLQl9qSE1/ZjEqLx4EGAFcAsr+disVFgFeVktgICREYjkzQUVsc4IBEhIk70kAAAAAAAASAN4AAQAAAAAAAAAXAAAAAQAAAAAAAQAMABcAAQAAAAAAAgAHACMAAQAAAAAAAwAUACoAAQAAAAAABAAUACoAAQAAAAAABQALAD4AAQAAAAAABgAUACoAAQAAAAAACgArAEkAAQAAAAAACwATAHQAAwABBAkAAAAuAIcAAwABBAkAAQAYALUAAwABBAkAAgAOAM0AAwABBAkAAwAoANsAAwABBAkABAAoANsAAwABBAkABQAWAQMAAwABBAkABgAoANsAAwABBAkACgBWARkAAwABBAkACwAmAW9DcmVhdGVkIGJ5IGZvbnQtY2Fycmllci5QaW5nRmFuZyBTQ1JlZ3VsYXIuUGluZ0ZhbmctU0MtUmVndWxhclZlcnNpb24gMS4wR2VuZXJhdGVkIGJ5IHN2ZzJ0dGYgZnJvbSBGb250ZWxsbyBwcm9qZWN0Lmh0dHA6Ly9mb250ZWxsby5jb20AQwByAGUAYQB0AGUAZAAgAGIAeQAgAGYAbwBuAHQALQBjAGEAcgByAGkAZQByAC4AUABpAG4AZwBGAGEAbgBnACAAUwBDAFIAZQBnAHUAbABhAHIALgBQAGkAbgBnAEYAYQBuAGcALQBTAEMALQBSAGUAZwB1AGwAYQByAFYAZQByAHMAaQBvAG4AIAAxAC4AMABHAGUAbgBlAHIAYQB0AGUAZAAgAGIAeQAgAHMAdgBnADIAdAB0AGYAIABmAHIAbwBtACAARgBvAG4AdABlAGwAbABvACAAcAByAG8AagBlAGMAdAAuAGgAdAB0AHAAOgAvAC8AZgBvAG4AdABlAGwAbABvAC4AYwBvAG0AAAIAAAAAAAAADgAAAAAAAAAAAAAAAAAAAAAAAAAAAAsACwAAAQIBBQEKAQMBCAELAQkBBAEHAQYHdW5pYjk3OAd1bmlhNzEyB3VuaWEzMjYHdW5pYTI4NQd1bmllNDc5B3VuaWYxOTQHdW5pYjMxOQd1bmlhNTkxB3VuaWY1MzQHdW5pZTQ4OQ=='
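b = base64.b64decode(font_face)
font = TTFont(BytesIO(b))  # read the font from in-memory bytes
cmap = font.getBestCmap()  # code point -> glyph name
print(cmap)
# The ./fonts directory must already exist. The decoded bytes start with
# 00 01 00 00, i.e. raw TrueType data, although the file is named .woff here.
with open('./fonts/font1.woff', 'wb') as f:
    f.write(b)
font.saveXML(r'./fonts/font1.xml')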
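The base64 string usually has to be pulled out of the page before it can be decoded. The following is a minimal sketch, assuming the font is embedded in a `@font-face` rule as a `data:` URI. The regex, the `extract_font_bytes` helper and the sample CSS are assumptions for illustration, and a real page may embed the font differently.

```python
import base64
import re

# Matches url(data:<mime>;...base64,<payload>) with optional quotes around the URI.
DATA_URI_RE = re.compile(
    r'url\(\s*[\'"]?data:[^;,)]*;[^,)]*base64,([A-Za-z0-9+/=]+)'
)


def extract_font_bytes(css_text):
    """Return the decoded bytes of the first base64 data URI, or None if absent."""
    match = DATA_URI_RE.search(css_text)
    if match is None:
        return None
    return base64.b64decode(match.group(1))


# Shortened sample: only the first 8 base64 characters of a font.
sample_css = ("@font-face{font-family:'fake';"
              "src:url('data:font/ttf;base64,AAEAAAAK') format('truetype');}")
data = extract_font_bytes(sample_css)
print(data[:4])
```

The returned bytes can be passed to `TTFont(BytesIO(...))` in the same way as `b` above.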
A relative-positioning example, recorded for reference
def get_numbers(num, index):
    # Rebuild the displayed number: each digit in num gets a sort key derived
    # from its relative offset in index, then the digits are joined in key order.
    print(num, index)
    num1 = {}
    Flag = False
    MaxInterval = 0
    for k, v in enumerate(num):
        if k == 0:
            num1[index[k]] = v
            if index[k] in (18, 27):
                Flag = True
            MaxInterval = index[k] + MaxInterval
            continue
        if k == 1:
            if Flag:
                num1[index[k] + MaxInterval - 9 + 0.1] = v
                Flag = False
            else:
                num1[MaxInterval + index[k] + 0.1] = v
            if index[k] == 18:
                Flag = True
            if '-' not in str(index[k]):
                MaxInterval = MaxInterval + index[k]
            continue
        if k == 2:
            if Flag:
                num1[index[k] - 9 + MaxInterval + 0.11] = v
                Flag = False
            else:
                num1[index[k] + MaxInterval + 0.11] = v
            if '-' not in str(index[k]):
                MaxInterval = MaxInterval + index[k]
            continue
        if k == 3:
            if Flag:
                num1[index[k] - 9 + MaxInterval + 0.12] = v
            else:
                num1[index[k] + MaxInterval + 0.12] = v
    # Sort the keys so the digits come out in their rendered order
    keylist = list(num1.keys())
    keylist.sort()
    numlist = [str(num1[key]) for key in keylist]
    numbers = ''.join(numlist)
    print(numbers)
    return numbers
References
An In-Depth Look at Font-Based Anti-Scraping (深入解读字体反爬虫)