Scraping Maoyan Movie Ratings (Font-Based Anti-Scraping)


Author: 凉风有信2020 | Published: 2020-08-10 20:23

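I recently picked up a lot of new things from some experts, and one of them is font-based anti-scraping. It is a relatively new anti-scraping technique: the site renders key pieces of information on a page with a custom font so that a scraper cannot read them directly. Taking Maoyan as an example, I have personally seen at least five versions of its anti-scraping scheme, and it keeps being updated, so scraping it gets harder and harder. This post records how I went about it.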

First of all, the scraper has to send request headers. Without them you get banned for a day (don't ask me how I know).

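Even with request headers, frequent requests still get you banned for a day, so I build a pool of User-Agent strings and send a randomly chosen one with each request.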

import random

headers_pool = [
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; XH; rv:8.578.498) fr, Gecko/20121021 Camino/8.723+ (Firefox compatible)',
    'Mozilla/5.0 (X11; U; Linux i686; nl; rv:1.8.1b2) Gecko/20060821 BonEcho/2.0b2 (Debian-1.99+2.0b2+dfsg-1)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; Avant Browser; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
    'Mozilla/5.0 (X11; U; UNICOS lcLinux; en-US) Gecko/20140730 (KHTML, like Gecko, Safari/419.3) Arora/0.8.0',
    'Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)',
    'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)'
]
# Pick one User-Agent string at random; it is sent later as headers={"User-Agent": headers}
headers = random.choice(headers_pool)
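
A minimal sketch of how the randomly chosen User-Agent could be turned into a headers dictionary. The helper name pick_headers and the two short User-Agent strings are placeholders for illustration and are not part of the original script; the resulting dictionary is what would be passed as headers= to requests.get.

```python
import random

def pick_headers(pool, rng=random):
    """Return a headers dict with one User-Agent chosen at random from pool."""
    return {"User-Agent": rng.choice(pool)}

# Placeholder pool for illustration; in the article this is headers_pool.
sample_pool = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]
headers = pick_headers(sample_pool)
print(headers)
```

Calling pick_headers once per request, rather than once per run, is what actually rotates the User-Agent.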

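Inspecting the page shows that values such as the rating cannot be read. My guess was that UTF-8 could not decode them, but the real reason is the font: these spots use code points that only the site's own font knows how to draw. The code points are valid Unicode (they sit in the Private Use Area, for example U+E8DB), but no standard font has glyphs for them, so they do not display properly unless the site's font is loaded.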


Figure: the Maoyan rating does not display

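Looking further at the page source, each of these characters is written as an HTML numeric character reference: &# starts the reference, and the x means the number that follows is hexadecimal.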
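
To make the notation concrete, here is a small sketch using Python's standard html module. It assumes the page source contains a reference such as &#xe8db; (the value is borrowed from the glyph name uniE8DB that appears in the result at the end of this post) and shows how it relates to a code point and to a glyph name of the form uniXXXX. The function name entity_to_glyph_name is made up for this example.

```python
import html

def entity_to_glyph_name(entity):
    """Convert an HTML numeric character reference to a 'uniXXXX' glyph name."""
    char = html.unescape(entity)     # "&#xe8db;" -> the single character U+E8DB
    return "uni%04X" % ord(char)     # code point as 4-digit upper-case hex

name = entity_to_glyph_name("&#xe8db;")
print(name)  # uniE8DB
```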


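Figure: source code of the rating section
According to related blog posts, on sites like this the font file is declared in an @font-face rule inside a style tag.
Figure: link to the font file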

Download the font file from that link:
//vfile.meituan.net/colorstone/7069663754cc604f99f5b05d6791d33d2276.woff

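import re
import requests
from urllib import parse

# soup is the BeautifulSoup object of the movie page; its first <style> holds the @font-face rule
stone_font = soup.select("style")[0]
url = parse.urljoin("https://", re.search(r"//[^\s)]+?\.woff", str(stone_font)).group())
res = requests.get(url)
with open("font.woff", "wb") as f:
    f.write(res.content)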
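
The URL extraction step can be tried without any network access. The sketch below runs a regular expression over a hand-written @font-face snippet. The CSS text is an assumed shape built around the link shown above (the font-family name and the .eot line are invented, the latter to show why a greedy pattern could swallow more than one URL), and extract_woff_url is a hypothetical helper name.

```python
import re

def extract_woff_url(css_text):
    """Return the first .woff URL in css_text with an https: scheme, or None."""
    match = re.search(r"//[^\s)]+?\.woff", css_text)
    return "https:" + match.group() if match else None

sample_css = """
@font-face {
  font-family: stonefont;
  src: url('//vfile.meituan.net/colorstone/example.eot');
  src: url('//vfile.meituan.net/colorstone/7069663754cc604f99f5b05d6791d33d2276.woff') format('woff');
}
"""
woff_url = extract_woff_url(sample_css)
print(woff_url)
```

Excluding whitespace and the closing parenthesis from the match keeps it inside a single url(...) entry.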

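The font file can then be opened with FontCreator, where you can see that the digits 0-9 have been given special code points. Here is the key point: Maoyan's font is called dynamic because the mapping changes all the time. Two visits return different font files, and the same digit maps to a different code point in each. What is nastier is that even the glyph outlines change slightly. This does not affect what we see, but it makes it hard for a program to recognize glyphs that differ only slightly.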


Figure: FontCreator

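To analyze the font file further we need the Python library fontTools. Open the font file (.woff) with fontTools and, after identifying and matching the glyphs by eye in FontCreator, record which digit 0-9 each glyph in font set 1 stands for.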

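from fontTools.ttLib import TTFont

base_woff = TTFont("font.woff")
# Skip the first two glyphs; they are not digits
base_uni_list = base_woff.getGlyphOrder()[2:]
# Digit for each glyph in base_uni_list, matched by eye in FontCreator
base_val_list = ["0", "9", "8", "1", "6", "3", "5", "2", "4", "7"]
for uni1 in base_uni_list:
    obj1 = base_woff['glyf'][uni1]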

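Each obj1 in the loop is a single glyph object. The first two glyphs are not ones we need, so they are dropped. Now refresh the page, download the new font file, and open it: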

stone_font = soup.select("style")[0]
url = parse.urljoin("https://", re.search(r"//[^\s)]+?\.woff", str(stone_font)).group())
res = requests.get(url)
with open("font2.woff", "wb") as f:
    f.write(res.content)

current_woff = TTFont("font2.woff")
current_uni_list = current_woff.getGlyphOrder()[2:]
for uni2 in current_uni_list:
    obj2 = current_woff['glyf'][uni2]

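The .coordinates attribute of a glyph object gives the glyph's list of outline points. obj1 is our reference set and obj2 is the set to be identified. The ten lists from font.woff come first, followed by the ten from font2.woff: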

[(43, 333), (43, 468), (69, 544), (103, 624), (148, 667), (202, 704), (282, 716), (398, 710), (459, 617), (487, 574), (504, 508), (509, 447), (509, 335), (520, 271), (514, 213), (508, 166), (502, 126), (466, 46), (362, -39), (282, -39), (176, -39), (115, 37), (38, 127), (55, 335), (135, 350), (135, 154), (166, 95), (218, 36), (282, 35), (343, 35), (375, 96), (429, 155), (428, 515), (385, 576), (344, 629), (281, 635), (212, 635), (181, 569), (148, 507), (135, 335)]
[(142, 151), (159, 89), (181, 71), (215, 35), (264, 35), (307, 35), (349, 54), (368, 66), (389, 106), (408, 137), (422, 191), (428, 231), (432, 245), (435, 273), (435, 300), (435, 308), (435, 313), (434, 319), (409, 276), (360, 249), (309, 223), (266, 223), (168, 223), (43, 354), (43, 462), (50, 574), (108, 655), (173, 710), (273, 710), (343, 710), (463, 633), (494, 551), (524, 489), (524, 356), (509, 214), (494, 131), (450, 49), (403, 4), (343, -39), (262, -39), (175, -39), (122, 9), (67, 57), (56, 144), (425, 469), (425, 544), (383, 587), (341, 635), (284, 635), (223, 635), (135, 539), (131, 459), (135, 387), (177, 355), (220, 301), (344, 301), (425, 391), (425, 466)]
[(181, 371), (71, 412), (71, 521), (71, 614), (128, 656), (185, 710), (378, 710), (436, 654), (495, 583), (495, 524), (499, 412), (386, 371), (454, 349), (495, 300), (524, 251), (524, 184), (509, 88), (458, 25), (391, -39), (174, -39), (108, 25), (42, 86), (31, 186), (42, 257), (114, 354), (163, 524), (163, 470), (189, 439), (230, 406), (283, 406), (331, 414), (369, 439), (402, 470), (402, 567), (368, 601), (334, 626), (283, 635), (229, 650), (197, 603), (163, 570), (134, 186), (134, 146), (141, 111), (169, 75), (207, 55), (242, 35), (284, 35), (301, 46), (343, 46), (369, 57), (390, 77), (432, 112), (432, 183), (432, 248), (346, 332), (274, 332), (217, 332), (175, 290), (121, 249)]
[(381, -25), (292, -26), (292, 550), (259, 516), (206, 486), (153, 468), (106, 426), (120, 539), (186, 570), (244, 625), (271, 644), (292, 660), (324, 674), (330, 713), (381, 710), (382, -32)]
[(420, 521), (408, 562), (386, 597), (349, 635), (296, 635), (250, 635), (220, 612), (177, 580), (154, 522), (148, 449), (128, 407), (134, 352), (161, 408), (207, 425), (254, 449), (306, 449), (390, 449), (459, 382), (522, 316), (522, 211), (522, 143), (493, 83), (453, 24), (412, -7), (360, -39), (293, -39), (180, -27), (110, 43), (39, 124), (39, 317), (48, 530), (117, 626), (195, 710), (301, 710), (388, 710), (443, 661), (498, 614), (519, 528), (142, 211), (142, 166), (175, 122), (182, 79), (232, 57), (253, 35), (292, 35), (347, 35), (389, 81), (421, 127), (438, 206), (430, 273), (379, 326), (349, 370), (288, 370), (216, 370), (184, 326), (142, 282)]
[(128, 180), (148, 103), (190, 69), (221, 35), (276, 35), (340, 31), (384, 79), (427, 123), (427, 250), (387, 290), (345, 329), (284, 331), (257, 331), (220, 321), (230, 400), (236, 400), (243, 397), (245, 399), (303, 399), (394, 459), (394, 522), (394, 571), (330, 635), (274, 635), (222, 635), (187, 604), (153, 558), (142, 504), (52, 520), (76, 612), (127, 661), (186, 719), (272, 710), (332, 710), (371, 683), (433, 659), (487, 569), (487, 520), (501, 462), (462, 432), (447, 394), (386, 371), (451, 356), (523, 262), (523, 190), (523, 109), (466, 31), (383, -40), (276, -33), (179, -40), (110, 18), (46, 75), (43, 167), (134, 180)]
[(137, 173), (147, 103), (222, 35), (278, 35), (343, 35), (433, 136), (433, 226), (433, 292), (385, 341), (345, 387), (276, 380), (232, 380), (198, 366), (164, 341), (143, 297), (58, 320), (120, 697), (494, 697), (494, 611), (202, 611), (160, 414), (217, 436), (264, 460), (300, 460), (396, 460), (463, 398), (528, 328), (519, 223), (514, 124), (470, 51), (399, -39), (286, -39), (186, -39), (126, 17), (56, 74), (43, 170), (143, 173)]
[(515, 60), (515, -26), (31, -26), (41, 6), (42, 24), (52, 62), (66, 86), (89, 110), (100, 134), (121, 158), (179, 212), (208, 245), (278, 294), (358, 370), (380, 400), (422, 455), (422, 508), (415, 558), (384, 598), (345, 635), (284, 635), (219, 622), (180, 597), (133, 559), (140, 488), (48, 498), (43, 603), (108, 653), (180, 710), (391, 710), (453, 648), (514, 608), (514, 506), (514, 460), (480, 381), (436, 330), (416, 307), (381, 274), (361, 242), (298, 201), (270, 167), (231, 152), (206, 121), (182, 97), (173, 85), (163, 79), (156, 60)]
[(331, -13), (336, 161), (-1, 149), (16, 246), (347, 713), (421, 701), (412, 232), (534, 231), (511, 163), (416, 149), (421, -26), (331, -18), (331, 232), (336, 552), (101, 246), (321, 232)]
[(48, 625), (48, 696), (523, 698), (523, 627), (486, 590), (441, 546), (404, 491), (388, 429), (347, 367), (297, 241), (276, 175), (252, 83), (246, -26), (151, -11), (152, 17), (162, 69), (169, 114), (173, 183), (209, 306), (278, 419), (310, 478), (374, 572), (402, 624), (49, 611)]

[(515, 60), (515, -26), (31, -26), (31, 6), (42, 36), (41, 62), (66, 86), (80, 110), (121, 158), (147, 185), (218, 245), (278, 294), (333, 332), (358, 370), (385, 400), (422, 457), (422, 508), (422, 566), (384, 598), (345, 635), (284, 635), (219, 635), (181, 596), (141, 559), (151, 488), (48, 498), (73, 603), (110, 656), (180, 710), (391, 710), (514, 593), (514, 500), (514, 467), (497, 420), (480, 381), (436, 330), (429, 307), (381, 274), (347, 242), (298, 201), (261, 167), (245, 144), (206, 121), (183, 122), (182, 104), (173, 74), (164, 73), (140, 60), (515, 57)]
[(410, 521), (408, 574), (386, 597), (349, 650), (296, 635), (253, 635), (220, 612), (177, 580), (154, 522), (134, 492), (121, 449), (128, 407), (128, 352), (161, 406), (207, 425), (257, 449), (306, 449), (395, 449), (522, 316), (527, 219), (522, 143), (493, 83), (463, 12), (412, -7), (360, -39), (293, -39), (180, -39), (39, 124), (39, 317), (39, 530), (117, 626), (185, 710), (301, 710), (388, 710), (443, 661), (498, 614), (510, 533), (420, 521), (142, 209), (142, 166), (182, 79), (226, 57), (253, 35), (292, 35), (347, 50), (389, 81), (426, 127), (430, 206), (430, 281), (395, 321), (344, 372), (288, 370), (228, 370), (180, 326), (142, 291), (142, 211)]
[(374, -26), (280, -26), (288, 547), (265, 529), (206, 486), (156, 462), (125, 439), (112, 526), (190, 545), (234, 617), (258, 637), (292, 662), (303, 686), (325, 720), (367, 719), (382, -22)]
[(137, 173), (147, 103), (185, 69), (219, 44), (278, 35), (343, 35), (388, 98), (433, 136), (439, 214), (428, 292), (399, 336), (345, 380), (276, 392), (232, 380), (198, 361), (159, 332), (143, 310), (58, 320), (129, 689), (494, 697), (494, 622), (202, 611), (162, 414), (195, 437), (264, 460), (305, 450), (396, 460), (463, 394), (539, 328), (519, 223), (528, 134), (465, 51), (399, -34), (278, -39), (177, -39), (115, 17), (65, 74), (43, 166)]
[(48, 615), (48, 698), (523, 707), (523, 627), (495, 580), (461, 541), (420, 491), (383, 429), (344, 367), (338, 295), (297, 241), (278, 175), (252, 83), (244, -26), (151, -26), (152, 17), (161, 69), (169, 109), (185, 183), (217, 310), (265, 406), (296, 478), (341, 525), (374, 583), (418, 611), (48, 596)]
[(134, 151), (154, 89), (215, 35), (264, 35), (307, 35), (368, 73), (408, 137), (422, 191), (428, 217), (435, 273), (435, 296), (435, 308), (435, 299), (434, 319), (409, 276), (360, 249), (315, 223), (259, 220), (168, 223), (43, 354), (43, 570), (108, 641), (173, 710), (288, 710), (343, 710), (463, 633), (508, 562), (525, 489), (524, 356), (524, 214), (464, 49), (403, 4), (343, -32), (262, -29), (175, -39), (122, 9), (67, 57), (56, 144), (142, 151), (425, 466), (421, 544), (389, 589), (341, 635), (290, 635), (223, 648), (179, 586), (135, 539), (135, 459), (135, 387), (185, 345), (220, 301), (344, 301), (384, 345), (428, 391)]
[(133, 173), (148, 103), (185, 69), (221, 43), (276, 35), (340, 35), (398, 79), (427, 123), (427, 250), (387, 290), (345, 331), (284, 331), (257, 331), (228, 321), (230, 395), (236, 400), (239, 399), (245, 399), (303, 399), (348, 428), (408, 459), (394, 522), (394, 571), (346, 604), (319, 635), (274, 635), (222, 635), (153, 572), (142, 504), (52, 520), (69, 605), (127, 660), (188, 724), (272, 710), (332, 700), (383, 683), (447, 664), (460, 614), (487, 569), (487, 520), (487, 481), (462, 432), (436, 394), (386, 371), (451, 356), (487, 313), (523, 262), (523, 190), (523, 95), (453, 27), (397, -52), (276, -42), (184, -40), (116, 18), (52, 75), (43, 167), (133, 180)]
[(43, 335), (41, 468), (69, 544), (95, 624), (202, 710), (282, 710), (398, 722), (459, 617), (487, 574), (520, 442), (509, 349), (520, 271), (514, 222), (508, 166), (494, 126), (466, 55), (401, 4), (362, -39), (282, -49), (190, -49), (115, 29), (43, 127), (135, 335), (135, 154), (218, 35), (282, 35), (341, 29), (385, 96), (418, 155), (428, 335), (428, 515), (385, 561), (344, 635), (281, 643), (218, 635), (181, 583), (135, 515), (135, 348)]
[(323, -26), (335, 151), (22, 149), (14, 230), (332, 703), (416, 707), (435, 227), (525, 233), (520, 149), (421, 155), (412, -41), (331, -25), (331, 232), (331, 551), (96, 232), (331, 241)]
[(181, 371), (71, 412), (71, 521), (71, 603), (185, 710), (378, 710), (436, 647), (495, 596), (495, 519), (481, 412), (386, 373), (454, 349), (489, 291), (523, 251), (524, 184), (524, 98), (458, 25), (379, -52), (283, -39), (174, -39), (108, 25), (42, 86), (42, 186), (42, 257), (114, 354), (181, 386), (163, 524), (173, 470), (197, 439), (230, 406), (284, 406), (333, 406), (369, 439), (402, 470), (402, 567), (334, 635), (283, 642), (229, 635), (163, 570), (134, 198), (134, 140), (166, 111), (169, 75), (220, 55), (242, 35), (284, 35), (316, 35), (369, 57), (390, 77), (432, 117), (432, 248), (346, 332), (281, 332), (232, 330), (175, 290), (148, 249), (134, 188)]

As you can see, obj1[0] and obj2[7] are very similar but not identical. Pulling these two out:

[(43, 333), (43, 468), (69, 544), (103, 624), (148, 667), (202, 704), (282, 716), (398, 710), (459, 617), (487, 574), (504, 508), (509, 447), (509, 335), (520, 271), (514, 213), (508, 166), (502, 126), (466, 46), (362, -39), (282, -39), (176, -39), (115, 37), (38, 127), (55, 335), (135, 350), (135, 154), (166, 95), (218, 36), (282, 35), (343, 35), (375, 96), (429, 155), (428, 515), (385, 576), (344, 629), (281, 635), (212, 635), (181, 569), (148, 507), (135, 335)]
[(43, 335), (41, 468), (69, 544), (95, 624), (202, 710), (282, 710), (398, 722), (459, 617), (487, 574), (520, 442), (509, 349), (520, 271), (514, 222), (508, 166), (494, 126), (466, 55), (401, 4), (362, -39), (282, -49), (190, -49), (115, 29), (43, 127), (135, 335), (135, 154), (218, 35), (282, 35), (341, 29), (385, 96), (418, 155), (428, 335), (428, 515), (385, 561), (344, 635), (281, 643), (218, 635), (181, 583), (135, 515), (135, 348)]

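The data are very similar but slightly different, and the two lists also differ in length, which means the number of points used to draw the digit differs too, so judging similarity is awkward. The solutions in the blog posts I found all seem to target earlier versions of the site: they compare glyph objects directly, or compare glyphs that have the same number of points. I did not find a more complete approach. Here I use the k-nearest neighbors (KNN) algorithm (which I only just learned). The results are decent but not perfect, so treat this as a reference.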

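First, we only have one training set, and each item in it is its own class, so k is 1. I wonder whether adding several more training sets would help, but that is for later.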

import numpy as np
from fontTools.ttLib import TTFont

obj1 = TTFont("font.woff")
obj1_uni = obj1.getGlyphOrder()[2:]
obj1_points = [0]*len(obj1_uni)
for i in range(len(obj1_uni)):
    obj1_points[i] = list(obj1['glyf'][obj1_uni[i]].coordinates)
    for index in range(len(obj1_points[i])):
        obj1_points[i][index] = sum(obj1_points[i][index])
obj2 = TTFont("font2.woff")
obj2_uni = obj2.getGlyphOrder()[2:]
obj2_points = [0] * len(obj2_uni)
for i in range(len(obj2_uni)):
    obj2_points[i] = list(obj2['glyf'][obj2_uni[i]].coordinates)
    for index in range(len(obj2_points[i])):
        obj2_points[i][index] = sum(obj2_points[i][index])

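Each coordinate has two components, so I could not put it into the matrix directly here. I therefore sum them and use x + y in place of the point's coordinates.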
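
Summing x and y is convenient but lossy: two different points can collapse to the same number. As a possible alternative (an assumption, not what the code in this post does), each point could be flattened into two consecutive values instead. The helper names below are illustrative.

```python
def sum_points(points):
    """Replace each (x, y) point by x + y, as done in this post."""
    return [x + y for x, y in points]

def flatten_points(points):
    """Keep both coordinates by writing them one after the other."""
    return [value for point in points for value in point]

points = [(43, 333), (333, 43)]   # two different points with the same sum
print(sum_points(points))         # [376, 376]
print(flatten_points(points))     # [43, 333, 333, 43]
```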

# Length of the longest point list across both fonts
len_max = max(len(points) for points in obj1_points + obj2_points)
# Right-pad every shorter list with zeros up to that length (in place)
for points in obj1_points + obj2_points:
    points.extend([0] * (len_max - len(points)))

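This finds the length of the longest point list and pads every other list to that length (data initialization). Next, KNN finds, for each unknown point set, the most similar point set among those whose digits we already know.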

def knn(trainData, testData):
    rowSize = trainData.shape[0]
    diff = np.tile(testData, (rowSize, 1)) - trainData
    sqrDiff = diff ** 2
    sqrDiffSum = sqrDiff.sum(axis=1)
    distances = sqrDiffSum ** 0.5
    sortDistance = distances.argsort()
    return sortDistance[0]

# knn must be defined before it is called
a = np.array(obj1_points)  # reference glyphs, one row per glyph
for i in range(len(obj2_uni)):
    b = np.array(obj2_points[i])
    x = knn(a, b)  # index of the closest reference glyph
    print(x)

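np.array(): builds an array (matrix)
array.shape[0]: number of rows in the array
np.tile(testData, (rowSize, 1)) - trainData: repeats testData rowSize times along the row axis, then subtracts trainData.
diff ** 2: squares each element
sqrDiff.sum(axis=1): sums (axis=1 sums each row)
sqrDiffSum ** 0.5: takes the square root of each sum
distances.argsort(): returns the indices that would sort the array in ascending order (the index of the smallest value comes first)
sortDistance[0]: index of the smallest distance (that is, of the most similar point set)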
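
A tiny worked example may help to follow these steps. The arrays here are made-up rows of two values rather than real glyph data, and nearest_row is just the knn function above under another name so that the snippet runs on its own.

```python
import numpy as np

def nearest_row(train, test):
    """Return the index of the row in train closest to test (Euclidean distance)."""
    row_count = train.shape[0]
    diff = np.tile(test, (row_count, 1)) - train   # one copy of test per training row
    distances = ((diff ** 2).sum(axis=1)) ** 0.5   # one distance per row
    return int(distances.argsort()[0])             # index of the smallest distance

train = np.array([[0, 0], [3, 4], [10, 10]])
test = np.array([3, 5])
print(np.tile(test, (3, 1)).shape)   # (3, 2)
print(nearest_row(train, test))      # 1
```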

The result I got (index of the matched obj1 glyph for obj2[0] through obj2[9]):
7,4,3,6,9,5,5,0,8,2
It should be:
7,4,3,6,9,1,5,0,8,2
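
Because every digit should be matched exactly once, a wrong result like this can often be spotted automatically: a valid answer is a permutation of 0-9, so a repeated index means at least one glyph was misclassified. A small sketch using the two lists above (the function names are illustrative):

```python
from collections import Counter

def repeated_indices(predicted):
    """Indices that were assigned to more than one glyph."""
    return sorted(i for i, n in Counter(predicted).items() if n > 1)

def mismatches(predicted, expected):
    """Positions where the prediction differs, as (position, got, wanted)."""
    return [(pos, p, e)
            for pos, (p, e) in enumerate(zip(predicted, expected)) if p != e]

predicted = [7, 4, 3, 6, 9, 5, 5, 0, 8, 2]
expected = [7, 4, 3, 6, 9, 1, 5, 0, 8, 2]
print(repeated_indices(predicted))       # [5]
print(mismatches(predicted, expected))   # [(5, 5, 1)]
```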

Let's look at the data (coordinate sums, zero-padded):
obj2[5], obj2[6]:

[285, 243, 250, 299, 342, 441, 545, 613, 645, 708, 731, 743, 734, 753, 685, 609, 538, 479, 391, 397, 613, 749, 883, 998, 1053, 1096, 1070, 1014, 880, 738, 513, 407, 311, 233, 136, 131, 124, 200, 293, 891, 965, 978, 976, 925, 871, 765, 674, 594, 522, 530, 521, 645, 729, 819, 0, 0, 0, 0, 0]
[306, 251, 254, 264, 311, 375, 477, 550, 677, 677, 676, 615, 588, 549, 625, 636, 638, 644, 702, 776, 867, 916, 965, 950, 954, 909, 857, 725, 646, 572, 674, 787, 912, 982, 1032, 1066, 1111, 1074, 1056, 1007, 968, 894, 830, 757, 807, 800, 785, 713, 618, 480, 345, 234, 144, 134, 127, 210, 313, 0, 0]

obj1[1], obj1[5]:

[293, 248, 252, 250, 299, 342, 403, 434, 495, 545, 613, 659, 677, 708, 735, 743, 748, 753, 685, 609, 532, 489, 391, 397, 505, 624, 763, 883, 983, 1053, 1096, 1045, 1013, 880, 723, 625, 499, 407, 304, 223, 136, 131, 124, 200, 894, 969, 970, 976, 919, 858, 674, 590, 522, 532, 521, 645, 816, 891, 0]
[308, 251, 259, 256, 311, 371, 463, 550, 677, 677, 674, 615, 588, 541, 630, 636, 640, 644, 702, 853, 916, 965, 965, 909, 857, 791, 711, 646, 572, 688, 788, 905, 982, 1042, 1054, 1092, 1056, 1007, 963, 894, 841, 757, 807, 785, 713, 632, 497, 343, 243, 139, 128, 121, 210, 314, 0, 0, 0, 0, 0]

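You can see that, by eye, obj2[5] looks more like obj1[1]. But obj2[5] and obj1[5] have the same number of points, so their zero padding lines up and the trailing positions add nothing to the KNN distance, which keeps the total small. obj2[5] and obj1[1] differ by four points, and the values at the end are large, so those positions produce a very large distance. Although the other values are close and contribute little, the total distance ends up larger and the answer is wrong.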
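
The effect can be reproduced with toy numbers (invented for this illustration, not taken from the fonts). The query below is glyph A with one point missing, yet after zero padding it lands closer to the unrelated glyph B, simply because B has the same number of points.

```python
import math

def pad(values, length):
    """Right-pad a list with zeros up to the given length."""
    return values + [0] * (length - len(values))

def distance(a, b):
    """Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

glyph_a = [10, 20, 30, 40, 50]   # 5 points
glyph_b = [30, 10, 60, 30]       # 4 points, different shape
query = [10, 30, 40, 50]         # glyph_a with its second point dropped

length = max(len(glyph_a), len(glyph_b), len(query))
dist_a = distance(pad(query, length), pad(glyph_a, length))
dist_b = distance(pad(query, length), pad(glyph_b, length))
print(dist_a > dist_b)  # True
```

Dropping one point shifts every later value by one position, and the last real value of glyph A ends up compared against a padding zero, which dominates the sum.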

Full code:

import requests
from bs4 import BeautifulSoup
import random
from urllib import parse
import re
from fontTools.ttLib import TTFont
import numpy as np

def downloader():
    url = "https://maoyan.com/films/344264"
    headers_pool = [
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; XH; rv:8.578.498) fr, Gecko/20121021 Camino/8.723+ (Firefox compatible)',
        'Mozilla/5.0 (X11; U; Linux i686; nl; rv:1.8.1b2) Gecko/20060821 BonEcho/2.0b2 (Debian-1.99+2.0b2+dfsg-1)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; Avant Browser; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
        'Mozilla/5.0 (X11; U; UNICOS lcLinux; en-US) Gecko/20140730 (KHTML, like Gecko, Safari/419.3) Arora/0.8.0',
        'Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)',
        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)'
    ]
    headers = random.choice(headers_pool)

    res = requests.get(url=url,headers={"User-Agent": headers})
    with open("maoyan.html","wb") as f:
        f.write(res.content)

    soup = BeautifulSoup(res.text,"html.parser")
    stone_font = soup.select("style")[0]
    url = parse.urljoin("https://", re.search(r"//[^\s)]+?\.woff", str(stone_font)).group())
    res = requests.get(url,headers={"User-Agent": headers})
    with open("font2.woff", "wb") as f:
        f.write(res.content)

def knn(trainData, testData):
    rowSize = trainData.shape[0]
    diff = np.tile(testData, (rowSize, 1)) - trainData
    sqrDiff = diff ** 2
    sqrDiffSum = sqrDiff.sum(axis=1)
    distances = sqrDiffSum ** 0.5
    sortDistance = distances.argsort()
    return sortDistance[0]

def prepare():
    obj1 = TTFont("font.woff")
    obj1_uni = obj1.getGlyphOrder()[2:]
    obj1_points = [0] * len(obj1_uni)
    for i in range(len(obj1_uni)):
        obj1_points[i] = list(obj1['glyf'][obj1_uni[i]].coordinates)
        for index in range(len(obj1_points[i])):
            obj1_points[i][index] = sum(obj1_points[i][index])

    obj2 = TTFont("font2.woff")
    obj2_uni = obj2.getGlyphOrder()[2:]
    obj2_points = [0] * len(obj2_uni)
    for i in range(len(obj2_uni)):
        obj2_points[i] = list(obj2['glyf'][obj2_uni[i]].coordinates)
        for index in range(len(obj2_points[i])):
            obj2_points[i][index] = sum(obj2_points[i][index])

    len_max = 0
    for i in range(len(obj2_uni)):
        if len(obj2_points[i]) > len_max: len_max = len(obj2_points[i])
    for i in range(len(obj1_uni)):
        if len(obj1_points[i]) > len_max: len_max = len(obj1_points[i])
    for i in range(len(obj2_uni)):
        obj2_points[i].extend([0] * (len_max - len(obj2_points[i])))
    for i in range(len(obj1_uni)):
        obj1_points[i].extend([0] * (len_max - len(obj1_points[i])))

    return obj1_points,obj2_points

def create_dic(obj1_points,obj2_points):
    base_woff = TTFont("font.woff")
    base_uni_list = base_woff.getGlyphOrder()[2:]
    base_val_list = ["0", "9", "8", "1", "6", "3", "5", "2", "4", "7"]
    # base_dic = dict(zip(base_uni_list, base_val_list))

    current_woff = TTFont("font2.woff")
    current_uni_list = current_woff.getGlyphOrder()[2:]

    mapping_dic = {}
    a = np.array(obj1_points)
    for i in range(len(obj2_points)):
        b = np.array(obj2_points[i])
        index = knn(a,b)
        mapping_dic[current_uni_list[i]] = base_val_list[index]
    return mapping_dic

def change_html():
    pass

if __name__ == "__main__":
    downloader()
    obj1_points, obj2_points = prepare()
    mapping_dic = create_dic(obj1_points,obj2_points)
    print(mapping_dic)

This time it was correct again. The result:

{'uniE8DB': '4', 'uniF691': '5', 'uniE101': '9', 'uniF072': '7', 'uniEDAF': '3', 'uniEBB2': '0', 'uniE826': '6', 'uniEF4E': '1', 'uniF574': '8', 'uniEEEC': '2'}
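
With such a mapping in hand, the obfuscated references in the raw HTML can be replaced by real digits. The full code above leaves change_html as a stub, so the function below is only one way it might be filled in. The sample string (including the stonefont class name) is invented, and the sketch assumes the references appear in the page source as &#x....; with the same hex value as the glyph name.

```python
import re

def decode_entities(raw_html, mapping):
    """Replace &#xXXXX; references with digits using a glyph-name -> digit mapping."""
    def replace(match):
        glyph_name = "uni" + match.group(1).upper()
        # Leave the reference untouched if the glyph is not in the mapping
        return mapping.get(glyph_name, match.group(0))
    return re.sub(r"&#x([0-9a-fA-F]+);", replace, raw_html)

mapping_dic = {'uniE8DB': '4', 'uniF691': '5', 'uniE101': '9', 'uniF072': '7',
               'uniEDAF': '3', 'uniEBB2': '0', 'uniE826': '6', 'uniEF4E': '1',
               'uniF574': '8', 'uniEEEC': '2'}
sample = '<span class="stonefont">&#xe101;.&#xeeec;</span>'
decoded = decode_entities(sample, mapping_dic)
print(decoded)  # <span class="stonefont">9.2</span>
```

This has to run on the raw response text, since an HTML parser would already have turned the references into characters.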

Very satisfying, but it cannot be guaranteed to succeed 100% of the time.

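So the most important problem is still that the same glyph can have a different number of points, and I have not come up with a suitable solution. If you can solve it, please leave a comment. Advice is welcome.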
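
One direction that might be worth trying (a suggestion only, not tested against Maoyan's fonts) is to resample every point list to the same fixed length by linear interpolation instead of zero padding, so that lists of different lengths stay roughly aligned. The toy data and function names below are invented for illustration.

```python
import math

def resample(values, n):
    """Linearly interpolate a list of numbers to exactly n values (n >= 2)."""
    if len(values) == 1:
        return [float(values[0])] * n
    out = []
    for k in range(n):
        pos = k * (len(values) - 1) / (n - 1)   # fractional index into values
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

def distance(a, b):
    """Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

glyph_a = [10, 20, 30, 40, 50]
glyph_b = [30, 10, 60, 30]
query = [10, 30, 40, 50]          # glyph_a with one point dropped

n = 5
dist_a = distance(resample(query, n), resample(glyph_a, n))
dist_b = distance(resample(query, n), resample(glyph_b, n))
print(dist_a < dist_b)  # True
```

Real glyphs have several contours and may start at different points, so this would probably need to be applied per contour and checked against more than one downloaded font before being trusted.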

References: https://www.jianshu.com/p/79c4272c0969
https://blog.csdn.net/weixin_41861700/article/details/103108239
https://www.cnblogs.com/lyuzt/p/10471617.html
