应对字体加密的反爬

作者: WangLane | 来源:发表于2019-05-22 18:17 被阅读0次

应对字体加密的反爬
爬取58同城，解决反爬字体加密解析问题
python爬虫_从零开始破解js加密（一）
Python爬取猫眼电影：破解字体反爬
Python爬虫实例：爬取猫眼电影——破解字体反爬
Python爬虫杂记 - 字体文件反爬（二）
Python爬虫——学习字体反爬获取某招聘信息
保姆级反爬教学，JS逆向实现字体反爬
字体反爬的解决方案
字体反爬

快手页面

https://live.kuaishou.com/profile/maomei527

字体反爬

我们F12打开Dev Tools，在response中找到数据的部分可以看到

对应网页中的粉丝数部分：

页面返回的是这样的东西

显然，是字体加密。

分析

我们直接搜索woff，直接找找看有没有字体文件。然后就在页面中找到了一个链接，我们把这个woff文件下载到本地，用fontcreater打开，可以看到

考虑到每次字体的关系都是动态加载的，我们要写一个解析的模块。

这里我们使用fonttools模块，pip安装一下

pip install fonttools

我们把刚刚下载的字体文件转成xml文档，这样方便查看

from fontTools.ttLib import TTFont

font = TTFont('fonts.woff')
font.saveXML('fonts.xml')

打开xml我们在glyf标签下找到映射关系：

下载几个字体文件，通过对比glyf内容发现，字体的绘制数据是不变的，就是说，我们只需要保存glyf绘制对应的数字就可以，然后根据每次网站返回的woff字体文件的glyf映射去判断是哪个数字就可以了。看了下每个数字对应的最大最小坐标，发现是没有四个值都重复的。

所以，简单的判断一下就行了。

fonttools模块

建议以下的内容开个ipython，方便看到输出

导入

>>> from fontTools.ttLib import TTFont
>>> font = TTFont('fonts.woff')

>>> font
<fontTools.ttLib.ttFont.TTFont at 0x19c3a4e7b70>

>>> glyf = font.get('glyf')
>>> glyf
<'glyf' table at 19c3a567e80>

>>> glyf.glyphs
{'.notdef': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a567b00>,
 'glyph1': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a567da0>,
 'nonmarkingreturn': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a56b780>,
 'uni0001': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cda20>,
 'space': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cd828>,
 'uniABCE': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cd400>,
 'uniACCD': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cd080>,
 'uniAEDA': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cde80>,
 'uniAEFE': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cd0f0>,
 'uniAFED': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cdd68>,
 'uniBAAA': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cde48>,
 'uniBDDD': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cdbe0>,
 'uniBFAD': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cd5f8>,
 'uniBFAE': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cdb00>,
 'uniC44F': <fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cd358>}

>>> c = glyf.glyphs.get('uniACCD')
>>> c
<fontTools.ttLib.tables._g_l_y_f.Glyph at 0x19c3a4cd080>

>>> c.getCoordinates(glyf)
(GlyphCoordinates([(467, 670),(514, 608),(514, 542),(425, 542),(417, 584),(400, 608),(367, 653),(301, 653),(226, 653),(182, 584),(137, 514),(132, 384),(163, 429),(210, 452),(253, 472),(306, 472),(396, 472),(463, 415),(530, 357),(530, 243),(530, 145),(467, 70),(403, -5),(285, -5),(185, -5),(112, 71),(39, 147),(39, 328),(39, 461),(71, 554),(134, 732),(300, 732),(420, 732),(400, 123),(435, 170),(435, 235),(435, 290),(404, 340),(372, 390),(289, 390),(231, 390),(187, 352),(143, 313),(143, 235),(143, 167),(183, 121),(223, 75),(293, 75),(364, 75)]),
 [32, 48],
 array('B', [1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]))

>>> c.yMax
732

>>> c.xMax
530

>>> c.yMin
-5

>>> c.xMin
35

撸代码

这四个值直接保存下来，每次对比就知道是哪个数字了。

for i in glyf.keys():
     if i.startswith('uni'):
         c = glyf[i]
         print('{} ({},{},{},{})'.format(i, c.yMax, c.xMax, c.yMin, c.xMin))
"""
uni0001 (0, 0, 0, 0)
uniABCE (731, 536, 13, 26)
uniACCD (732, 530, -5, 39)
uniAEDA (731, 525, -7, 33)
uniAEFE (729, 526, -6, 32)
uniAFED (730, 525, -6, 25)
uniBAAA (717, 526, -5, 33)
uniBDDD (726, 363, 13, 98)
uniBFAD (732, 527, 13, 32)
uniBFAE (730, 521, -7, 37)
uniC44F (717, 536, 13, 38)
"""

然后对照fontcreator中的数字，写个映射字典，顺便把程序也写出来。

import os
import requests
import re
import json
from fontTools.ttLib import TTFont

font_map = {
    (0, 0, 0, 0): ' ',
    (729, 526, -6, 32): '0',
    (726, 363, 13, 98): '1',
    (732, 527, 13, 32): '2',
    (730, 525, -6, 25): '3',
    (731, 536, 13, 26): '4',
    (717, 526, -5, 33): '5',
    (732, 530, -5, 39): '6',
    (717, 536, 13, 38): '7',
    (731, 525, -7, 33): '8',
    (730, 521, -7, 37): '9',
}


def decrypt_font(charater, font):
    s = charater.encode('unicode_escape').decode().strip('\\').upper().strip('U')
    res = font_map.get(s)
    return res if res else charater


def create_mapping(font_file):
    """ 打开字体文件并创建字符和数字之间的映射. """
    # 打开字体文件，加载glyf
    font = TTFont(font_file)
    glyf = font.get('glyf')
    current_map = {}
    # 创建当前字体文件的数字映射
    for i in glyf.keys():
        # 忽略不是uni开头的字符
        if not i.startswith('uni'):
            continue
        c = glyf[i]
        number = font_map.get((c.yMax, c.xMax, c.yMin, c.xMin))
        # 发现有字符不在已有的集合中, 抛出异常.
        if number is None:
            raise Exception
        current_map[i.strip('uni')] = number
    print(json.dumps(current_map, indent=4))
    return current_map


def decrypt_str(s):
    res = ''
    for c in s:
        res = res + decrypt_font(c)
    return res


def get_mapping(page):
    m = re.search('(http.*?.woff)', page)
    if m:
        woff_link = m.group(1)
        headers1 = {
            'Host': "static.yximgs.com",
            'Connection': "keep-alive",
            'Upgrade-Insecure-Requests': "1",
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
            'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            'Accept-Encoding': "gzip, deflate, br",
            'Accept-Language': "zh-CN,zh;q=0.9",
            'Cache-Control': "no-cache",
            'Postman-Token': "0d97a0b9-6e4d-4eb4-8192-8cf365f77ef6,ca4b1512-7916-4238-8c94-7b1afb3fad56",
            'cache-control': "no-cache"
        }
        woff_res = requests.get(woff_link, headers=headers1)
        file_name = woff_link.split('/')[-1]
        with open(file_name, 'wb') as f:
            f.write(woff_res.content)
        mapping = create_mapping(file_name)
        os.remove(file_name)
        return mapping


def get_page(url):
    headers = {
        'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        'Accept-Encoding': "gzip, deflate, br",
        'Accept-Language': "zh-CN,zh;q=0.9",
        'Cache-Control': "max-age=0",
        'Connection': "keep-alive",
        'Host': "live.kuaishou.com",
        'Referer': "https://live.kuaishou.com/search/?keyword=%23%E7%BE%8E%E9%A3%9F",
        'Upgrade-Insecure-Requests': "1",
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
        'Postman-Token': "9c4df31c-8cbd-490e-8625-0acc68aaf99f,a582159e-df9e-4d84-8788-928e982783f0",
        'cache-control': "no-cache"
    }
    r = requests.get(url, headers=headers)
    return r

if __name__ == '__main__':
    url = 'https://live.kuaishou.com/profile/maomei527'
    r = get_page(url)
    get_mapping(r.text)

对照一下字体文件的内容看看是否正确：

全部代码

加上自动下载字体和字体转换.
测试中发现还会出现另外一套映射，对比之前的映射发现只是y轴上进行了13个单位的位移，于是加了一个x，y轴的位移判断，最终的程序如下：

import os
import requests
import re
import json
from fontTools.ttLib import TTFont

font_map = {
    (0, 0, 0, 0): ' ',
    (729, 526, -6, 32): '0',
    (726, 363, 13, 98): '1',
    (732, 527, 13, 32): '2',
    (730, 525, -6, 25): '3',
    (731, 536, 13, 26): '4',
    (717, 526, -5, 33): '5',
    (732, 530, -5, 39): '6',
    (717, 536, 13, 38): '7',
    (731, 525, -7, 33): '8',
    (730, 521, -7, 37): '9',
}


def decrypt_font(charater, mapping):
    """ 解密单个字符，如果可以解密就输出解密后的数字，否则原样返回"""
    s = charater.encode('unicode_escape').decode().strip('\\').upper().strip('U')
    res = mapping.get(s)
    return res if res else charater


def get_number_offset(c, max_offset=20):
    """ 根据偏移量计算映射出来的数字 """
    number = None
    # i, j 分别代表y和x的偏移量
    for i in range(max_offset+1):
        for j in range(max_offset+1):
            # 正向偏移
            number = font_map.get((c.yMax+i, c.xMax+j, c.yMin+i, c.xMin+j))
            if number:
                # print('offset x:{} y:{}'.format(i,j))
                return number
            # 负向偏移
            number = font_map.get((c.yMax-i, c.xMax-j, c.yMin-i, c.xMin-j))
            if number:
                # print('offset x:{} y:{}'.format(i,j))
                return number
    return number


def create_mapping(font_file):
    """ 打开字体文件并创建字符和数字之间的映射. """
    # 打开字体文件，加载glyf
    font = TTFont(font_file)
    glyf = font.get('glyf')
    current_map = {}
    # 创建当前字体文件的数字映射
    for i in glyf.keys():
        # 忽略不是uni开头的字符
        if not i.startswith('uni'):
            continue
        c = glyf[i]
        number = get_number_offset(c)
        # 发现有字符不在已有的集合中, 抛出异常.
        if number is None:
            print((c.yMax, c.xMax, c.yMin, c.xMin))
            raise Exception
        current_map[i.strip('uni')] = number
    print(json.dumps(current_map, indent=4))
    return current_map


def decrypt_str(s, mapping):
    """ 解密字符串， 不需要解密的部分原样返回 """
    res = ''
    for c in s:
        res = res + decrypt_font(c, mapping)
    return res

def get_mapping(page):
    m = re.search('(http.*?.woff)', page)
    if m:
        woff_link = m.group(1)
        headers1 = {
            'Host': "static.yximgs.com",
            'Connection': "keep-alive",
            'Upgrade-Insecure-Requests': "1",
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
            'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            'Accept-Encoding': "gzip, deflate, br",
            'Accept-Language': "zh-CN,zh;q=0.9",
            'Cache-Control': "no-cache",
            'Postman-Token': "0d97a0b9-6e4d-4eb4-8192-8cf365f77ef6,ca4b1512-7916-4238-8c94-7b1afb3fad56",
            'cache-control': "no-cache"
        }
        woff_res = requests.get(woff_link, headers=headers1)
        file_name = woff_link.split('/')[-1]
        with open(file_name, 'wb') as f:
            f.write(woff_res.content)
        mapping = create_mapping(file_name)
        os.remove(file_name)
        return mapping


def get_page(url):
    headers = {
        'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        'Accept-Encoding': "gzip, deflate, br",
        'Accept-Language': "zh-CN,zh;q=0.9",
        'Cache-Control': "max-age=0",
        'Connection': "keep-alive",
        'Host': "live.kuaishou.com",
        'Referer': "https://live.kuaishou.com/search/?keyword=%23%E7%BE%8E%E9%A3%9F",
        'Upgrade-Insecure-Requests': "1",
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
        'Postman-Token': "9c4df31c-8cbd-490e-8625-0acc68aaf99f,a582159e-df9e-4d84-8788-928e982783f0",
        'cache-control': "no-cache"
    }
    r = requests.get(url, headers=headers)
    return r


if __name__ == '__main__':
    url = 'https://live.kuaishou.com/profile/maomei527'
    r = get_page(url)
    for i in range(5):
        try:
            mapping = get_mapping(r.text)
            break
        except:
            pass
    raw_s = re.search('"fan"\s*:\s*"(.*?)"', r.text).group(1)
    print(raw_s)
    print(decrypt_str(raw_s, mapping))

"""
{
    "0001": " ",
    "ABCF": "4",
    "ACED": "3",
    "AEDD": "8",
    "AEDE": "0",
    "AFCD": "6",
    "BDAA": "5",
    "BDCD": "1",
    "BFAD": "9",
    "CCDA": "2",
    "CFBE": "7"
}
쳚곭꿍곭.뷍w
2363.1w
"""

"""
{
    "0001": " ",
    "ABCB": "4",
    "ACCD": "3",
    "ACDA": "0",
    "AEFF": "8",
    "AFBB": "6",
    "BDCA": "1",
    "BDCC": "5",
    "BFEF": "9",
    "CCAA": "2",
    "CFBA": "7"
}
첪곍꾻곍.뷊w
2363.1w
"""

应对字体加密的反爬
快手页面 https://live.kuaishou.com/profile/maomei527 字体反爬我们F...
爬取58同城，解决反爬字体加密解析问题
【导语】我们在爬取数据中，会遇到字体乱码的下，其实是字体加密，本篇文章主要解决字体解密这种反爬方式。 1.在浏览器...
python爬虫_从零开始破解js加密（一）
除了一些类似字体反爬之类的奇淫技巧，js加密应该是反爬相当常见的一部分了，这也是一个分水岭，我能解决基本js加密的...
Python爬取猫眼电影：破解字体反爬
字体反爬字体反爬也就是自定义字体反爬，通过调用自定义的字体文件来渲染网页中的文字，而网页中的文字不再是文字，而是...
Python爬虫实例：爬取猫眼电影——破解字体反爬
字体反爬字体反爬也就是自定义字体反爬，通过调用自定义的字体文件来渲染网页中的文字，而网页中的文字不再是文字，而是...
Python爬虫杂记 - 字体文件反爬（二）
字体文件反爬在搞定静态字库反爬之后，可以解决部分字体文件的反爬，但动态字文件反爬是解决不掉的。此文章就是为解...
Python爬虫——学习字体反爬获取某招聘信息
网站的反爬措施有很多，例如：js反爬、ip反爬、css反爬、字体反爬、验证码反爬、滑动点击类验证反爬等等，今天我们...
保姆级反爬教学，JS逆向实现字体反爬
大家好，我是查理~网站的反爬措施有很多，例如：js反爬、ip反爬、css反爬、字体反爬、验证码反爬、滑动点击类验证...
字体反爬的解决方案
现在许多网站采用字体反爬策略，即替换一些字符的unicode编码并且将生成的字体文件加密后传输到前端，由前端解析并...
字体反爬
打开目标网站，目的是获取图片中的数据分析可知，除了评论数在html文档中，其他数据都可以从模拟发送ajax请求...