This article draws on Google's quantization whitepaper: https://arxiv.org/pdf/1806.08342.pdf

def quantize(self, x, max, min):
    # self.n_bits is the bit-width; self.n_levels = 2 ** self.n_bits is the number of levels
    delta = (max - min) / (2 ** self.n_bits - 1)
    zero_point = (-min / delta).round()
    # we assume weight quantization is always signed
    x_int = torch.round(x / delta)
    x_quant = torch.clamp(x_int + zero_point, 0, self.n_levels - 1)  # float -> fixed point
    x_float_q = (x_quant - zero_point) * delta  # fixed point -> float (dequantize)
    return x_float_q  # return the dequantized ("fake-quantized") tensor
Above is the official asymmetric quantization formula and code, where Δ denotes the scale (step size) and N is the number of quantization levels (N = 2^8 = 256 at a bit-width of 8). In plainer terms: Δ splits the float range [min, max] into N − 1 equal steps, and the zero point z is the integer that the float value 0 maps to, so that zero is always represented exactly.
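Restating what the code above computes, with b the bit-width and N = 2^b:

\Delta = \frac{\max - \min}{2^{b} - 1}, \qquad z = \mathrm{round}\!\left(\frac{-\min}{\Delta}\right)

x_{\mathrm{quant}} = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{\Delta}\right) + z,\ 0,\ N - 1\right), \qquad \hat{x} = (x_{\mathrm{quant}} - z)\cdot\Delta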

Next, let's practice with an example:
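Below is a minimal runnable sketch; the SimpleQuantizer wrapper class and the sample values (min = -1.0, max = 2.0, 8-bit width) are hypothetical, added only to supply the self.n_bits / self.n_levels attributes the method above expects:

import torch

class SimpleQuantizer:
    # hypothetical container for the attributes used by quantize()
    def __init__(self, n_bits=8):
        self.n_bits = n_bits
        self.n_levels = 2 ** n_bits  # N = 256 for 8 bits

    def quantize(self, x, max, min):
        delta = (max - min) / (2 ** self.n_bits - 1)
        zero_point = (-min / delta).round()
        x_int = torch.round(x / delta)
        x_quant = torch.clamp(x_int + zero_point, 0, self.n_levels - 1)
        x_float_q = (x_quant - zero_point) * delta
        return x_float_q

q = SimpleQuantizer(n_bits=8)
x = torch.tensor([-1.0, 0.13, 0.6, 2.0])
x_hat = q.quantize(x, max=torch.tensor(2.0), min=torch.tensor(-1.0))
print(x_hat)  # ~[-1.0000, 0.1294, 0.6000, 2.0000]

Here Δ = 3/255 ≈ 0.01176 and z = 85, so -1.0, 0.6, and 2.0 land exactly on the quantization grid and round-trip unchanged, while 0.13 falls between grid points and comes back as ≈ 0.1294; that gap is the quantization error.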
