[Keras] Understanding Bat Through a Keras Pull Request

Author: ItchyHiker | Published 2019-01-23 19:30


This article is intended for readers who already have some background in Keras and deep learning.

BatchNormalization is a commonly used technique for training deep neural networks, proposed by Google in 2015: https://arxiv.org/pdf/1502.03167.pdf.
In summary, BatchNormalization has the following advantages:

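It can reduce overfitting and, to some extent, reduce the need for Dropout
It speeds up training
It allows larger learning rates to be used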

How BatchNormalization works

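As we all know, deep learning models are trained with stochastic gradient updates one mini-batch at a time, so that each update does not require computing gradients over the entire dataset. Batch normalization works on the same mini-batches:
Suppose the input batch contains m examples. We compute the mean and variance of those m examples, normalize the input with these statistics, and then scale and shift the normalized input $\hat{x_i}$ with $\gamma$ and $\beta$, which are learnable parameters. In other words, a batch processed by batch normalization is affected not only by the mean and variance of the current batch but also by the previously seen training data, because that earlier training shaped $\gamma$ and $\beta$.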
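
For reference, the steps just described can be written out as follows. This is a sketch in the notation of the paragraph above; $\mu_B$ and $\sigma_B^2$ are names assumed here for the batch mean and variance, and $\epsilon$ is the small constant added to the variance for numerical stability (the `epsilon` argument in Keras).

$$
\begin{aligned}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \\
\hat{x_i} &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
y_i &= \gamma \hat{x_i} + \beta
\end{aligned}
$$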
The original paper makes the same point about dependence on the batch:

The BN transform can be added to a network to manipulate any activation. In the notation y = BNγ,β(x), we indicate that the parameters γ and β are to be learned, but it should be noted that the BN transform does not independently process the activation in each training example. Rather, BNγ,β(x) depends both on the training example and the other examples in the mini-batch.
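
To make the quoted point concrete, here is a minimal NumPy sketch of the training-mode transform. It is not the Keras implementation; the function name `batchnorm_train` and the toy batches are assumptions made for illustration. The same example is placed in two different mini-batches and comes out with different normalized values.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, epsilon=1e-3):
    """Training-mode batch normalization over axis 0 (the batch axis)."""
    mean = x.mean(axis=0)                         # per-feature batch mean
    var = x.var(axis=0)                           # per-feature (biased) batch variance
    x_hat = (x - mean) / np.sqrt(var + epsilon)   # normalize with batch statistics
    return gamma * x_hat + beta                   # learnable scale and shift

gamma = np.ones(2)   # initial value of gamma ('ones')
beta = np.zeros(2)   # initial value of beta ('zeros')

# The first row is the same example in both (hypothetical) batches.
batch_a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
batch_b = np.array([[1.0, 2.0], [30.0, 40.0], [50.0, 60.0]])

out_a = batchnorm_train(batch_a, gamma, beta)[0]
out_b = batchnorm_train(batch_b, gamma, beta)[0]

print("same example, batch A:", out_a)
print("same example, batch B:", out_b)
```

The two printed rows differ even though the input example is identical, because the mean and variance are taken over whichever batch the example happens to be in.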


Parameters of BatchNormalization in Keras:

keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)

Arguments

axis: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format="channels_first", set axis=1 in BatchNormalization.
momentum: Momentum for the moving mean and the moving variance.
epsilon: Small float added to variance to avoid dividing by zero.
center: If True, add offset of beta to normalized tensor. If False, beta is ignored.
scale: If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
beta_initializer: Initializer for the beta weight.
gamma_initializer: Initializer for the gamma weight.
moving_mean_initializer: Initializer for the moving mean.
moving_variance_initializer: Initializer for the moving variance.
beta_regularizer: Optional regularizer for the beta weight.
gamma_regularizer: Optional regularizer for the gamma weight.
beta_constraint: Optional constraint for the beta weight.
gamma_constraint: Optional constraint for the gamma weight.

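Using BatchNormalization to normalize the inputs and outputs inside the network helps mitigate vanishing or exploding gradients and makes the network more robust, in much the same way that normalizing the inputs to the network does.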

The BatchNormalization implementation in Keras

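Coming back to the title: what is different about BatchNormalization in Keras?
We know that during training, batch normalization normalizes the input with batch statistics. At test time, however, there may be only a single sample, so where do the mean and variance come from?
At test time the layer uses an exponential moving average over all the mini-batches seen during training (I will not go into details here; see Ng's course). It can be viewed as an estimate, based on all the earlier data, of the mean and variance to apply to the current test sample.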
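
As a rough illustration of that moving average, the sketch below accumulates the statistics over simulated training batches and then normalizes a single test sample with them. It uses the `momentum` and `epsilon` defaults from the Keras signature above; the synthetic data and the function name `batchnorm_inference` are assumptions for illustration only.

```python
import numpy as np

momentum, epsilon = 0.99, 0.001   # defaults from the Keras signature

rng = np.random.default_rng(0)
moving_mean = np.zeros(3)   # moving_mean_initializer='zeros'
moving_var = np.ones(3)     # moving_variance_initializer='ones'

# Simulated training: each step blends the batch statistics into the moving ones.
for _ in range(2000):
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean(axis=0)
    moving_var = momentum * moving_var + (1 - momentum) * batch.var(axis=0)

def batchnorm_inference(x, gamma=1.0, beta=0.0):
    """Inference-mode normalization: uses the stored moving statistics only."""
    return gamma * (x - moving_mean) / np.sqrt(moving_var + epsilon) + beta

# A single test sample can be normalized without any batch statistics.
single = np.array([[5.0, 5.0, 5.0]])
out = batchnorm_inference(single)
print("moving_mean:", moving_mean)
print("moving_var: ", moving_var)
print("normalized single sample:", out)
```

With a momentum of 0.99 the moving statistics change slowly, so they reflect roughly the last hundred or so batches rather than any single one.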

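Keras uses the same approach in inference (predict) mode. The issue raised by this PR is that the interface Keras provides causes serious problems in transfer learning: many people see a very large gap between their accuracy on the training set and on the test set.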

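Transfer learning is typically used when we have too little data of our own: we start from a model someone else has trained and fine-tune its parameters on our own data. The problem the original model solves is not identical to ours but similar, and this works because the first few layers of a trained model tend to detect generic features such as edges and corners.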

In transfer learning we freeze the layers that are already trained and update parameters only in the newly added layers. In Keras we usually freeze layers with code like this:

# base_model is the pre-trained model being fine-tuned
for layer in base_model.layers:
    layer.trainable = False


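This is where the problem lies. During fine-tuning, a BatchNormalization layer with trainable=False still normalizes with the mean and variance computed from the mini-batches of the new data, whereas at inference time after fine-tuning it uses the moving averages of the mean and variance accumulated on the original data. In short, during fine-tuning the statistics of a trainable=False batch normalization layer come from the new data (your own samples), while at inference they reflect the samples the original author trained on. This gap leads to a large difference between training accuracy and test accuracy after fine-tuning, and an issue about it has also been filed on GitHub.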
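
The mismatch can be reproduced numerically without Keras. The following is a simplified sketch, assuming a single feature, gamma = 1 and beta = 0, stored statistics of mean 0 and variance 1, and a new dataset with a different distribution; all of these values are made up for illustration.

```python
import numpy as np

epsilon = 0.001
rng = np.random.default_rng(0)

# Hypothetical statistics stored in a frozen BN layer by the original training run.
moving_mean, moving_var = 0.0, 1.0

# Hypothetical activations produced by the new dataset (different distribution).
new_batch = rng.normal(loc=3.0, scale=2.0, size=(256, 1))

# Fine-tuning (training mode): normalized with the new batch's own statistics.
train_out = (new_batch - new_batch.mean(axis=0)) / np.sqrt(new_batch.var(axis=0) + epsilon)

# Inference: normalized with the stored moving statistics from the original data.
infer_out = (new_batch - moving_mean) / np.sqrt(moving_var + epsilon)

print("training mode:  mean=%.2f std=%.2f" % (train_out.mean(), train_out.std()))
print("inference mode: mean=%.2f std=%.2f" % (infer_out.mean(), infer_out.std()))
```

The layers after the frozen BatchNormalization layer are fine-tuned on inputs that look like `train_out` but are then evaluated on inputs that look like `infer_out`, which is one way to picture the accuracy gap.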

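So what is the right solution?
During fine-tuning, a BatchNormalization layer with trainable=False should normalize with the statistics computed on the original data, so that training and testing behave consistently. At the time of writing, Keras does not appear to support this behavior, which is what set off the debate on the pull request page.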
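
The behavior described here can be sketched as a toy layer. This is not the code from the PR and not Keras code; `SketchBatchNorm` is a hypothetical class written only to show the idea, with gamma and beta omitted for brevity. When the layer is frozen it uses the stored statistics in both phases and does not update them.

```python
import numpy as np

class SketchBatchNorm:
    """Hypothetical illustration of the proposed behavior (not Keras code)."""

    def __init__(self, num_features, momentum=0.99, epsilon=0.001):
        self.momentum = momentum
        self.epsilon = epsilon
        self.trainable = True
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, training):
        if training and self.trainable:
            # Normal training: use batch statistics and update the moving ones.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference, or a frozen layer during fine-tuning: stored statistics only.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.epsilon)

layer = SketchBatchNorm(num_features=2)
layer.trainable = False                        # freeze, as in fine-tuning
x = np.array([[10.0, 20.0], [30.0, 40.0]])

out_train = layer(x, training=True)            # fine-tuning phase
out_infer = layer(x, training=False)           # inference phase
print(np.array_equal(out_train, out_infer))
```

Because the frozen layer ignores the `training` flag, the downstream layers see the same normalization during fine-tuning and at inference.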

The author of the PR also ran comparison experiments on his blog; readers who want the details can refer to it.

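I had not noticed this problem before. Working through it gave me a deeper understanding of BatchNormalization and made me more careful about how I use Keras. It is also a side effect of how much Keras hides behind its abstractions, and I may consider moving to TensorFlow or PyTorch in the future. I hope this is useful to readers.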