Deep Residual Learning for Image Recognition
The paper opens by going straight to the problems it tackles:
Problem 1: An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence from the beginning.
- This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with back-propagation.
Problem 2: When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
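The remedy the paper goes on to propose is residual learning: let the stacked layers fit a residual mapping F(x) and have the block output F(x) + x, so that an added block can fall back to an identity mapping by driving F toward zero. A minimal PyTorch sketch of such a block (the channel count, layer choices, and input shape are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """Sketch of a basic residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        # Two 3x3 convolutions with batch normalization; channel count is illustrative.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Identity shortcut: add the input back before the final ReLU,
        # so the block only has to learn the residual F(x).
        return F.relu(out + x)


# Usage on a dummy batch of 64-channel feature maps.
block = BasicResidualBlock(64)
y = block(torch.randn(4, 64, 32, 32))
print(y.shape)  # torch.Size([4, 64, 32, 32])
```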
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
BN can be regarded as a standard component of modern network architectures; for example, both Inception V4 and ResNet adopt BN.
- the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.
- While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters.
- The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, so that small changes to the network parameters amplify as the network becomes deeper.
- The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer.
- Take the tanh activation function as an example. Without normalization, when an input has a large absolute value, the input to tanh is large and tanh sits in its saturated regime; the network is then no longer sensitive to the input and essentially stops learning, with the activation output stuck at either +1 or -1.
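A quick NumPy check of this saturation effect (the sample inputs are illustrative):

```python
import numpy as np

x = np.array([0.5, 2.0, 5.0, 10.0])
y = np.tanh(x)
# The derivative of tanh is 1 - tanh(x)^2; it collapses toward 0 as |x|
# grows, so gradients back-propagated through a saturated tanh nearly vanish.
grad = 1.0 - y ** 2
for xi, yi, gi in zip(x, y, grad):
    print(f"x={xi:5.1f}  tanh(x)={yi:.4f}  d/dx tanh(x)={gi:.6f}")
# x=  0.5  tanh(x)=0.4621  d/dx tanh(x)=0.786448
# x=  2.0  tanh(x)=0.9640  d/dx tanh(x)=0.070651
# x=  5.0  tanh(x)=0.9999  d/dx tanh(x)=0.000181
# x= 10.0  tanh(x)=1.0000  d/dx tanh(x)=0.000000
```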
- Because the network's layers are applied in sequence, the statistical distribution at every layer is affected by the processing of all preceding layers. During training, distribution errors in shallow layers, caused for example by noisy samples or a poorly chosen activation function, propagate forward to the loss function, and the error is then further amplified during gradient back-propagation. This deviation of the distributions from the ideal distributions during training is called Internal Covariate Shift. The concept of Covariate Shift was introduced in statistics around 2000; the paper's authors extend the original end-to-end notion of Covariate Shift to every layer of the network, hence the name Internal Covariate Shift.
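A toy NumPy sketch of this drift (the dimensions and step size are assumptions): updating only the first layer's weights already changes the distribution that the second layer receives, even though the raw data is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: layer 2's input is layer 1's output.
x = rng.normal(size=(1000, 50))             # fixed batch of raw inputs
W1 = rng.normal(scale=0.5, size=(50, 50))   # layer-1 weights

h_before = x @ W1                           # layer-2 input before the update
W1 += 0.3 * rng.normal(size=W1.shape)       # simulate one SGD step on layer 1
h_after = x @ W1                            # layer-2 input after the update

# The raw data x never changed, yet the spread of the distribution seen by
# layer 2 has drifted; this per-layer drift is the Internal Covariate Shift
# that the paper sets out to reduce.
print("layer-2 input std before update: %.3f" % h_before.std())
print("layer-2 input std after  update: %.3f" % h_after.std())
```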
- Normalization/standardization: normalizing the data is a very important step, used to rescale the input values so that back-propagation converges better. The usual approach is to subtract the mean and then divide by the standard deviation. Without this, features with larger scales receive a larger weighting in the cost function; normalization puts all features on an equal footing with the same scale.
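A minimal sketch of this standardization (feature-wise zero mean, unit variance); the shapes and feature scales are illustrative:

```python
import numpy as np

# Two features on very different scales (std 1 vs. std 100).
X = np.random.default_rng(0).normal(loc=5.0, scale=[1.0, 100.0], size=(256, 2))

# Subtract the per-feature mean and divide by the per-feature standard
# deviation, so every feature contributes on a comparable scale.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / (std + 1e-8)   # epsilon guards against zero variance

print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]
```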
Batch Normalization: grasp the idea, know the form
Batch Normalization (batch standardization)
- As such it is advantageous for the distribution of x to remain fixed over time. Then, Θ2 does not have to readjust to compensate for the change in the distribution of x.
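The training-time transform itself can be sketched in a few lines of NumPy; gamma and beta are the learnable scale and shift from the paper, while the batch shape below is an illustrative assumption (at test time the paper replaces the mini-batch statistics with population estimates):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization over a mini-batch x of shape (N, D)."""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale/shift restore capacity

# Usage on a dummy batch: 128 examples, 4 features.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(128, 4))
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```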
Effects:
- Batch Normalization enables higher learning rates
- Batch Normalization regularizes the model
Reader comments