title: Top 100 cited deep learning papers: ReLU and PReLU
Paper: https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf
ICCV 2015.
Tremendous improvements in recognition performance have come mainly from advances in two technical directions:
- More powerful models
  - depth
  - width
  - smaller strides / nonlinear activations
- Effective strategies against overfitting
  - regularization
  - data augmentation
1.Problems
Deep networks tend to be more difficult to train. So given a deep network, how do we train it, how do we make it perform well, and how do we make it converge faster?
2.Solution
In this paper, the authors address these problems from two aspects: (1) a more adaptive activation function (PReLU); (2) a proper initialization method for rectifier networks.
2.1 PReLU
PReLU (Parametric ReLU) is defined as:
$$
f(y_i) = \begin{cases} y_i, & y_i > 0 \\ a_i y_i, & y_i \le 0 \end{cases}
$$
where $y_i$ is the input of the activation on the $i$-th channel and $a_i$ is a learnable coefficient controlling the slope of the negative part (ReLU is the special case $a_i = 0$; Leaky ReLU fixes $a_i$ to a small constant). The above equation is equivalent to:
$$
f(y_i) = \max(0, y_i) + a_i \min(0, y_i)
$$
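As a sanity check of the two equivalent forms above, here is a minimal NumPy sketch (the function name and sample values are illustrative, not from the paper):

```python
import numpy as np

def prelu(y, a):
    """PReLU: f(y) = y for y > 0, a*y otherwise.
    Equivalently f(y) = max(0, y) + a * min(0, y).
    `a` may be a scalar (channel-shared) or a per-channel array."""
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)

y = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(y, 0.0))   # [ 0.     0.     0.     1.5  ]  -> a = 0 recovers ReLU
print(prelu(y, 0.25))  # [-0.5   -0.125  0.     1.5  ]  -> leaky negative slope
```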
Table 1 in the paper shows the learned coefficients of PReLU for each layer. Two interesting phenomena can be observed.
First, the first conv layer (conv1) has coefficients (0.681 and 0.596) significantly greater than 0. As the filters of conv1 are mostly Gabor-like filters such as edge or texture detectors, the learned results show that both positive and negative responses of the filters are respected.
Second, for the channel-wise version, the deeper conv layers in general have smaller coefficients. This implies that the activations gradually become "more nonlinear" at increasing depths. In other words, the learned model tends to keep more information in earlier stages and becomes more discriminative in deeper stages.
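For reference, both the channel-shared and channel-wise variants compared in Table 1 map directly onto `nn.PReLU` in PyTorch; a minimal sketch, assuming a layer with 64 channels (the shapes are illustrative):

```python
import torch
import torch.nn as nn

# channel-shared PReLU: a single learnable coefficient a for the whole layer
shared = nn.PReLU(num_parameters=1, init=0.25)

# channel-wise PReLU: one learnable coefficient a_i per channel (64 here)
channel_wise = nn.PReLU(num_parameters=64, init=0.25)

x = torch.randn(8, 64, 32, 32)                 # (batch, channels, height, width)
print(shared(x).shape, channel_wise(x).shape)  # both keep the input shape
print(sum(p.numel() for p in channel_wise.parameters()))  # 64 extra parameters
```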
3. Initialization of Filter Weights for Rectifiers
Rectifier networks are easier to train, but a bad initialization can still hamper the learning of a highly non-linear system. For each conv layer, the response is:
$$
y_l = W_l x_l + b_l
$$
where $x_l$ is a $k^2c$-by-1 vector that represents co-located $k \times k$ pixels in $c$ input channels, $k$ is the kernel size, $c$ is the number of channels, and $n_l = k^2 c$ is the number of connections of a response. If the elements of $W_l$ are mutually independent, share the same distribution, and are independent of $x_l$, then:
$$
\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l x_l]
$$
If $w_l$ has zero mean, this becomes $\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l] \, E[x_l^2]$.
If we let $w_{l-1}$ have a symmetric distribution around zero and $b_{l-1} = 0$, then $y_{l-1}$ has zero mean and a symmetric distribution around zero. For ReLU, $x_l = \max(0, y_{l-1})$, so $E[x_l^2] = \tfrac{1}{2}\mathrm{Var}[y_{l-1}]$. This leads to:
$$
\mathrm{Var}[y_l] = \tfrac{1}{2} n_l \, \mathrm{Var}[w_l] \, \mathrm{Var}[y_{l-1}]
$$
With $L$ layers put together, we have:
$$
\mathrm{Var}[y_L] = \mathrm{Var}[y_1] \left( \prod_{l=2}^{L} \tfrac{1}{2} n_l \, \mathrm{Var}[w_l] \right)
$$
This product is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially, so each factor of the product is expected to take a proper scalar (e.g., 1):
$$
\tfrac{1}{2} n_l \, \mathrm{Var}[w_l] = 1, \quad \forall l
$$
This leads to a zero-mean Gaussian distribution whose standard deviation (std) is $\sqrt{2/n_l}$, with the biases initialized to 0.
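A minimal sketch of sampling conv filter weights this way, with illustrative layer shapes (the function name and shapes are mine, not the paper's):

```python
import numpy as np

def he_normal(k, c_in, c_out, rng=None):
    """Sample conv weights from a zero-mean Gaussian with std = sqrt(2 / n_l),
    where n_l = k*k*c_in, so that (1/2) * n_l * Var[w_l] = 1 and the signal
    variance is preserved across layers."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_l = k * k * c_in
    std = np.sqrt(2.0 / n_l)
    return rng.normal(0.0, std, size=(c_out, c_in, k, k))

W = he_normal(k=3, c_in=64, c_out=128)
print(W.std())                       # empirical std of the sampled weights
print(np.sqrt(2.0 / (3 * 3 * 64)))   # target std = sqrt(2 / n_l) ~= 0.0589
```

For PReLU, the paper generalizes the condition to $\tfrac{1}{2}(1 + a^2)\, n_l \,\mathrm{Var}[w_l] = 1$, i.e. std $\sqrt{2/((1 + a^2) n_l)}$; the sketch above covers only the ReLU case derived here.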
4.Related work
Activation Functions
- Saturated
- tanh
- sigmoid
- Non-saturated
- ReLU
- ELU
- Leaky ReLU
- PReLU
Non-saturated activations alleviate the vanishing-gradient problem and make training converge faster.
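A small numeric illustration of this point, using arbitrary sample inputs: the derivatives of sigmoid and tanh shrink toward zero for large |x| (saturation), while the ReLU derivative stays at 1 for any positive input.

```python
import numpy as np

x = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # -> ~0 for large |x| (saturates)
d_tanh = 1.0 - np.tanh(x) ** 2          # -> ~0 for large |x| (saturates)
d_relu = (x > 0).astype(float)          # stays 1 for any positive input

print(d_sigmoid)  # [~4.5e-05, 0.105, 0.235, 0.105, ~4.5e-05]
print(d_tanh)     # [~8.2e-09, 0.071, 0.786, 0.071, ~8.2e-09]
print(d_relu)     # [0. 0. 1. 1. 1.]
```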
References:
[1] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
[2] https://www.cnblogs.com/everyday-haoguo/p/Note-PRelu.html
[3] https://github.com/happynear/gitbook/blob/master/DeepLearning/PReLU.md