title: Top 100 cited deep learning papers: ReLU and PReLU
Paper: https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf
ICCV 2015.
Tremendous improvements in recognition performance have come mainly from advances in two technical directions:
- More powerful models
  - depth
  - width
  - smaller strides / nonlinear activations
- Effective strategies against overfitting
  - regularization
  - data augmentation
1.Problems
Deep networks tend to be more difficult to train. So given a deep network, how do we train it, how do we make it perform well, and how do we make it converge faster?
2.Solution
In this paper, the authors address these problems from two aspects: (1) a more adaptive activation function (PReLU); (2) a proper initialization method for rectifier networks.
2.1 PReLU
PReLU (Parametric ReLU) is defined as:
$$
f(y_i) = \begin{cases} y_i, & y_i > 0 \\ a_i y_i, & y_i \le 0 \end{cases}
$$
where $y_i$ is the input of the activation on the $i$-th channel and $a_i$ is a learnable coefficient controlling the slope of the negative part (ReLU is the special case $a_i = 0$; Leaky ReLU fixes $a_i$ to a small constant). The above equation is equivalent to:
$$
f(y_i) = \max(0, y_i) + a_i \min(0, y_i)
$$
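As a sanity check of the two equivalent forms above, here is a minimal NumPy sketch (the function name and sample values are illustrative, not from the paper):

```python
import numpy as np

def prelu(y, a):
    """PReLU: f(y) = y for y > 0, a*y otherwise.
    Equivalently f(y) = max(0, y) + a * min(0, y).
    `a` may be a scalar (channel-shared) or a per-channel array."""
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)

y = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(y, 0.0))   # [ 0.     0.     0.     1.5  ]  -> a = 0 recovers ReLU
print(prelu(y, 0.25))  # [-0.5   -0.125  0.     1.5  ]  -> leaky negative slope
```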
Table 1 in the paper shows the learned coefficients of PReLU for each layer. Two interesting phenomena can be observed.
First, the first conv layer (conv1) has coefficients (0.681 and 0.596) significantly greater than 0. As the filters of conv1 are mostly Gabor-like filters such as edge or texture detectors, the learned results show that both positive and negative responses of the filters are respected.
Second, for the channel-wise version, the deeper conv layers in general have smaller coefficients. This implies that the activations gradually become "more nonlinear" at increasing depths. In other words, the learned model tends to keep more information in earlier stages and becomes more discriminative in deeper stages.
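For reference, both the channel-shared and channel-wise variants compared in Table 1 map directly onto `nn.PReLU` in PyTorch; a minimal sketch, assuming a layer with 64 channels (the shapes are illustrative):

```python
import torch
import torch.nn as nn

# channel-shared PReLU: a single learnable coefficient a for the whole layer
shared = nn.PReLU(num_parameters=1, init=0.25)

# channel-wise PReLU: one learnable coefficient a_i per channel (64 here)
channel_wise = nn.PReLU(num_parameters=64, init=0.25)

x = torch.randn(8, 64, 32, 32)                 # (batch, channels, height, width)
print(shared(x).shape, channel_wise(x).shape)  # both keep the input shape
print(sum(p.numel() for p in channel_wise.parameters()))  # 64 extra parameters
```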
3. Initialization of Filter Weights for Rectifiers
Rectifier networks are easier to train, but a bad initialization can still hamper the learning of a highly non-linear system. For each conv layer, the response is:
$$
y_l = W_l x_l + b_l
$$
where $x_l$ is a $k^2c$-by-1 vector that represents co-located $k \times k$ pixels in $c$ input channels, $k$ is the kernel size, $c$ is the number of channels, and $n_l = k^2 c$ is the number of connections of a response. If the elements of $W_l$ are mutually independent, share the same distribution, and are independent of $x_l$, then:
$$
\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l x_l]
$$
If $w_l$ has zero mean, this becomes $\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l] \, E[x_l^2]$.
If we let $w_{l-1}$ have a symmetric distribution around zero and $b_{l-1} = 0$, then $y_{l-1}$ has zero mean and a symmetric distribution around zero. For ReLU, $x_l = \max(0, y_{l-1})$, so $E[x_l^2] = \tfrac{1}{2}\mathrm{Var}[y_{l-1}]$. This leads to:
$$
\mathrm{Var}[y_l] = \tfrac{1}{2} n_l \, \mathrm{Var}[w_l] \, \mathrm{Var}[y_{l-1}]
$$
With $L$ layers put together, we have:
$$
\mathrm{Var}[y_L] = \mathrm{Var}[y_1] \left( \prod_{l=2}^{L} \tfrac{1}{2} n_l \, \mathrm{Var}[w_l] \right)
$$
This product is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially, so each factor of the product is expected to take a proper scalar (e.g., 1):
$$
\tfrac{1}{2} n_l \, \mathrm{Var}[w_l] = 1, \quad \forall l
$$
This leads to a zero-mean Gaussian distribution whose standard deviation (std) is $\sqrt{2/n_l}$, with the biases initialized to 0.
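A minimal sketch of sampling conv filter weights this way, with illustrative layer shapes (the function name and shapes are mine, not the paper's):

```python
import numpy as np

def he_normal(k, c_in, c_out, rng=None):
    """Sample conv weights from a zero-mean Gaussian with std = sqrt(2 / n_l),
    where n_l = k*k*c_in, so that (1/2) * n_l * Var[w_l] = 1 and the signal
    variance is preserved across layers."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_l = k * k * c_in
    std = np.sqrt(2.0 / n_l)
    return rng.normal(0.0, std, size=(c_out, c_in, k, k))

W = he_normal(k=3, c_in=64, c_out=128)
print(W.std())                       # empirical std of the sampled weights
print(np.sqrt(2.0 / (3 * 3 * 64)))   # target std = sqrt(2 / n_l) ~= 0.0589
```

For PReLU, the paper generalizes the condition to $\tfrac{1}{2}(1 + a^2)\, n_l \,\mathrm{Var}[w_l] = 1$, i.e. std $\sqrt{2/((1 + a^2) n_l)}$; the sketch above covers only the ReLU case derived here.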
4.Related work
Activation Functions
- Saturated
- tanh
- sigmoid
- Non-saturated
- ReLU
- ELU
- Leaky ReLU
- PReLU
Non-saturated activations alleviate the vanishing-gradient problem and make training converge faster.
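A small numeric illustration of this point, using arbitrary sample inputs: the derivatives of sigmoid and tanh shrink toward zero for large |x| (saturation), while the ReLU derivative stays at 1 for any positive input.

```python
import numpy as np

x = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # -> ~0 for large |x| (saturates)
d_tanh = 1.0 - np.tanh(x) ** 2          # -> ~0 for large |x| (saturates)
d_relu = (x > 0).astype(float)          # stays 1 for any positive input

print(d_sigmoid)  # [~4.5e-05, 0.105, 0.235, 0.105, ~4.5e-05]
print(d_tanh)     # [~8.2e-09, 0.071, 0.786, 0.071, ~8.2e-09]
print(d_relu)     # [0. 0. 1. 1. 1.]
```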
References:
[1] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
[2] https://www.cnblogs.com/everyday-haoguo/p/Note-PRelu.html
[3] https://github.com/happynear/gitbook/blob/master/DeepLearning/PReLU.md