Top 100 Cited Deep Learning Papers: ReLU and PReLU

Author: 吐舌小狗 | Published 2018-10-23 21:45


    Paper: https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf
    ICCV 2015.

    Tremendous improvements in recognition performance are mainly due to advances in two technical directions:

    • More powerful models
      • depth
      • width
      • smaller strides/nonlinear activations
    • Effective strategies against overfitting
      • regularization
      • data augmentation

    1. Problems

    Deep networks tend to be more difficult to train. So, given a deep network, how do we train it, make it perform well, and make it converge faster?

    2. Solution

    In this paper, the authors address these problems in two ways: (1) a more adaptive activation function (PReLU); (2) a proper initialization method for rectifier networks.

    2.1 PReLU

    The PReLU activation is defined as

    f(y_i) = \begin{cases} y_i, & \text{if } y_i > 0 \\ a_i y_i, & \text{if } y_i \le 0 \end{cases}

    where a_i is a learnable coefficient controlling the slope of the negative part (either one a_i per channel, or a single coefficient shared by all channels of a layer). The above equation is equivalent to

    f(y_i) = \max(0, y_i) + a_i \min(0, y_i)
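    To make the definition concrete, below is a minimal channel-wise PReLU layer in PyTorch. This is a sketch for illustration, not the paper's code; PyTorch also ships a built-in torch.nn.PReLU. The slopes a_i are initialized to 0.25, as in the paper.

```python
import torch
import torch.nn as nn

class ChannelwisePReLU(nn.Module):
    """Channel-wise PReLU: f(y) = max(0, y) + a * min(0, y),
    with one learnable slope a_i per channel (initialized to 0.25)."""
    def __init__(self, num_channels, init=0.25):
        super().__init__()
        self.a = nn.Parameter(torch.full((num_channels,), init))

    def forward(self, y):
        # y: (N, C, H, W); broadcast the per-channel slope over spatial dims.
        a = self.a.view(1, -1, 1, 1)
        return torch.clamp(y, min=0) + a * torch.clamp(y, max=0)

# Usage: prelu = ChannelwisePReLU(64); out = prelu(torch.randn(8, 64, 32, 32))
```

    The channel-shared variant simply uses a single scalar a for the whole layer.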

    Table 1 of the paper shows the learned coefficients of PReLU for each layer. There are two interesting phenomena in Table 1.
    First, the first conv layer (conv1) has coefficients (0.681 and 0.596) significantly greater than 0. As the filters of conv1 are mostly Gabor-like filters such as edge or texture detectors, the learned results show that both positive and negative responses of the filters are respected.
    Second, for the channel-wise version, the deeper conv layers in general have smaller coefficients. This implies that the activations gradually become "more nonlinear" at increasing depths. In other words, the learned model tends to keep more information in earlier stages and becomes more discriminative in later stages.

    2.2 Initialization of Filter Weights for Rectifiers

    Rectifier networks are easier to train than traditional sigmoid-like networks, but a bad initialization can still hamper the learning of a highly non-linear system. For a conv layer, the response is y_l = W_l x_l + b_l. Assume the elements of w_l are i.i.d. samples from a zero-mean distribution, independent of x_l, and that b_l = 0. Let n_l denote the dimension of x_l in the l-th layer; for a conv layer, n_l = k^2 c, where k is the kernel size and c is the number of input channels. Then:

    Var[y_l] = n_l Var[w_l x_l] = n_l Var[w_l] E[x_l^2]

    Here, Var[x_l] = E[x_l^2] - E^2[x_l], so E[x_l^2] equals Var[x_l] only if x_l has zero mean. For ReLU, x_l = max(0, y_{l-1}) does not have zero mean, which is why E[x_l^2] appears above rather than Var[x_l].

    If we let w_{l-1} have a symmetric distribution around zero and b_{l-1} = 0, then y_{l-1} has zero mean and a symmetric distribution around zero. Since x_l = \max(0, y_{l-1}), this leads to E[x_l^2] = \frac{1}{2} Var[y_{l-1}], and therefore:

    Var[y_l]=\frac{1}{2}n_l Var[w_l]Var[y_{l-1}]

    With L layers put together, we have:
    Var[y_L] = Var[y_1] \left( \prod_{l=2}^{L} \frac{1}{2} n_l Var[w_l] \right)

    This product is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially, so each factor in the product is expected to be a proper scalar (e.g., 1):
    \frac{1}{2} n_l Var[w_l] = 1, \quad \forall l

    This leads to a zero-mean Gaussian distribution whose standard deviation (std) is \sqrt{2/n_l}; the biases are initialized to zero.
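    The NumPy sketch below (my own sanity check, not the paper's code) draws weights from N(0, 2/n_l) and verifies that the response variance stays roughly constant across many ReLU layers. Fully-connected layers of width n stand in for the conv fan-in n_l = k^2 c, and the width and depth are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_std(k, c):
    """He initialization std: sqrt(2 / n_l) with fan-in n_l = k^2 * c."""
    return np.sqrt(2.0 / (k * k * c))

# e.g. std for a 3x3 conv with 64 input channels: he_std(3, 64) ~= 0.059

# Monte Carlo check: Var[y_l] = (1/2) n_l Var[w_l] Var[y_{l-1}], so with
# Var[w_l] = 2 / n_l the variance neither explodes nor vanishes with depth.
n, L = 256, 20                               # width (fan-in) and depth
y = rng.standard_normal((n, 2000))           # y_1 with unit variance
for _ in range(L - 1):
    x = np.maximum(y, 0)                     # ReLU: x_l = max(0, y_{l-1})
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
    y = W @ x                                # y_l = W_l x_l  (b_l = 0)
print("Var[y_L] after", L, "layers:", round(float(y.var()), 3))  # close to 1
```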

    3. Related work
    Activation Functions

    • Saturated
      • tanh
      • sigmoid
    • Non-saturated
      • ReLU
      • ELU
      • Leaky ReLU
      • PReLU
    Non-saturated activations alleviate the vanishing-gradient problem and lead to faster convergence, as the small sketch below illustrates.
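    As a quick toy illustration (my own example, not from the paper) of why saturation hurts gradients, compare the derivative of the sigmoid with that of ReLU as the pre-activation grows:

```python
import numpy as np

# Gradients of a saturated (sigmoid) vs. a non-saturated (ReLU) activation.
y = np.array([0.5, 2.0, 5.0, 10.0])

sig = 1.0 / (1.0 + np.exp(-y))
sigmoid_grad = sig * (1.0 - sig)       # shrinks toward 0 as |y| grows (saturates)
relu_grad = (y > 0).astype(float)      # stays exactly 1 for all positive inputs

print("sigmoid grad:", sigmoid_grad)   # ~[0.235, 0.105, 0.0066, 4.5e-05]
print("relu grad:   ", relu_grad)      # [1. 1. 1. 1.]
```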

    References:
    [1] K. He, X. Zhang, S. Ren, J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
    [2] https://www.cnblogs.com/everyday-haoguo/p/Note-PRelu.html
    [3] https://github.com/happynear/gitbook/blob/master/DeepLearning/PReLU.md
