Derivation of the Layer Normalization Backward Pass

Author: YoungLittleFat | Published on 2019-03-28 20:53

    Purpose and Formula of Layer Normalization

    Layer Normalization was introduced in the following paper:

    "Layer Normalization"
    https://arxiv.org/pdf/1607.06450.pdf

    Its purpose is to reduce the covariate shift between layers of a deep neural network and thereby speed up convergence. Unlike Batch Normalization, Layer Normalization computes its statistics over the hidden units of a single sample rather than over a batch, which makes it particularly well suited to RNN-style architectures.
    The formula for Layer Normalization is:
    \begin{equation}
    \begin{aligned}
    \mu &= \frac{1}{H} \sum_{i=1}^{H} x_i \\
    \sigma &= \sqrt{ \frac{1}{H} \sum_{i=1}^{H} \left( x_i - \mu \right)^2 } \\
    y &= \mathrm{g} \odot \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \mathrm{b}
    \end{aligned}
    \end{equation}


    where $\mathrm{g}$ and $\mathrm{b}$ are learnable parameters and $\odot$ denotes element-wise multiplication.
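
    As a concrete reference, here is a minimal NumPy sketch of the forward pass for a single vector of length H, following the formula above (the function name layer_norm_forward, the eps default, and the returned cache are my own choices for illustration, not part of the original post):

```python
import numpy as np

def layer_norm_forward(x, g, b, eps=1e-5):
    """Layer Normalization forward pass for a single vector x of length H."""
    mu = x.mean()                          # mean over the H hidden units
    var = ((x - mu) ** 2).mean()           # sigma^2 in the formula above
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized input
    y = g * x_hat + b                      # element-wise scale and shift
    cache = (x_hat, var, g, eps)           # values reused by the backward pass
    return y, cache
```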

    Backward-Pass Derivation for Layer Normalization

    The gradient of Layer Normalization has three parts: the gradient with respect to the input and the gradients with respect to the two learnable parameters.

    Gradients of the Learnable Parameters

    The learnable parameters are $\mathrm{g}$ and $\mathrm{b}$. Here we define $\hat{x} = \left( x - \mu \right) \cdot (\sigma^2 + \varepsilon)^{-\frac{1}{2}}$.


    \begin{equation}
    \begin{aligned}
    \frac{\partial L}{\partial g_i} &= \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial g_i} \\
    &= \frac{\partial L}{\partial y_i} \cdot \hat{x}_i
    \end{aligned}
    \end{equation}

    \begin{equation}
    \begin{aligned}
    \frac{\partial L}{\partial b_i} &= \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial b_i} \\
    &= \frac{\partial L}{\partial y_i} \cdot 1
    \end{aligned}
    \end{equation}
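
    Both gradients translate directly into code. Continuing the sketch above (dy stands for $\frac{\partial L}{\partial y}$, and the function name layer_norm_grad_params is again a hypothetical choice for illustration):

```python
def layer_norm_grad_params(dy, cache):
    """Gradients of the loss w.r.t. the learnable gain g and bias b."""
    x_hat, _, _, _ = cache
    dg = dy * x_hat   # dL/dg_i = dL/dy_i * x_hat_i
    db = dy * 1.0     # dL/db_i = dL/dy_i * 1
    return dg, db
```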

    Gradient of the Input

    The original formula can be rewritten as:
    \begin{equation}
    y_i = g_i \hat{x}_i + b_i
    \end{equation}

    The gradient with respect to the input is then:
    \begin{equation}
    \begin{aligned}
    \frac{\partial L}{\partial x_i} &= \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} \cdot \frac{\partial y_j}{\partial \hat{x}_j} \cdot \frac{\partial \hat{x}_j}{\partial x_i} } \\
    &= \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} \cdot g_j \cdot \frac{\partial \hat{x}_j}{\partial x_i} }
    \end{aligned}
    \end{equation}

    We now focus on the rightmost factor:

    \begin{equation}
    \begin{aligned}
    \frac{\partial \hat{x}_j}{\partial x_i} &= \frac{\partial}{\partial x_i} \left[ \left( x_j - \mu \right) \cdot (\sigma^2 + \varepsilon)^{-\frac{1}{2}} \right] \\
    &= (\sigma^2 + \varepsilon)^{-\frac{1}{2}} \cdot \left( \delta_{ij} - \frac{1}{H} \right) \cdots (1) \\
    &\quad + \left( x_j - \mu \right) \cdot \left( -\frac{1}{2} \right) \cdot (\sigma^2 + \varepsilon)^{-\frac{3}{2}} \cdot \frac{\partial \sigma^2}{\partial x_i} \cdots (2)
    \end{aligned}
    \end{equation}

    where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise; term (1) uses $\frac{\partial (x_j - \mu)}{\partial x_i} = \delta_{ij} - \frac{1}{H}$. It is also easy to show that $\frac{\partial \sigma^2}{\partial x_i} = \frac{2}{H}(x_i - \mu)$, since $\frac{\partial \mu}{\partial x_i} = \frac{1}{H}$ and $\sum_{j=1}^{H}(x_j - \mu) = 0$. We now substitute terms (1) and (2) back into the gradient expression separately.

    Term (1):
    \begin{equation}
    \begin{aligned}
    &\sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} \cdot g_j \cdot (\sigma^2 + \varepsilon)^{-\frac{1}{2}} \cdot \left( \delta_{ij} - \frac{1}{H} \right) } \\
    &= (\sigma^2 + \varepsilon)^{-\frac{1}{2}} \cdot \left( \frac{\partial L}{\partial y_i} \cdot g_i - \frac{1}{H} \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} g_j } \right)
    \end{aligned}
    \end{equation}

    Term (2):
    \begin{equation}
    \begin{aligned}
    &\sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} \cdot g_j \left( x_j - \mu \right) \cdot \left( -\frac{1}{2} \right) \cdot (\sigma^2 + \varepsilon)^{-\frac{3}{2}} \cdot \frac{2}{H}(x_i - \mu) } \\
    &= -\frac{1}{H} (\sigma^2 + \varepsilon)^{-\frac{1}{2}} \cdot \left[ (x_i - \mu)(\sigma^2 + \varepsilon)^{-\frac{1}{2}} \right] \cdot \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} \cdot g_j \cdot \left[ (x_j - \mu)(\sigma^2 + \varepsilon)^{-\frac{1}{2}} \right] } \\
    &= -\frac{1}{H} (\sigma^2 + \varepsilon)^{-\frac{1}{2}} \cdot \hat{x}_i \cdot \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} g_j \hat{x}_j }
    \end{aligned}
    \end{equation}

    Recall that $\hat{x} = \left( x - \mu \right) \cdot (\sigma^2 + \varepsilon)^{-\frac{1}{2}}$. Combining the two terms:

    \begin{equation}
    \frac{\partial L}{\partial x_i} = (\sigma^2 + \varepsilon)^{-\frac{1}{2}} \left[ \frac{\partial L}{\partial y_i} g_i - \frac{1}{H} \left( \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} g_j } + \hat{x}_i \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} g_j \hat{x}_j } \right) \right]
    \end{equation}
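
    Written out in code, this combined expression becomes a few vectorized lines. The sketch below continues the earlier hypothetical functions (dy is $\frac{\partial L}{\partial y}$ and cache comes from layer_norm_forward):

```python
def layer_norm_grad_input(dy, cache):
    """Gradient of the loss w.r.t. the input x, using the combined formula."""
    x_hat, var, g, eps = cache
    H = x_hat.shape[0]
    dy_g = dy * g  # dL/dy_j * g_j
    # dL/dx_i = (var+eps)^(-1/2) * [ dy_i*g_i - (sum_j dy_j*g_j + x_hat_i * sum_j dy_j*g_j*x_hat_j) / H ]
    dx = (dy_g - (dy_g.sum() + x_hat * (dy_g * x_hat).sum()) / H) / np.sqrt(var + eps)
    return dx
```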

    Summary of Formulas

    \begin{equation}
    \begin{aligned}
    \frac{\partial L}{\partial x_i} &= (\sigma^2 + \varepsilon)^{-\frac{1}{2}} \left[ \frac{\partial L}{\partial y_i} g_i - \frac{1}{H} \left( \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} g_j } + \hat{x}_i \sum_{j=1}^{H}{ \frac{\partial L}{\partial y_j} g_j \hat{x}_j } \right) \right] \\
    \frac{\partial L}{\partial g_i} &= \frac{\partial L}{\partial y_i} \cdot \hat{x}_i \\
    \frac{\partial L}{\partial b_i} &= \frac{\partial L}{\partial y_i} \cdot 1
    \end{aligned}
    \end{equation}
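
    A quick finite-difference check (a sketch assuming the three hypothetical functions defined above) confirms that the input-gradient formula matches a numerical gradient; with $L = \sum_i y_i$ we have $\frac{\partial L}{\partial y_i} = 1$:

```python
rng = np.random.default_rng(0)
H = 8
x, g, b = rng.normal(size=H), rng.normal(size=H), rng.normal(size=H)
dy = np.ones(H)                       # dL/dy for L = sum(y)

_, cache = layer_norm_forward(x, g, b)
dx = layer_norm_grad_input(dy, cache)

# Central-difference approximation of dL/dx_i.
h = 1e-6
dx_num = np.zeros(H)
for i in range(H):
    xp, xm = x.copy(), x.copy()
    xp[i] += h
    xm[i] -= h
    dx_num[i] = (layer_norm_forward(xp, g, b)[0].sum()
                 - layer_norm_forward(xm, g, b)[0].sum()) / (2 * h)

print(np.max(np.abs(dx - dx_num)))    # very small (order 1e-9 or less), so the formula checks out
```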

    Typing out all these formulas took real effort; please credit the source if you repost this article.

    Original post: https://www.haomeiwen.com/subject/wxwnvqtx.html