
Lecture 4 | The Backpropagation

Author: Ysgc | Published 2019-10-19 14:01

    vector activation vs scalar activation

    \frac{\partial y_i}{\partial z_i} = \frac{\exp(z_i)}{\sum_k \exp(z_k)} - \frac{\exp(z_i)^2}{\left(\sum_k \exp(z_k)\right)^2} = y_i(1 - y_i)

    \frac{\partial y_i}{\partial z_j} = - \frac{\exp(z_i)\exp(z_j)}{\left(\sum_k \exp(z_k)\right)^2} = -y_i y_j \quad (i \neq j)
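
    A minimal NumPy sketch (mine, not from the lecture) that checks this softmax Jacobian against finite differences; the array values are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
y = softmax(z)

# closed form: diag(y) - y y^T, i.e. y_i(1 - y_i) on the diagonal
# and -y_i y_j off the diagonal
jacobian = np.diag(y) - np.outer(y, y)

# finite-difference check of dy_i/dz_j
eps = 1e-6
numeric = np.zeros((len(z), len(z)))
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(jacobian, numeric, atol=1e-6))   # True
```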

    sigmoid output -> prob of classification

    how to define the error???

    first choice: squared Euclidean distance

    L2 divergence -> its derivative w.r.t. y_i is just y_i - d_i

    gradient < 0 => y_i should increase to reduce the divergence
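
    A tiny sketch of this, assuming the usual 1/2 factor on the squared Euclidean distance so the derivative is exactly y_i - d_i; the vectors are made up:

```python
import numpy as np

y = np.array([0.3, 0.6, 0.1])      # network output (made-up values)
d = np.array([0.0, 1.0, 0.0])      # target

div = 0.5 * np.sum((y - d) ** 2)   # L2 divergence with the 1/2 factor
grad = y - d                       # d(div)/d(y_i) = y_i - d_i

# where the gradient is negative, y_i should increase to reduce the divergence
print(grad)        # [ 0.3 -0.4  0.1]
print(grad < 0)    # [False  True False]
```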

    the smoothed targets are arithmetically "wrong", but label smoothing helps gradient descent!

    avoid overshooting

    https://leimao.github.io/blog/Label-Smoothing/

    it's a heuristic
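
    A sketch of one common label-smoothing scheme (the exact variant in the linked post may differ); smooth_labels and eps are my own names:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # move eps of the probability mass from the target class
    # to the other K-1 classes (one common formulation)
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (k - 1)

d = np.array([0.0, 1.0, 0.0, 0.0])
print(smooth_labels(d))   # [0.0333... 0.9 0.0333... 0.0333...]
```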

    forward NN

    backward NN

    (1) trivial: grad of output
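
    for the L2 divergence above this is just the derivative already computed: grad(y_i^n) = \frac{\partial Div}{\partial y_i^n} = y_i^n - d_i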


    (2) grad of the final activation layer
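
    for a scalar activation f this is an element-wise product, grad(z_i^n) = grad(y_i^n)\, f'(z_i^n); the vector-activation case is handled in step (5) below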



    (3) grad of the last group of weights

    [grad(W_{ij}^n)] = [y_0^{n-1}, y_1^{n-1}, \dots, y_i^{n-1}]^T \cdot [grad(z_0^n), grad(z_1^n), \dots, grad(z_j^n)]
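
    A NumPy sketch of this outer product with made-up sizes (3 units in layer n-1, 2 units in layer n):

```python
import numpy as np

y_prev = np.array([0.2, 0.5, 0.7])   # y^{n-1}
grad_z = np.array([0.1, -0.3])       # grad(z^n)

# grad(W_ij^n) = y_i^{n-1} * grad(z_j^n): column vector times row vector
grad_W = np.outer(y_prev, grad_z)
print(grad_W.shape)   # (3, 2), same shape as W^n
```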

    (4) grad of the second last group of y


    [grad(y_i^{n-1})]^T = [W_{ij}^n] \cdot [grad(z_0^n), grad(z_1^n), \dots, grad(z_j^n)]^T
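
    A sketch of this step with the same made-up sizes; W^n is stored as (units in layer n-1) x (units in layer n), matching the W_ij indexing above:

```python
import numpy as np

W = np.array([[ 0.1, -0.2],
              [ 0.4,  0.3],
              [-0.5,  0.6]])          # W^n, shape (3, 2)
grad_z = np.array([0.1, -0.3])        # grad(z^n)

# grad(y_i^{n-1}) = sum_j W_ij^n * grad(z_j^n)
grad_y_prev = W @ grad_z
print(grad_y_prev)   # [ 0.07 -0.05 -0.23]
```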

    (5) in summary: pseudocode & backward/forward comparison


    backward: in each layer, apply an affine transformation (with the transposed W) to the derivative, then multiply by the derivative of the activation function

    forward: in each layer, apply an affine transformation to the input, then an activation function

    with a vector activation (e.g. softmax), step (2) is no longer an element-wise multiplication:

    [grad(z_0), grad(z_1), \dots, grad(z_i)]^T = \left[\frac{\partial y_j}{\partial z_i}\right] \cdot [grad(y_0), grad(y_1), \dots, grad(y_j)]^T
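
    Putting the steps together, a minimal sketch of the full forward/backward pass, assuming sigmoid activations, the L2 divergence, no biases, and my own variable names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    # forward: in each layer, an affine transformation then an activation.
    # weights[k][i, j] connects unit i of layer k to unit j of layer k+1,
    # matching the W_ij indexing in the notes (biases omitted for brevity).
    ys = [x]
    for W in weights:
        ys.append(sigmoid(ys[-1] @ W))
    return ys

def backward(ys, weights, d):
    # backward: in each layer, multiply by the activation derivative,
    # take an outer product for the weight gradient, then apply the
    # affine transformation (transposed relative to the forward pass).
    grad_Ws = [None] * len(weights)
    grad_y = ys[-1] - d                          # step (1): L2 divergence
    for k in range(len(weights) - 1, -1, -1):
        fprime = ys[k + 1] * (1.0 - ys[k + 1])   # sigmoid'(z) in terms of y
        grad_z = grad_y * fprime                 # step (2): scalar activation;
                                                 # a vector activation (softmax)
                                                 # needs the full Jacobian here
        grad_Ws[k] = np.outer(ys[k], grad_z)     # step (3): weight gradient
        grad_y = weights[k] @ grad_z             # step (4): back to layer k
    return grad_Ws

# usage with made-up shapes: 4 -> 3 -> 2 network
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
x = rng.standard_normal(4)
d = np.array([0.0, 1.0])
grads = backward(forward(x, weights), weights, d)
print([g.shape for g in grads])   # [(4, 3), (3, 2)]
```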

