vector activation vs. scalar activation: a scalar activation (e.g. sigmoid) is applied element-wise, one output per input; a vector activation (e.g. softmax) couples all the outputs
sigmoid output -> the probability of the (positive) class in classification
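a minimal sketch (numpy; the names are my own, not from the lecture) contrasting the two kinds of activation:

```python
import numpy as np

def sigmoid(z):
    # scalar activation: applied element-wise, each output depends on one input
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # vector activation: every output depends on ALL inputs
    # (the normalization couples them)
    e = np.exp(z - z.max())   # shift by the max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(sigmoid(z))   # each entry is an independent probability in (0, 1)
print(softmax(z))   # entries sum to 1: a distribution over the classes
```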
how do we define the error?
first choice: squared Euclidean distance
L2 divergence: Div(Y, d) = (1/2) * ||Y - d||^2, so differentiation is just dDiv/dy_i = y_i - d_i
gradient < 0 => y_i should increase to reduce the divergence
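a quick sketch of that divergence and its gradient (the 1/2 factor is assumed, as above, so the derivative comes out clean):

```python
import numpy as np

def l2_div(y, d):
    # Div(Y, d) = 0.5 * ||Y - d||^2
    return 0.5 * np.sum((y - d) ** 2)

def l2_div_grad(y, d):
    # dDiv/dy_i = y_i - d_i
    return y - d

y = np.array([0.3, 0.9])
d = np.array([1.0, 0.0])
print(l2_div_grad(y, d))   # [-0.7, 0.9]: the first component is < 0,
                           # so y_0 should increase toward d_0 = 1
```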
replacing the hard 1/0 targets with slightly softened ones is arithmetically "wrong", but this label smoothing helps gradient descent
it avoids overshooting the target
https://leimao.github.io/blog/Label-Smoothing/
it's a heuristic
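a sketch of the standard uniform-mixture form of label smoothing (see the link above; eps and the function name are my own):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # mix the one-hot target with a uniform distribution over the K classes:
    # the true class gets 1 - eps + eps/K, every other class gets eps/K
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

d = np.array([0.0, 0.0, 1.0])
print(smooth_labels(d))   # [0.0333..., 0.0333..., 0.9333...]
```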
forward pass through the NN
backward pass through the NN:
(1) trivial: grad of output
(2) grad of the final activation layer
(3) grad of the last group of weights
(4) grad of the second-to-last group of activations y
(5) in summary: pseudocode & backward/forward comparison
backward: in each loop, apply an affine transformation (the transposed W) to the derivative, then multiply by the derivative of the activation function
forward: in each iteration, apply an affine transformation to the input, then an activation function
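a sketch of that symmetry for an MLP with scalar activations (W, b, act, act_deriv are illustrative names, not from the lecture):

```python
import numpy as np

def forward(x, W, b, act):
    ys, zs = [x], []                        # y_0 is the input
    for Wk, bk in zip(W, b):
        z = Wk @ ys[-1] + bk                # affine transformation
        zs.append(z)
        ys.append(act(z))                   # then an element-wise activation
    return ys, zs

def backward(dDiv_dy, ys, zs, W, act_deriv):
    dW, db = [], []
    g = dDiv_dy                             # (1) trivial: grad w.r.t. the output
    for k in reversed(range(len(W))):
        g = g * act_deriv(zs[k])            # (2) through the activation (element-wise)
        dW.insert(0, np.outer(g, ys[k]))    # (3) grad of this group of weights
        db.insert(0, g)                     # ... and of the bias
        g = W[k].T @ g                      # (4) affine transformation with W^T
    return dW, db
```

each backward loop mirrors a forward loop: the same two operations, in reverse order, with W transposed and the activation replaced by its derivative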
with a vector activation (e.g. softmax) at the output, step (2) is no longer an element-wise multiplication: you need the full Jacobian of the activation
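a sketch of step (2) for softmax (the Jacobian formula is standard; the example values are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(y):
    # J_ij = dy_i/dz_j = y_i * (delta_ij - y_j)
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, 2.0, 0.5])
y = softmax(z)
dDiv_dy = y - np.array([0.0, 1.0, 0.0])    # e.g. the L2 divergence gradient from above
dDiv_dz = softmax_jacobian(y) @ dDiv_dy    # Jacobian-vector product, not element-wise
print(dDiv_dz)
```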