accuracy (counting) is not differentiable! cross-entropy error is just a differentiable approximation (a surrogate) of accuracy
sometimes, minimizing the cross-entropy does not maximize the accuracy (see the sketch below)
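A minimal sketch (assumed toy data, not from the notes) of that mismatch: model B below has lower cross-entropy than model A, yet lower accuracy, because extreme confidence on three points outweighs one misclassification in the loss but not in the count.

```python
import numpy as np

def cross_entropy(p, y):
    # mean binary cross-entropy; p = predicted P(y=1), y in {0, 1}
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(p, y):
    return np.mean((p > 0.5) == (y == 1))

y   = np.array([1, 1, 0, 0])
p_a = np.array([0.60, 0.60, 0.40, 0.40])  # mildly confident, all 4 correct
p_b = np.array([0.99, 0.45, 0.01, 0.01])  # very confident on 3, wrong on 1

print(accuracy(p_a, y), cross_entropy(p_a, y))  # 1.00, ~0.51
print(accuracy(p_b, y), cross_entropy(p_b, y))  # 0.75, ~0.21 (lower loss!)
```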
both the perceptron and a sigmoid NN can find the decision boundary successfully
Now one more point: the perceptron reaches 100% accuracy, while the sigmoid NN cannot (assuming the NN's weights are bounded, e.g. the weight vector has length 1, so the sigmoid outputs never saturate to exactly 0 or 1; see the sketch below)
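A small sketch (assumed setup) of the bounded-weight claim: with ||w|| = 1 and bounded inputs, sigmoid(w·x) is bounded away from 0 and 1, so the outputs never reach the hard 0/1 targets even when every point sits on the correct side of the boundary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))    # inputs with norm <= sqrt(2)
w = np.array([1.0, 1.0]) / np.sqrt(2)  # unit-norm weight vector
p = sigmoid(X @ w)
print(p)  # all strictly inside (sigmoid(-sqrt(2)), sigmoid(sqrt(2))) ~ (0.20, 0.80)
```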
high dim -> no one knows what the loss landscape looks like -> only hypotheses
saddle point -> some eigenvalues of the Hessian matrix are positive, and some are negative
R => how fast it converges: the factor by which the error shrinks per step (see the sketch after this list)
R > 1 => getting worse
R = 1 => no better, no worse
R < 1 => better
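A minimal sketch (assumed 1-D quadratic, hypothetical curvature value) of the ratio R for gradient descent on f(x) = (lam/2)·x²: each step multiplies the error by R = |1 − eta·lam|, so R < 1 converges, R = 1 stalls, R > 1 diverges.

```python
lam = 2.0  # curvature (the 1-D "Hessian"); hypothetical value for illustration

for eta in [0.25, 0.5, 1.0, 1.25]:  # step sizes giving R = 0.5, 0, 1, 1.5
    x = 1.0
    for _ in range(5):
        x -= eta * lam * x           # gradient step: x <- x - eta * f'(x)
    print(f"eta={eta}: R={abs(1 - eta * lam):.2f}, x after 5 steps = {x:.4f}")
```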
First consider the quadratic case
Newton's method: see https://zhuanlan.zhihu.com/p/83320557, section 4.1
Note the difference: in section 4.1 Newton's method finds a root of the function itself, whereas here we want a root of its derivative, so adding one more derivative to each term makes the forms match (see the sketch below). The optimal step for gradient descent is the inverse of the second-order derivative (the Hessian matrix)
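A minimal sketch (assumed example function) of that correspondence: root-finding updates x ← x − f(x)/f'(x); minimization runs the same iteration on the derivative, x ← x − f'(x)/f''(x), where 1/f'' is the 1-D inverse Hessian.

```python
import math

def fp(x):  return math.cos(x) + 0.5 * x  # f'(x)  for f(x) = sin(x) + x**2/4
def fpp(x): return -math.sin(x) + 0.5     # f''(x)

x = 0.0
for _ in range(6):
    x -= fp(x) / fpp(x)                   # Newton step on the derivative
print(x, fp(x))                           # fp(x) ~ 0 at the minimizer x ~ -1.03
```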
different dimensions may have different optimal step sizes -> a single shared step may converge in one direction but diverge in another -> we have to take the minimum of all the per-direction optimal steps (see the sketch below)
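A minimal sketch (assumed diagonal quadratic) of why the stiffest direction dictates the step: on f(x) = 0.5·Σ lam_i·x_i², coordinate i contracts by |1 − eta·lam_i|, so the per-direction optimum is eta = 1/lam_i, but a shared eta must satisfy eta < 2/max(lam) or the stiffest direction diverges.

```python
import numpy as np

lam = np.array([1.0, 100.0])  # very different curvatures per dimension
x = np.array([1.0, 1.0])

eta = 1.0 / lam[0]            # optimal for dim 0, far too big for dim 1
for _ in range(5):
    x = x - eta * lam * x     # gradient step, coordinate-wise
print(x)                      # dim 0 converges instantly, dim 1 blows up
```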
coupled solution -> normalization of the data
the quadratic term is approximated by the Hessian matrix; if eta = 1, this equals Newton's method (see the sketch below)
curse of dimensionality -- but we don't need to capture the whole Hessian matrix, right?
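A minimal sketch (assumed 2-D quadratic) of the eta = 1 remark: scaling the gradient by the full inverse Hessian with eta = 1 is exactly Newton's method and solves a quadratic in one step; using only diag(H) is a cheap stand-in that avoids storing the full d×d matrix.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # Hessian of f(x) = 0.5 * x @ H @ x
x = np.array([4.0, -3.0])
g = H @ x                             # gradient of the quadratic at x

x_newton = x - np.linalg.solve(H, g)  # eta = 1 with full H^-1: lands exactly at 0
x_diag   = x - g / np.diag(H)         # diagonal approximation: only gets closer to 0
print(x_newton, x_diag)
```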
the Hessian matrix and the quadratic approximation may not point in the right direction
there are a number of methods to approximate the Hessian, but all these 2nd-order methods fail in high dimensions
do BFGS and LM solve the stability issue? (see the sketch below)
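A hedged sketch of what BFGS does, using SciPy's implementation on the Rosenbrock function: a quasi-Newton method that builds a Hessian approximation from successive gradients instead of forming the true Hessian. This illustrates the mechanism the question refers to rather than answering it.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])
res = minimize(rosen, x0, jac=rosen_der, method="BFGS")
print(res.x, res.nit)  # converges to (1, 1) without ever forming the true Hessian
```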
why not use multi-step information??
inverse of the Hessian -> inverse of the matrix of second partial derivatives