



off-policy ~> each step has a suggested step * grad, but the update moves along the (exponentially weighted) average of the gradient history instead
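A minimal NumPy sketch of that averaging, assuming a generic momentum update (the function name and the `lr`/`beta` values are illustrative, not from the notes):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # velocity keeps an exponentially weighted average of past gradients
    velocity = beta * velocity + grad
    # the update follows the averaged history, not the current gradient alone
    return w - lr * velocity, velocity

# toy quadratic loss 0.5 * ||w||^2, whose gradient at w is simply w
w, v = np.array([1.0, -1.0]), np.zeros(2)
for _ in range(100):
    w, v = momentum_step(w, grad=w, velocity=v)
```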






Nesterov's look-ahead trick: assume the weights take one more step with momentum, compute the gradient at that predicted point, and use that predicted gradient as the current gradient for the update
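A sketch of that look-ahead step, assuming a callable `grad_fn` (hypothetical) that returns the gradient at a given point:

```python
import numpy as np

def nesterov_step(w, grad_fn, velocity, lr=0.01, beta=0.9):
    # pretend the weights take one more momentum step...
    lookahead = w - lr * beta * velocity
    # ...evaluate the gradient there, and use it as the current gradient
    velocity = beta * velocity + grad_fn(lookahead)
    return w - lr * velocity, velocity

# same toy quadratic as above: grad(w) = w
w, v = np.array([1.0, -1.0]), np.zeros(2)
for _ in range(100):
    w, v = nesterov_step(w, grad_fn=lambda u: u, velocity=v)
```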


This doesn't seem quite right: if that is taken first inside each do-until loop, how can Wk still be updated on line 5?
Momentum => no assumption that the Hessian is diagonal
A brief summary of different optimization methods for DNNs

GD -> SGD
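A toy contrast between the two update rules on made-up linear-regression data (the data, learning rate, and shapes are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # made-up regression data
y = X @ np.array([1.0, -2.0, 0.5])
w, lr = np.zeros(3), 0.01

# GD: one update from the gradient averaged over the entire dataset
grad = X.T @ (X @ w - y) / len(X)
w = w - lr * grad

# SGD: one update from a single random instance (noisier, far cheaper)
i = rng.integers(len(X))
w = w - lr * X[i] * (X[i] @ w - y[i])
```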









criteria for how far we are from the optimum (see the sketch below):
- ||w - w*||, the Euclidean distance to the optimum
- the update steps decreasing toward zero
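A sketch of both criteria; the names `converged`, `w_star`, and `tol` are illustrative. In practice w* is unknown, so the shrinking-step criterion is the usable one:

```python
import numpy as np

def converged(w, w_prev, w_star=None, tol=1e-6):
    if w_star is not None:
        # criterion 1: Euclidean distance to a known optimum w*
        return np.linalg.norm(w - w_star) < tol
    # criterion 2: the update steps have shrunk toward zero
    return np.linalg.norm(w - w_prev) < tol
```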

which converges faster?
- Batch GD converges faster per update, and the variance of its error is lower
- but SGD updates with each instance, so each step is much cheaper
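A quick illustration on made-up data of why the per-instance gradients that SGD follows are noisier than the batch gradient (which is their mean); all names and numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(3)

batch = X.T @ (X @ w - y) / len(X)           # low-variance batch gradient
per_example = X * (X @ w - y)[:, None]       # the gradients SGD actually follows
print("batch grad:", batch)
print("per-example spread:", per_example.var(axis=0))
```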


f(x) -> the target function
g(x; w) -> the current NN with weights w
sample from f(x) -> minimize the empirical error on those samples
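A sketch of the empirical error under those definitions, with a toy f and g (both hypothetical):

```python
import numpy as np

def empirical_error(g, w, samples):
    # mean squared error of g(x; w) against f(x) over a finite sample
    return np.mean([(g(x, w) - fx) ** 2 for x, fx in samples])

# toy target f(x) = 2x and linear model g(x; w) = w * x
f = lambda x: 2.0 * x
g = lambda x, w: w * x
samples = [(x, f(x)) for x in np.linspace(-1.0, 1.0, 50)]
print(empirical_error(g, w=1.5, samples=samples))
```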

having different sets of samples -> different updating behavior -> this is what the variance of the estimate means
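A sketch of that variance: draw several sample sets from the same f and compare the gradient estimates they produce (the sample sizes, seed, and weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])

def grad_estimate(n):
    # gradient of the squared loss at w = 0, estimated from n fresh samples of f
    X = rng.normal(size=(n, 2))
    y = X @ w_true
    return X.T @ (X @ np.zeros(2) - y) / n

# different sample sets -> different gradient estimates; their spread is the
# variance of the estimate, and it shrinks as the sample size n grows
for n in (1, 10, 100):
    ests = np.array([grad_estimate(n) for _ in range(500)])
    print(n, ests.var(axis=0))
```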








https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/stochastic-gd.pdf
