
Lecture 6 | Acceleration, Regularization

Author: Ysgc | Published 2019-10-20 05:49

Momentum is "off-policy" in a sense: at each step the gradient suggests a move (step size × gradient), but the update actually follows a running average of the step history instead.

(Figure: vanilla GD vs. momentum)

Nesterov's method: assume the weights take one more step along the momentum direction, compute the gradient at that look-ahead point, and use that predicted gradient as the current gradient for the update.

(Figure: blue = momentum, green = Nesterov)
Something seems off here, though: inside each do-until loop, if that quantity is taken first, how can W_k still be updated at line 5 of the pseudocode?
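To make the two updates concrete, here is a minimal NumPy sketch (my own illustration, not the lecture's pseudocode); the gradient function `grad`, the learning rate `lr`, and the momentum factor `beta` are assumed placeholders:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Classical momentum: move along a running average of past steps."""
    v = beta * v - lr * grad(w)           # accumulate the step history
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, beta=0.9):
    """Nesterov: take one more step with momentum first, then use the
    gradient at that predicted point as the current gradient."""
    w_lookahead = w + beta * v            # assume one more momentum step
    v = beta * v - lr * grad(w_lookahead)
    return w + v, v

# Toy usage on f(w) = w^2, whose gradient is 2w:
w, v = np.array([5.0]), np.zeros(1)
for _ in range(50):
    w, v = nesterov_step(w, v, lambda x: 2 * x)
print(w)  # should approach 0
```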

In momentum => there is no assumption that the Hessian is diagonal


A brief summary of different optimization methods in DNNs


GD -> SGD

In each iteration, as the number of instances or batches seen grows, the learning rate should shrink!! Otherwise the iterate keeps bouncing around the optimum instead of converging.
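For instance, a standard 1/t-style decay (a generic sketch; `eta0` and `t0` are illustrative constants, not values from the lecture):

```python
def lr_schedule(t, eta0=0.1, t0=10.0):
    """Shrink the step size as updates accumulate, so SGD settles
    near the optimum instead of bouncing around it."""
    return eta0 * t0 / (t0 + t)
```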

Criteria for how far we are from the optimum:

  • |w - w*|, the Euclidean distance to the optimizer
  • the update steps becoming steadily smaller

Batch GD converges faster: O(log(1/\epsilon)) iterations to reach \epsilon-accuracy, versus O(1/\epsilon) for SGD.
But SGD updates with each instance, so each of its iterations is far cheaper.
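A back-of-the-envelope total-cost comparison, under the standard strong-convexity assumptions used in the linked notes (n = number of training instances):

```latex
\[
\underbrace{O\!\left(n \log \tfrac{1}{\epsilon}\right)}_{\text{batch GD: } O(\log 1/\epsilon)\ \text{iters} \times O(n)\ \text{grads/iter}}
\quad \text{vs.} \quad
\underbrace{O\!\left(\tfrac{1}{\epsilon}\right)}_{\text{SGD: } O(1/\epsilon)\ \text{iters} \times O(1)\ \text{grads/iter}}
\]
```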

Batch GD converges faster per iteration, and the variance of its error is lower.

f(x) -> the target function
g(x;w) -> the current NN
sample from f(x) -> minimize the empirical error on those samples

Having different sets of samples -> different update behavior -> this is what the variance of the estimate means.
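A small self-contained illustration (my own toy example, not from the lecture): the gradient estimated from different mini-batches of the same data scatters around the full-batch gradient, and that scatter is the estimation variance just described:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(size=1000)   # noisy linear data

def batch_grad(w, idx):
    """Gradient of mean squared error (1/n) * sum (w*x - y)^2 w.r.t. w,
    computed on the mini-batch selected by idx."""
    xb, yb = X[idx, 0], y[idx]
    return 2.0 * np.mean((w * xb - yb) * xb)

w = 0.0
grads = [batch_grad(w, rng.choice(1000, size=32)) for _ in range(200)]
print("mean grad:", np.mean(grads), "variance:", np.var(grads))
```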

https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/stochastic-gd.pdf
