Lecture 6 | Acceleration, Regularization

Author: Ysgc | Published 2019-10-20 05:49

Off-policy analogy: each step has a suggested move of step size times gradient, but instead of taking it, momentum moves along an average of the gradient history.

Vanilla GD vs. Momentum
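
A minimal sketch of the two update rules on a made-up quadratic objective; `grad`, `eta`, and `beta` below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def grad(w):
    # hypothetical objective: f(w) = 0.5 * ||w||^2, so grad f(w) = w
    return w

eta, beta = 0.1, 0.9              # illustrative hyperparameters
w_gd = np.array([5.0, -3.0])      # vanilla GD iterate
w_mom = w_gd.copy()               # momentum iterate
v = np.zeros_like(w_mom)          # exponentially weighted gradient history

for _ in range(100):
    # vanilla GD: follow only the current gradient
    w_gd = w_gd - eta * grad(w_gd)

    # momentum: move along an average of past gradients instead
    v = beta * v + eta * grad(w_mom)
    w_mom = w_mom - v
```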

Nesterov: assume the weights take one more step along the momentum direction, compute the gradient at that lookahead point, and use this predicted gradient as the current gradient for the update.
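
A sketch of that lookahead, reusing the hypothetical `grad`, `eta`, and `beta` from the block above; the gradient is evaluated at the position momentum alone would carry the weights to:

```python
w_nag = np.array([5.0, -3.0])
v = np.zeros_like(w_nag)

for _ in range(100):
    lookahead = w_nag - beta * v          # predicted next position
    v = beta * v + eta * grad(lookahead)  # gradient at the lookahead point
    w_nag = w_nag - v                     # actual update uses that gradient
```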

(Figure: blue = momentum; green = Nesterov.)

Something seems off here though: within each do-until loop, if the lookahead gradient is taken first, how can W_k still be updated on line 5 of the slide's pseudocode?

Momentum makes no assumption that the Hessian is diagonal.


A brief summary of different optimization methods in DNNs


GD -> SGD

As the number of instances or mini-batches processed grows, the learning rate should shrink!! Otherwise the iterate keeps bouncing around the optimum.
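
Two standard decay schedules that implement this (both are common choices, not necessarily the lecture's):

```python
eta0 = 0.1  # assumed initial learning rate

def step_decay(t, drop=0.5, every=10):
    # cut the learning rate in half every `every` iterations
    return eta0 * drop ** (t // every)

def inverse_decay(t):
    # eta_t proportional to 1/t, the classic schedule for SGD convergence
    return eta0 / (1.0 + t)
```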

Criteria for how far we are from the optimum (formalized below):

• |w - w*|, the Euclidean distance to the optimum
• the size of the decrease achieved at each step
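
In symbols (a standard formalization, with w* the optimum and ε the target accuracy; the exact notation is an assumption):

```latex
\|w_k - w^*\|_2 \le \epsilon
\qquad \text{or} \qquad
f(w_k) - f(w^*) \le \epsilon
```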

Batch GD converges faster: O(log(1/ε)) iterations versus O(1/ε) for SGD (rates for the strongly convex case; see the CMU notes linked below). But SGD updates after every single instance, so each of its iterations is far cheaper.

Batch GD converges faster, and the variance of its error is lower.
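
A toy least-squares comparison illustrating both claims; the data, model, and step sizes here are made up for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

def full_grad(w):
    # exact gradient of the mean squared error over all 100 instances
    return X.T @ (X @ w - y) / len(y)

w_gd, w_sgd = np.zeros(2), np.zeros(2)
for t in range(1, 501):
    # batch GD: smooth, deterministic trajectory
    w_gd = w_gd - 0.1 * full_grad(w_gd)
    # SGD: one random instance per update, with a shrinking learning rate
    i = rng.integers(len(y))
    g = X[i] * (X[i] @ w_sgd - y[i])
    w_sgd = w_sgd - (0.1 / t) * g
```

Plotting the distance to the optimum for both shows GD descending smoothly while SGD scatters around it; that scatter is the variance the notes refer to.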

f(x) -> the target function
g(x; w) -> the current NN
Sample from f(x) -> minimize the empirical error
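
Written out for N samples drawn from f, with div(·,·) standing for whatever divergence/loss is used (this notation is an assumption):

```latex
\widehat{Err}(w) = \frac{1}{N}\sum_{i=1}^{N}
\mathrm{div}\!\left(g(x_i; w),\, f(x_i)\right),
\qquad
w^{*} = \arg\min_{w} \widehat{Err}(w)
```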

Different sets of samples lead to different update behavior; that is the meaning of the variance of the estimate.

https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/stochastic-gd.pdf
