Lecture 7 | Optimization and Generalization

Author: Ysgc | Published 2019-10-20 11:29

    http://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Spring.2019/www/slides.spring19/lecture_7_optimizations.pdf

    Convergence rate:

    • SGD: one sample, one update => then decay the LR
    • Minibatch: one update per batch of several samples => then decay the LR
    (convergence plots: minibatch SGD, minibatch with momentum, minibatch with Nesterov)
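
    As a reference for the three curves compared in those plots, here is a minimal NumPy sketch of the corresponding update rules; the toy quadratic loss and the hyperparameters eta and beta are my own illustrative choices, not values from the slides.

```python
import numpy as np

# Toy ill-conditioned quadratic loss 0.5 * w^T A w, so curvature differs per dimension.
A = np.diag([1.0, 20.0])

def grad(w):
    return A @ w

eta, beta = 0.04, 0.9                      # step size and momentum coefficient (illustrative)
w_sgd = w_mom = w_nag = np.array([3.0, 1.5])
v_mom = v_nag = np.zeros(2)

for _ in range(200):
    # plain minibatch SGD: step straight down the current gradient
    w_sgd = w_sgd - eta * grad(w_sgd)

    # momentum: keep a decaying running sum of past gradients and move along it
    v_mom = beta * v_mom - eta * grad(w_mom)
    w_mom = w_mom + v_mom

    # Nesterov: evaluate the gradient at the look-ahead point w + beta * v
    v_nag = beta * v_nag - eta * grad(w_nag + beta * v_nag)
    w_nag = w_nag + v_nag

print(w_sgd, w_mom, w_nag)                 # all three should approach the minimum at 0
```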

    Let's first look back.

    Smoothing by averaging is the core of momentum and its variants: scale the movement in each dimension according to the average and the variation of the gradient.

    Be careful with the notation: it is not the second-order derivative, but the square of the first-order derivative.


    RMSprop: in the ideal case, the gradient and the denominator cancel out, so only the sign of the gradient remains.

    Momentum only looks at the average of the gradient, while RMSprop looks at its mean square.
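
    A minimal sketch of the RMSprop idea described above: divide each gradient component by the root of a running mean of its square, so that in the ideal (stationary) case the magnitude cancels and mostly the sign remains. The decay rate gamma, eta, and eps below are common but illustrative values, and the toy loss is my own.

```python
import numpy as np

def rmsprop_step(w, g, s, eta=1e-3, gamma=0.9, eps=1e-8):
    """One RMSprop update.
    s keeps a running mean of the *squared* gradient (not a second derivative)."""
    s = gamma * s + (1.0 - gamma) * g**2        # E[g^2], tracked per dimension
    w = w - eta * g / (np.sqrt(s) + eps)        # g / sqrt(E[g^2]) ~= sign(g)
    return w, s

w = np.array([2.0, -3.0])
s = np.zeros_like(w)
for _ in range(5):
    g = 2 * w                                   # gradient of a toy loss ||w||^2
    w, s = rmsprop_step(w, g, s)
print(w)
```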

    Is there a method that uses both the first- and second-moment history of the gradient? -> Adam

    Initially the moving averages are 0 and delta is close to 1, so the early steps would be too small without the bias correction.

    Want \sqrt{1-\gamma} \approx 1-\delta in Adam?
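
    Here is a minimal sketch of Adam as the combination of the two ideas above (first-moment history as in momentum, second-moment history as in RMSprop), including the bias correction that fixes the too-small early steps. I use the common names beta1/beta2 for the decay rates written as delta and gamma in these notes; the toy loss and default values are illustrative.

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at (1-indexed) step t."""
    m = beta1 * m + (1 - beta1) * g            # first moment: running mean of g
    v = beta2 * v + (1 - beta2) * g**2         # second moment: running mean of g^2
    m_hat = m / (1 - beta1**t)                 # bias correction: m and v start at 0,
    v_hat = v / (1 - beta2**t)                 # so early estimates are biased low
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([2.0, -3.0])
m = np.zeros_like(w); v = np.zeros_like(w)
for t in range(1, 6):
    g = 2 * w                                  # gradient of a toy loss ||w||^2
    w, m, v = adam_step(w, g, m, v, t)
print(w)
```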

    Beale's function: AdaDelta is the fastest in this case; SGD is still the slowest.

    If the case is more complex, SGD may not be that bad.

    Methods that use the second moment of the gradient are usually fast enough and don't swing back and forth.

    Design of the objective function: hopefully we use the right one.

    Both L2 and KL divergence are convex.

    KL divergence => encodes the prior knowledge that the output's range is (0, 1).

    Regression => L2
    Classification => KL

    L2 vs KL for a perceptron => KL is better; the L2 curve is flat on the left, so it is not as convex there.
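
    To make the comparison concrete, this small sketch computes both losses for a sigmoid output and a 0/1 target; the specific pre-activation values are my own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_loss(y, d):
    return 0.5 * (y - d)**2

def kl_loss(y, d, eps=1e-12):
    # KL divergence / cross-entropy for an output constrained to (0, 1)
    return -(d * np.log(y + eps) + (1 - d) * np.log(1 - y + eps))

d = 1.0                                   # target class
for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:     # pre-activation values
    y = sigmoid(z)
    print(f"z={z:+.1f}  y={y:.4f}  L2={l2_loss(y, d):.4f}  KL={kl_loss(y, d):.4f}")
# For very wrong predictions (z << 0 with d = 1), L2 saturates near 0.5 (flat, weak gradient),
# while KL keeps growing, so its gradient keeps pushing in the right direction.
```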

    Batch normalization

    Minibatch training assumes that every batch covers the same region of the data;
    however, in practice the batches can be far apart from each other.

    Two steps of batch normalization:

    • move the batch to the origin and normalize it
    • shift/scale it to a common learned location

    Batch normalization goes between the affine transformation and the activation.
    It can't be done with single-sample SGD => it only makes sense with minibatches (with the whole batch it isn't needed, since all the data lives in the same region).
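
    A minimal NumPy sketch of the two steps above for one minibatch, applied to the affine outputs z before the activation; gamma and beta are the learned scale and shift, and the toy batch is my own illustration.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Z: (batch_size, num_units) affine outputs of one minibatch."""
    mu = Z.mean(axis=0)                    # step 1: move the batch to the origin
    var = Z.var(axis=0)
    U = (Z - mu) / np.sqrt(var + eps)      # ... and normalize to unit variance
    Zhat = gamma * U + beta                # step 2: shift/scale to a common learned location
    cache = (U, mu, var, gamma, eps)       # saved for the backward pass
    return Zhat, cache

Z = np.random.randn(32, 4) * 3.0 + 5.0     # a toy minibatch, off-center on purpose
gamma, beta = np.ones(4), np.zeros(4)
Zhat, cache = batchnorm_forward(Z, gamma, beta)
print(Zhat.mean(axis=0), Zhat.std(axis=0)) # ~0 mean, ~1 std per unit
```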

    Computing the derivatives becomes painful now...

    \frac{\partial Div}{\partial z_i} = \frac{\partial Div}{\partial u_i}\cdot \frac{\partial u_i}{\partial z_i} + \frac{\partial Div}{\partial \sigma_B^2} \cdot \frac{\partial \sigma_B^2}{\partial z_i} + \frac{\partial Div}{\partial \mu_B} \cdot \frac{\partial \mu_B}{\partial z_i}

    where \frac{\partial Div}{\partial \sigma_B^2} and \frac{\partial Div}{\partial \mu_B} themselves sum \frac{\partial Div}{\partial u_j} over the whole minibatch.

    \frac{\partial Div}{\partial u_i} was already given two slides earlier.
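
    To make the chain rule above concrete, here is a sketch of the batch-norm backward pass. It assumes the incoming gradient dZhat = dDiv/dZhat from the layer above and reuses the cache returned by the batchnorm_forward sketch earlier; the function and variable names are my own.

```python
import numpy as np

def batchnorm_backward(dZhat, cache):
    """Backward pass matching batchnorm_forward above.
    dZhat: dDiv/dZhat from the layer above, shape (batch_size, num_units)."""
    U, mu, var, gamma, eps = cache
    B = U.shape[0]
    std = np.sqrt(var + eps)
    Zc = U * std                                   # recover z_i - mu_B from the cache

    dbeta = dZhat.sum(axis=0)
    dgamma = (dZhat * U).sum(axis=0)
    dU = dZhat * gamma                             # dDiv/du_i

    # the three paths of the chain rule: direct, via sigma_B^2, via mu_B
    dvar = (dU * Zc * -0.5 * std**-3).sum(axis=0)  # dDiv/dsigma_B^2, summed over the batch
    dmu = (-dU / std).sum(axis=0) + dvar * (-2.0 * Zc).mean(axis=0)   # dDiv/dmu_B
    dZ = dU / std + dvar * 2.0 * Zc / B + dmu / B
    return dZ, dgamma, dbeta
```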


    https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time

    My question here: if two groups of inputs with different labels have dramatically different means but the same shape of distribution, will they produce the same outputs after normalization?


    For test data ->

    • a test batch -> use the mean and var of that batch? (not sure)
    • a single input -> use the historical running averages of the mean and var
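
    A sketch of the single-input case described above (and discussed in the Quora link): keep exponential running averages of the batch statistics during training, then normalize with those stored values at test time. The class name and momentum value are my own, illustrative choices.

```python
import numpy as np

class BatchNormStats:
    """Tracks running mean/var during training; uses them at test time."""
    def __init__(self, num_units, momentum=0.9, eps=1e-5):
        self.running_mean = np.zeros(num_units)
        self.running_var = np.ones(num_units)
        self.momentum, self.eps = momentum, eps

    def normalize(self, Z, training):
        if training:                               # use this minibatch's statistics
            mu, var = Z.mean(axis=0), Z.var(axis=0)
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:                                      # test: historical (running) statistics
            mu, var = self.running_mean, self.running_var
        return (Z - mu) / np.sqrt(var + self.eps)

bn = BatchNormStats(4)
for _ in range(100):                               # "training" on random minibatches
    bn.normalize(np.random.randn(32, 4) * 2 + 1, training=True)
x = np.random.randn(1, 4) * 2 + 1                  # a single test input
print(bn.normalize(x, training=False))
```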

    Regularization and overfitting

    ~10^30 possible inputs would be needed for a full description of these points;
    even if we have 10^15 samples, the space is still nearly vacuous.

    The sigmoid permits these steep curves.


    Another way to smooth the output -> go deeper!

    Rearrange the structure: ~660 params -> 3 layers with 220 neurons -> 4 layers with 165 neurons -> ... -> prefer narrow but deep NNs.
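
    To see the kind of trade-off this points at, here is a small sketch that counts the parameters of a fully connected network for different depth/width choices; the layer sizes below are my own illustrative ones, not the 660/220/165 configuration from the slide.

```python
def mlp_param_count(layer_sizes):
    """Number of weights + biases in a fully connected net with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# wide-and-shallow vs narrow-and-deep, with roughly comparable parameter budgets
print(mlp_param_count([16, 36, 1]))          # 1 hidden layer of 36 units
print(mlp_param_count([16, 16, 16, 1]))      # 2 hidden layers of 16 units
print(mlp_param_count([16, 12, 12, 12, 1]))  # 3 hidden layers of 12 units
```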

    Dropout

    Dropout is similar to bagging.

    Different inputs may see different (thinned) networks.

    Pseudocode
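
    The slide's pseudocode is not transcribed here; as a stand-in, this is a minimal sketch of the inverted-dropout variant for one layer's activations. Each training example draws its own mask, so each one sees a different subnetwork; dividing by the keep probability during training keeps the expected activation unchanged, so nothing special is needed at test time.

```python
import numpy as np

def dropout_forward(h, keep_prob=0.8, training=True):
    """Inverted dropout on the activations h of one layer."""
    if not training:
        return h                                   # no-op at test time with inverted scaling
    mask = (np.random.rand(*h.shape) < keep_prob)  # Bernoulli(keep_prob) mask per unit
    return h * mask / keep_prob                    # rescale so E[output] matches test time

h = np.random.randn(4, 8)                          # activations for a minibatch of 4
print(dropout_forward(h, keep_prob=0.8, training=True))
print(dropout_forward(h, training=False))
```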

    With dropout -> forces the NN to learn a more robust model.

    If the gradient is high in some region -> it can blow up.


    Workflow
