
Lecture 7 | Optimization and Gen

Author: Ysgc | Published 2019-10-20 11:29

http://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Spring.2019/www/slides.spring19/lecture_7_optimizations.pdf

Convergence rate:

  • SGD: one sample, one update, then decay the learning rate
  • Mini-batch: one update every several samples, again with learning-rate decay

Variants compared: mini-batch, mini-batch + momentum, mini-batch + Nesterov (see the sketch below).
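A minimal sketch of the three update rules being compared (the function and argument names here are illustrative assumptions, not from the slides):

```python
def sgd_step(w, grad, lr):
    # plain SGD: move against the gradient of one sample / one mini-batch
    return w - lr * grad

def momentum_step(w, v, grad, lr, mu=0.9):
    # momentum: smooth the update direction with a running average of past gradients
    v = mu * v - lr * grad
    return w + v, v

def nesterov_step(w, v, grad_fn, lr, mu=0.9):
    # Nesterov: evaluate the gradient at the "look-ahead" point w + mu * v
    grad = grad_fn(w + mu * v)
    v = mu * v - lr * grad
    return w + v, v
```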

Let's first look back.

Smoothing by averaging is the core of momentum and its variants: scale the movement in each dimension according to the average and the variation of the gradient.

Be careful with the notation: it is not the second derivative, but the square of the first derivative.


RMSprop: in the ideal case the gradient and the denominator cancel out, so only the sign of the gradient remains.

Momentum only looks at the average of past gradients, whereas RMSprop looks at the mean square of the gradient.
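A minimal RMSprop sketch under this reading (parameter names and default values are assumptions, not the slide's notation):

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=1e-3, gamma=0.9, eps=1e-8):
    # s: running mean of the *squared* gradient (not a second derivative)
    s = gamma * s + (1 - gamma) * grad ** 2
    # dividing by sqrt(s) roughly cancels the gradient's magnitude,
    # leaving (approximately) only its sign to drive the step
    return w - lr * grad / (np.sqrt(s) + eps), s
```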

Is there a method that uses both the first-moment and the second-moment history of the gradient? -> Adam

Initially the moment estimates are 0 and δ is close to 1, so without correction the early steps would be too small; hence Adam's bias-correction terms.

Do we want \sqrt{1-\gamma} \approx 1-\delta in Adam?
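A minimal Adam sketch with bias correction, keeping the slides' δ and γ as the decay rates of the first and second moments (the other names and defaults are assumptions):

```python
import numpy as np

def adam_step(w, m, v, grad, k, lr=1e-3, delta=0.9, gamma=0.999, eps=1e-8):
    # k: 1-based step count
    # first moment (mean of gradients) and second moment (mean of squared gradients)
    m = delta * m + (1 - delta) * grad
    v = gamma * v + (1 - gamma) * grad ** 2
    # bias correction: m and v start at 0, and delta/gamma are close to 1,
    # so the raw estimates are biased toward 0 in the first few steps
    m_hat = m / (1 - delta ** k)
    v_hat = v / (1 - gamma ** k)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```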

Beale's function example: AdaDelta is the fastest in this case; SGD is still the slowest.

If the case is more complex, SGD may not be that bad.

Methods that use the second moment of the gradient are usually fast enough and do not oscillate.

Designing the objective function: hopefully we use the right one.

Both the L2 loss and the KL divergence are convex.


KL divergence => prior knowledge that the output's range is (0, 1)

regression => L2
classification => KL

L2 vs KL for the perceptron => KL is better; the L2 curve is flat on the left, so it is not as convex there.
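A small numeric sketch of the comparison for a single sigmoid output (illustrative, not taken from the slides): with target d = 1, the L2 loss flattens out where the unit saturates on the wrong side, while the KL/cross-entropy loss keeps a useful gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_loss(y, d):
    # squared error between sigmoid output y and target d
    return 0.5 * (y - d) ** 2

def kl_loss(y, d, eps=1e-12):
    # KL divergence / cross-entropy for a binary target d in {0, 1}
    return -(d * np.log(y + eps) + (1 - d) * np.log(1 - y + eps))

z = np.linspace(-6, 6, 7)
y = sigmoid(z)
print(l2_loss(y, 1.0))   # nearly flat for very negative z
print(kl_loss(y, 1.0))   # keeps growing as the output gets more wrong
```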

Batch normalization

Mini-batch assumption: every batch covers the same region of the input space.
However, in practice the batches can be far apart from each other.

Two steps of batch normalization:

  • move to the origin and normalize (subtract the batch mean, divide by the batch standard deviation)
  • shift to a common learned location (scale and offset)

Batch normalization sits between the affine transformation and the activation.
It cannot be done with single-sample SGD => it only makes sense with mini-batches (the full batch does not have this problem, since the data then all lies in the same region).
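A minimal forward-pass sketch of these two steps for one layer at training time (variable names and the eps value are assumptions):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    # z: (batch_size, num_units) pre-activation values of the mini-batch
    mu = z.mean(axis=0)                   # batch mean, per unit
    var = z.var(axis=0)                   # batch variance, per unit
    u = (z - mu) / np.sqrt(var + eps)     # step 1: move to the origin and normalize
    out = gamma * u + beta                # step 2: shift/scale to a learned common location
    cache = (z, u, mu, var, gamma, eps)   # saved for the backward pass
    return out, cache
```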

Computing the derivatives becomes painful now...

\frac{\partial Div}{\partial z_i} = \frac{\partial Div}{\partial u_i}\cdot \frac{\partial u_i}{\partial z_i} + \frac{\partial Div}{\partial \sigma_B^2}\cdot \frac{\partial \sigma_B^2}{\partial z_i} + \frac{\partial Div}{\partial \mu_B}\cdot \frac{\partial \mu_B}{\partial z_i}

\frac{\partial Div}{\partial u_i} was already given two slides earlier.
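A backward-pass sketch that follows the chain rule above; this is the standard derivation written out as an assumption, not the slide's exact code:

```python
import numpy as np

def batchnorm_backward(dout, cache):
    # dout: dDiv/d(out) flowing back from the next layer
    z, u, mu, var, gamma, eps = cache
    B = z.shape[0]                                    # mini-batch size
    dgamma = (dout * u).sum(axis=0)
    dbeta = dout.sum(axis=0)
    du = dout * gamma                                 # dDiv/du_i
    inv_std = 1.0 / np.sqrt(var + eps)
    # dDiv/d(sigma_B^2): sum over the batch, since every u_j depends on sigma_B^2
    dvar = (du * (z - mu) * -0.5 * inv_std ** 3).sum(axis=0)
    # dDiv/d(mu_B): direct term plus the term through sigma_B^2
    dmu = (-du * inv_std).sum(axis=0) + dvar * (-2.0 * (z - mu)).mean(axis=0)
    # dDiv/dz_i: the three-term chain rule written out above
    dz = du * inv_std + dvar * 2.0 * (z - mu) / B + dmu / B
    return dz, dgamma, dbeta
```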


https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time

My question here: would two groups of inputs with different labels produce the same outputs if the two groups have dramatically different means but the same shape of distribution?


For test data:

  • a batch of test inputs -> use the mean and variance of that batch? (not sure)
  • a single input -> use the historical running averages of the mean and variance (see the sketch below)
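A sketch of the inference-time path using running statistics accumulated during training (the bookkeeping and the momentum value are assumptions):

```python
import numpy as np

def batchnorm_inference(z, gamma, beta, running_mu, running_var, eps=1e-5):
    # at test time, use historical (running) mean/variance instead of batch statistics,
    # so a single input gets the same normalization it would get inside any batch
    u = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * u + beta

def update_running_stats(running_mu, running_var, mu_B, var_B, momentum=0.9):
    # exponential moving average over the training mini-batches
    running_mu = momentum * running_mu + (1 - momentum) * mu_B
    running_var = momentum * running_var + (1 - momentum) * var_B
    return running_mu, running_var
```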

Regularization and overfitting

10^30 possible inputs -> a full description would need all of these points;
even if we have 10^15 samples, the space is still nearly vacuous.

The sigmoid permits these steep curves.
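As a quick check (added here, not from the slides): for a sigmoid unit the slope of \sigma(wx) at the transition point is

\frac{d}{dx}\sigma(wx)\Big|_{x=0} = w\,\sigma'(0) = \frac{w}{4}

so large weights can make the transition arbitrarily steep, which is what lets the network fit such sharp functions.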


Another way to smooth the output -> go deeper!

Rearranging the structure: 660 params -> 3 layers × 220 neurons -> 4 layers × 165 neurons -> ... -> prefer narrow but deep NNs (a parameter-count sketch follows).
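A small helper for comparing such trade-offs; the layer widths below are illustrative assumptions, not the slide's example:

```python
def mlp_param_count(layer_sizes):
    # layer_sizes: [input_dim, hidden_1, ..., output_dim]
    # each layer contributes weights (fan_in * fan_out) plus biases (fan_out)
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(mlp_param_count([8, 128, 1]))        # one wide hidden layer: 1281 parameters
print(mlp_param_count([8, 16, 16, 16, 1])) # narrower but deeper: 705 parameters
```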

Dropout

Dropout is similar to bagging.

Different inputs may see different networks.

Pseudo-code:
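A minimal training-time dropout sketch, assuming the inverted-dropout variant (rescale by 1/p during training); the slide's pseudo-code may instead scale activations at test time:

```python
import numpy as np

def dropout_forward(a, p=0.5, train=True):
    # a: activations of a layer; p: probability of *keeping* a unit
    if not train:
        return a                                   # at test time, use the full network
    mask = (np.random.rand(*a.shape) < p) / p      # "inverted" dropout: rescale by 1/p
    return a * mask                                # each batch sees a different thinned network
```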

With dropout, the network is forced to learn a more robust model.

Where the gradient is high in some region -> things can blow up.


Workflow
