![](https://img.haomeiwen.com/i11683600/8c7293b6a9a93270.png)
Convergence rate:
- SGD: one sample, one update, then decay the LR
- mini-batch: one update every several samples, so the LR decays once per batch (see the sketch below)
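A minimal sketch of the difference, assuming a simple 1/t decay (the slides do not fix a particular schedule; `grad_fn` and `data` are placeholders):

```python
import numpy as np

def sgd_per_sample(w, grad_fn, data, lr0=0.1):
    """Pure SGD: one sample -> one update, and the LR decays after every update."""
    t = 0
    for x, y in data:
        t += 1
        lr = lr0 / t                      # assumed 1/t decay
        w = w - lr * grad_fn(w, x, y)
    return w

def sgd_minibatch(w, grad_fn, data, batch_size=32, lr0=0.1):
    """Mini-batch SGD: several samples -> one update, so the LR decays once per batch."""
    t = 0
    for i in range(0, len(data), batch_size):
        t += 1
        batch = data[i:i + batch_size]
        g = np.mean([grad_fn(w, x, y) for x, y in batch], axis=0)
        w = w - (lr0 / t) * g
    return w
```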
![](https://img.haomeiwen.com/i11683600/ca09faeb815cea3b.png)
![](https://img.haomeiwen.com/i11683600/ec9d294cee4f4554.png)
![](https://img.haomeiwen.com/i11683600/aa847a7bb911714c.png)
![](https://img.haomeiwen.com/i11683600/df2f68420b288131.png)
let's first look back
![](https://img.haomeiwen.com/i11683600/5b97b0ed981ab3d0.png)
![](https://img.haomeiwen.com/i11683600/fadd5834a5f170db.png)
![](https://img.haomeiwen.com/i11683600/c8a6aaa7ce1ddcbe.png)
![](https://img.haomeiwen.com/i11683600/ec85859604eb1132.png)
Be careful with the notation: this is not the second derivative, but the square of the first derivative.
![](https://img.haomeiwen.com/i11683600/c4643b534dd63035.png)
![](https://img.haomeiwen.com/i11683600/e3f81fab46b2ac79.png)
RMSprop: in the ideal case, the gradient and the denominator cancel out, so only the sign of the gradient remains.
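A minimal RMSprop sketch (hyperparameter names follow the usual convention, assumed here). Note that `grad**2` is the square of the first derivative, not a second derivative; dividing by its root roughly cancels the gradient's magnitude and keeps mostly its sign:

```python
import numpy as np

def rmsprop_step(w, grad, sq, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update; `sq` is the running mean of squared gradients."""
    sq = rho * sq + (1 - rho) * grad**2
    w = w - lr * grad / (np.sqrt(sq) + eps)   # in the ideal case ~ lr * sign(grad)
    return w, sq

# usage: sq = np.zeros_like(w) before the first step
```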
![](https://img.haomeiwen.com/i11683600/2402d1b221fb98ad.png)
Momentum only looks at the (running) average of the gradient, while RMSprop looks at the mean square of the gradient.
Is there a method that uses both the first- and second-order gradient history? -> Adam
![](https://img.haomeiwen.com/i11683600/3bc59211d72c17be.png)
![](https://img.haomeiwen.com/i11683600/539681c87cefec60.png)
Initially the moving averages start at zero and the decay factor is close to 1, so the early steps would be too small.
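A minimal Adam sketch (standard hyperparameter names assumed), showing how the bias correction compensates for the zero-initialized moments in the early steps:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v start at zero, t is the step counter starting at 1."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum-like average)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (RMS-like average)
    m_hat = m / (1 - beta1**t)                # bias correction: without it the
    v_hat = v / (1 - beta2**t)                # early steps would be far too small
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```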
![](https://img.haomeiwen.com/i11683600/a6d2442f0939f7c2.png)
What do we want in Adam?
![](https://img.haomeiwen.com/i11683600/71360b4d9d5bb826.png)
![](https://img.haomeiwen.com/i11683600/e6a1e8193240c54d.png)
![](https://img.haomeiwen.com/i11683600/8d0b181ad772a791.png)
If the problem is more complex, SGD may not be that bad.
Methods that also use the second-order gradient history are usually fast enough and do not oscillate.
![](https://img.haomeiwen.com/i11683600/58d7b327451fe7b6.png)
![](https://img.haomeiwen.com/i11683600/baebb67d36ef463a.png)
![](https://img.haomeiwen.com/i11683600/1a5fbb4a0a1409a9.png)
![](https://img.haomeiwen.com/i11683600/43ffb1643c68caa6.png)
![](https://img.haomeiwen.com/i11683600/4bdad6be7e0b3a00.png)
Designing the objective function!!! Hopefully we use the right one.
![](https://img.haomeiwen.com/i11683600/d2e0cc0edb417638.png)
both L2 and KL div are convex
![](https://img.haomeiwen.com/i11683600/de6a5caa6d40d94b.png)
![](https://img.haomeiwen.com/i11683600/18974eb46e6389e7.png)
regression => L2
classification => KL
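A minimal illustration of the two pairings (numpy and one-hot targets assumed): L2 / squared error for regression, cross-entropy (equal to KL divergence up to a constant for fixed one-hot targets) for classification:

```python
import numpy as np

def l2_loss(y_pred, y_true):
    """Regression: mean squared error (L2)."""
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy_loss(logits, y_onehot, eps=1e-12):
    """Classification: softmax + cross-entropy."""
    z = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=1))
```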
![](https://img.haomeiwen.com/i11683600/785406ba5b26f54c.png)
![](https://img.haomeiwen.com/i11683600/d17b1ec7b372a550.png)
![](https://img.haomeiwen.com/i11683600/e4f05eb80226e354.png)
![](https://img.haomeiwen.com/i11683600/de48bcabb23e9446.png)
![](https://img.haomeiwen.com/i11683600/a4570a4ba4ae8452.png)
![](https://img.haomeiwen.com/i11683600/e4e9f44622229136.png)
Batch normalization???
mini-batch => the assumption is that every batch covers the same region of the input space
however, batches can be far apart from each other
![](https://img.haomeiwen.com/i11683600/4dc94698c5046c4c.png)
![](https://img.haomeiwen.com/i11683600/c69d0ebf4b829623.png)
![](https://img.haomeiwen.com/i11683600/8e3dfec642a658a5.png)
Two steps of batch normalization (see the sketch after this list):
- move to the origin: normalize to zero mean and unit variance
- shift to the common location: scale and shift with learned parameters
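A minimal sketch of the two steps at training time (the learnable scale/shift parameters are conventionally called gamma and beta; names assumed):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 1: move to the origin, normalize
    out = gamma * x_hat + beta               # step 2: shift to the common location
    return out, (x_hat, var, gamma, eps)
```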
![](https://img.haomeiwen.com/i11683600/338b43e96ea8d37b.png)
Batch normalization sits between the affine transformation and the activation (see the sketch below).
It can't be done with plain SGD on single samples => it only makes sense with mini-batches (with the whole batch it is unnecessary, since all the data already lies in the same region).
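As an illustration of the placement (the layer ordering affine -> batch norm -> activation, with ReLU, is an assumption here):

```python
import numpy as np

def affine_bn_relu(x, W, b, gamma, beta, eps=1e-5):
    """One layer: affine transformation, then batch normalization, then activation."""
    h = x @ W + b                                          # affine transformation
    mu, var = h.mean(axis=0), h.var(axis=0)                # statistics of the mini-batch
    h_bn = gamma * (h - mu) / np.sqrt(var + eps) + beta    # batch normalization
    return np.maximum(0.0, h_bn)                           # activation (ReLU)
```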
![](https://img.haomeiwen.com/i11683600/c406dbe083b257c7.png)
![](https://img.haomeiwen.com/i11683600/1bc2d6123fbca176.png)
Computing the derivatives becomes painful now...
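A rough sketch of why (the compact form of the standard batch-norm backward pass; it uses the cache from `batchnorm_forward` above):

```python
import numpy as np

def batchnorm_backward(dout, cache):
    x_hat, var, gamma, eps = cache
    std = np.sqrt(var + eps)

    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    dx_hat = dout * gamma
    # every sample's gradient couples with the batch mean and variance,
    # which is what makes the hand derivation painful
    dx = (dx_hat - dx_hat.mean(axis=0) - x_hat * (dx_hat * x_hat).mean(axis=0)) / std
    return dx, dgamma, dbeta
```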
![](https://img.haomeiwen.com/i11683600/08d02dca2bae9b02.png)
![](https://img.haomeiwen.com/i11683600/e95eef701aa8e73f.png)
This was already given two slides back.
![](https://img.haomeiwen.com/i11683600/245af191e22c353b.png)
![](https://img.haomeiwen.com/i11683600/0285261f517ee349.png)
![](https://img.haomeiwen.com/i11683600/291041765d271ad3.png)
https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time
My question here: if two groups of inputs have different labels, dramatically different means, and the same shape of distribution, will they produce the same outputs after normalization?
![](https://img.haomeiwen.com/i11683600/7b5ca85c03824649.png)
![](https://img.haomeiwen.com/i11683600/6968c10bf083c62e.png)
For test data (see the sketch below):
- a batch -> use the mean and variance of that batch? (not sure)
- a single input -> use the historical running mean and variance from training
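A minimal sketch of the usual approach (momentum value assumed): exponential running averages of the batch statistics are kept during training, and at test time those running statistics are used instead of statistics computed from the test data; in most implementations this holds even when the test data arrives in batches:

```python
import numpy as np

def bn_update_running_stats(x, running_mean, running_var, momentum=0.9):
    """Training time: update running statistics from each mini-batch."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mean, running_var

def bn_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Test time: normalize with the historical statistics, not the test batch's."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```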
![](https://img.haomeiwen.com/i11683600/6f53398289ee4fac.png)
![](https://img.haomeiwen.com/i11683600/3d4f3e3c30c76898.png)
![](https://img.haomeiwen.com/i11683600/74a4606cb17a2d7f.png)
![](https://img.haomeiwen.com/i11683600/de03691ff6b6394d.png)
regularization and overfitting
![](https://img.haomeiwen.com/i11683600/244e1513095abf9f.png)
![](https://img.haomeiwen.com/i11683600/ac36931f0b14b993.png)
![](https://img.haomeiwen.com/i11683600/362953c1264d94d8.png)
10^30 possible inputs -> what a full description of the space would require
even with 10^15 samples, the space is still nearly vacuous
![](https://img.haomeiwen.com/i11683600/268a750b418ca728.png)
![](https://img.haomeiwen.com/i11683600/3a3ae03f3110aa5a.png)
![](https://img.haomeiwen.com/i11683600/282a6062647ac6b4.png)
![](https://img.haomeiwen.com/i11683600/a8f5ce3948c845cc.png)
sigmoid permits these steep curves
![](https://img.haomeiwen.com/i11683600/f81dff5d915360fd.png)
![](https://img.haomeiwen.com/i11683600/5aa57a927a044f34.png)
![](https://img.haomeiwen.com/i11683600/dd9a50eff80769ed.png)
![](https://img.haomeiwen.com/i11683600/0f1bc01d78f21f77.png)
![](https://img.haomeiwen.com/i11683600/d01e039213b8c9f6.png)
![](https://img.haomeiwen.com/i11683600/f99dd04c153b0450.png)
Another way to smooth the output -> go deeper!
![](https://img.haomeiwen.com/i11683600/133d1b6fbb0f9445.png)
![](https://img.haomeiwen.com/i11683600/fc95b8fb355350de.png)
![](https://img.haomeiwen.com/i11683600/0afd3297354d6be9.png)
![](https://img.haomeiwen.com/i11683600/41ce80c916d6a6de.png)
Dropout
![](https://img.haomeiwen.com/i11683600/18b00606893e6e25.png)
Dropout is similar to bagging.
![](https://img.haomeiwen.com/i11683600/a59e204b77301519.png)
![](https://img.haomeiwen.com/i11683600/cbfdae8738f07967.png)
![](https://img.haomeiwen.com/i11683600/fdd95ac1e93206fd.png)
![](https://img.haomeiwen.com/i11683600/14abcefb5f7ea511.png)
pseudo code
![](https://img.haomeiwen.com/i11683600/b6645a10a3f2c3d3.png)
![](https://img.haomeiwen.com/i11683600/c34c610b5370aea1.png)
![](https://img.haomeiwen.com/i11683600/18d73fc90023ded1.png)
![](https://img.haomeiwen.com/i11683600/43288c476e90da7a.png)
![](https://img.haomeiwen.com/i11683600/e0a335b5dc3f3da0.png)
![](https://img.haomeiwen.com/i11683600/a8caed1f4a088d47.png)
![](https://img.haomeiwen.com/i11683600/01265a0b75703a09.png)
With dropout, the NN is forced to learn a more robust model.
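A minimal inverted-dropout sketch (the keep probability `p` is an assumed parameter name): units are dropped at random during training and the survivors are rescaled, so the full network can be used unchanged at test time:

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: keep each unit with probability p, rescale kept ones by 1/p."""
    if not train:
        return x, None                             # test time: no dropout, no rescaling
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask, mask

def dropout_backward(dout, mask):
    """Gradients flow only through the units that were kept."""
    return dout * mask
```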
![](https://img.haomeiwen.com/i11683600/ce8916cefb6c8e18.png)
![](https://img.haomeiwen.com/i11683600/8277b9021192a0c6.png)
![](https://img.haomeiwen.com/i11683600/875d24fe65fcef42.png)
![](https://img.haomeiwen.com/i11683600/cf101c09a1c78316.png)
The gradient is high in some regions -> it can blow up.
![](https://img.haomeiwen.com/i11683600/09c3832fa39c00ed.png)
![](https://img.haomeiwen.com/i11683600/976ef3246a08eee6.png)
Workflow
![](https://img.haomeiwen.com/i11683600/e2dc7367479d5ada.png)