![](https://img.haomeiwen.com/i11683600/8c7293b6a9a93270.png)
Convergence rate:
- SGD: one sample, one update, then decay the LR
- mini-batch: one update every several samples, so the LR decays once per batch (see the sketch below)
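A minimal sketch of the difference, assuming a simple 1/t decay (the slides do not fix a particular schedule; `grad_fn` and `data` are placeholders):

```python
import numpy as np

def sgd_per_sample(w, grad_fn, data, lr0=0.1):
    """Pure SGD: one sample -> one update, and the LR decays after every update."""
    t = 0
    for x, y in data:
        t += 1
        lr = lr0 / t                      # assumed 1/t decay
        w = w - lr * grad_fn(w, x, y)
    return w

def sgd_minibatch(w, grad_fn, data, batch_size=32, lr0=0.1):
    """Mini-batch SGD: several samples -> one update, so the LR decays once per batch."""
    t = 0
    for i in range(0, len(data), batch_size):
        t += 1
        batch = data[i:i + batch_size]
        g = np.mean([grad_fn(w, x, y) for x, y in batch], axis=0)
        w = w - (lr0 / t) * g
    return w
```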
![](https://img.haomeiwen.com/i11683600/ca09faeb815cea3b.png)
![](https://img.haomeiwen.com/i11683600/ec9d294cee4f4554.png)
![](https://img.haomeiwen.com/i11683600/aa847a7bb911714c.png)
![](https://img.haomeiwen.com/i11683600/df2f68420b288131.png)
let's first look back
![](https://img.haomeiwen.com/i11683600/5b97b0ed981ab3d0.png)
![](https://img.haomeiwen.com/i11683600/fadd5834a5f170db.png)
![](https://img.haomeiwen.com/i11683600/c8a6aaa7ce1ddcbe.png)
![](https://img.haomeiwen.com/i11683600/ec85859604eb1132.png)
Be careful with the notation: this is not the second derivative, but the square of the first derivative.
![](https://img.haomeiwen.com/i11683600/c4643b534dd63035.png)
![](https://img.haomeiwen.com/i11683600/e3f81fab46b2ac79.png)
RMSprop: in the ideal case, the gradient and the denominator cancel out, so only the sign of the gradient remains.
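A minimal RMSprop sketch (hyperparameter names follow the usual convention, assumed here). Note that `grad**2` is the square of the first derivative, not a second derivative; dividing by its root roughly cancels the gradient's magnitude and keeps mostly its sign:

```python
import numpy as np

def rmsprop_step(w, grad, sq, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update; `sq` is the running mean of squared gradients."""
    sq = rho * sq + (1 - rho) * grad**2
    w = w - lr * grad / (np.sqrt(sq) + eps)   # in the ideal case ~ lr * sign(grad)
    return w, sq

# usage: sq = np.zeros_like(w) before the first step
```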
![](https://img.haomeiwen.com/i11683600/2402d1b221fb98ad.png)
Momentum only looks at the (running) average of the gradient, while RMSprop looks at the mean square of the gradient.
Is there a method that uses both the first- and second-order gradient history? -> Adam
![](https://img.haomeiwen.com/i11683600/3bc59211d72c17be.png)
![](https://img.haomeiwen.com/i11683600/539681c87cefec60.png)
Initially the moving averages start at zero and the decay factor is close to 1, so the early steps would be too small.
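A minimal Adam sketch (standard hyperparameter names assumed), showing how the bias correction compensates for the zero-initialized moments in the early steps:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v start at zero, t is the step counter starting at 1."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum-like average)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (RMS-like average)
    m_hat = m / (1 - beta1**t)                # bias correction: without it the
    v_hat = v / (1 - beta2**t)                # early steps would be far too small
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```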
![](https://img.haomeiwen.com/i11683600/a6d2442f0939f7c2.png)
What do we want in Adam?
![](https://img.haomeiwen.com/i11683600/71360b4d9d5bb826.png)
![](https://img.haomeiwen.com/i11683600/e6a1e8193240c54d.png)
![](https://img.haomeiwen.com/i11683600/8d0b181ad772a791.png)
If the problem is more complex, SGD may not be that bad.
Methods that also use the second-order gradient history are usually fast enough and do not oscillate.
![](https://img.haomeiwen.com/i11683600/58d7b327451fe7b6.png)
![](https://img.haomeiwen.com/i11683600/baebb67d36ef463a.png)
![](https://img.haomeiwen.com/i11683600/1a5fbb4a0a1409a9.png)
![](https://img.haomeiwen.com/i11683600/43ffb1643c68caa6.png)
![](https://img.haomeiwen.com/i11683600/4bdad6be7e0b3a00.png)
Designing the objective function!!! Hopefully we use the right one.
![](https://img.haomeiwen.com/i11683600/d2e0cc0edb417638.png)
both L2 and KL div are convex
![](https://img.haomeiwen.com/i11683600/de6a5caa6d40d94b.png)
![](https://img.haomeiwen.com/i11683600/18974eb46e6389e7.png)
regression => L2
classification => KL
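A minimal illustration of the two pairings (numpy and one-hot targets assumed): L2 / squared error for regression, cross-entropy (equal to KL divergence up to a constant for fixed one-hot targets) for classification:

```python
import numpy as np

def l2_loss(y_pred, y_true):
    """Regression: mean squared error (L2)."""
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy_loss(logits, y_onehot, eps=1e-12):
    """Classification: softmax + cross-entropy."""
    z = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=1))
```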
![](https://img.haomeiwen.com/i11683600/785406ba5b26f54c.png)
![](https://img.haomeiwen.com/i11683600/d17b1ec7b372a550.png)
![](https://img.haomeiwen.com/i11683600/e4f05eb80226e354.png)
![](https://img.haomeiwen.com/i11683600/de48bcabb23e9446.png)
![](https://img.haomeiwen.com/i11683600/a4570a4ba4ae8452.png)
![](https://img.haomeiwen.com/i11683600/e4e9f44622229136.png)
Batch normalization???
mini-batch => the assumption is that every batch covers the same region of the input space
however, batches can be far apart from each other
![](https://img.haomeiwen.com/i11683600/4dc94698c5046c4c.png)
![](https://img.haomeiwen.com/i11683600/c69d0ebf4b829623.png)
![](https://img.haomeiwen.com/i11683600/8e3dfec642a658a5.png)
Two steps of batch normalization (see the sketch after this list):
- move to the origin: normalize to zero mean and unit variance
- shift to the common location: scale and shift with learned parameters
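A minimal sketch of the two steps at training time (the learnable scale/shift parameters are conventionally called gamma and beta; names assumed):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 1: move to the origin, normalize
    out = gamma * x_hat + beta               # step 2: shift to the common location
    return out, (x_hat, var, gamma, eps)
```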
![](https://img.haomeiwen.com/i11683600/338b43e96ea8d37b.png)
Batch normalization sits between the affine transformation and the activation (see the sketch below).
It can't be done with plain SGD on single samples => it only makes sense with mini-batches (with the whole batch it is unnecessary, since all the data already lies in the same region).
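As an illustration of the placement (the layer ordering affine -> batch norm -> activation, with ReLU, is an assumption here):

```python
import numpy as np

def affine_bn_relu(x, W, b, gamma, beta, eps=1e-5):
    """One layer: affine transformation, then batch normalization, then activation."""
    h = x @ W + b                                          # affine transformation
    mu, var = h.mean(axis=0), h.var(axis=0)                # statistics of the mini-batch
    h_bn = gamma * (h - mu) / np.sqrt(var + eps) + beta    # batch normalization
    return np.maximum(0.0, h_bn)                           # activation (ReLU)
```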
![](https://img.haomeiwen.com/i11683600/c406dbe083b257c7.png)
![](https://img.haomeiwen.com/i11683600/1bc2d6123fbca176.png)
Computing the derivatives becomes painful now...
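A rough sketch of why (the compact form of the standard batch-norm backward pass; it uses the cache from `batchnorm_forward` above):

```python
import numpy as np

def batchnorm_backward(dout, cache):
    x_hat, var, gamma, eps = cache
    std = np.sqrt(var + eps)

    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    dx_hat = dout * gamma
    # every sample's gradient couples with the batch mean and variance,
    # which is what makes the hand derivation painful
    dx = (dx_hat - dx_hat.mean(axis=0) - x_hat * (dx_hat * x_hat).mean(axis=0)) / std
    return dx, dgamma, dbeta
```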
![](https://img.haomeiwen.com/i11683600/08d02dca2bae9b02.png)
![](https://img.haomeiwen.com/i11683600/e95eef701aa8e73f.png)
This was already given two slides back.
![](https://img.haomeiwen.com/i11683600/245af191e22c353b.png)
![](https://img.haomeiwen.com/i11683600/0285261f517ee349.png)
![](https://img.haomeiwen.com/i11683600/291041765d271ad3.png)
https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time
My question here: if two groups of inputs have different labels, dramatically different means, and the same shape of distribution, will they produce the same outputs after normalization?
![](https://img.haomeiwen.com/i11683600/7b5ca85c03824649.png)
![](https://img.haomeiwen.com/i11683600/6968c10bf083c62e.png)
For test data (see the sketch below):
- a batch -> use the mean and variance of that batch? (not sure)
- a single input -> use the historical running mean and variance from training
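A minimal sketch of the usual approach (momentum value assumed): exponential running averages of the batch statistics are kept during training, and at test time those running statistics are used instead of statistics computed from the test data; in most implementations this holds even when the test data arrives in batches:

```python
import numpy as np

def bn_update_running_stats(x, running_mean, running_var, momentum=0.9):
    """Training time: update running statistics from each mini-batch."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mean, running_var

def bn_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Test time: normalize with the historical statistics, not the test batch's."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```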
![](https://img.haomeiwen.com/i11683600/6f53398289ee4fac.png)
![](https://img.haomeiwen.com/i11683600/3d4f3e3c30c76898.png)
![](https://img.haomeiwen.com/i11683600/74a4606cb17a2d7f.png)
![](https://img.haomeiwen.com/i11683600/de03691ff6b6394d.png)
regularization and overfitting
![](https://img.haomeiwen.com/i11683600/244e1513095abf9f.png)
![](https://img.haomeiwen.com/i11683600/ac36931f0b14b993.png)
![](https://img.haomeiwen.com/i11683600/362953c1264d94d8.png)
10^30 possible inputs -> what a full description of the space would require
even with 10^15 samples, the space is still nearly vacuous
![](https://img.haomeiwen.com/i11683600/268a750b418ca728.png)
![](https://img.haomeiwen.com/i11683600/3a3ae03f3110aa5a.png)
![](https://img.haomeiwen.com/i11683600/282a6062647ac6b4.png)
![](https://img.haomeiwen.com/i11683600/a8f5ce3948c845cc.png)
sigmoid permits these steep curves
![](https://img.haomeiwen.com/i11683600/f81dff5d915360fd.png)
![](https://img.haomeiwen.com/i11683600/5aa57a927a044f34.png)
![](https://img.haomeiwen.com/i11683600/dd9a50eff80769ed.png)
![](https://img.haomeiwen.com/i11683600/0f1bc01d78f21f77.png)
![](https://img.haomeiwen.com/i11683600/d01e039213b8c9f6.png)
![](https://img.haomeiwen.com/i11683600/f99dd04c153b0450.png)
Another way to smooth the output -> go deeper!
![](https://img.haomeiwen.com/i11683600/133d1b6fbb0f9445.png)
![](https://img.haomeiwen.com/i11683600/fc95b8fb355350de.png)
![](https://img.haomeiwen.com/i11683600/0afd3297354d6be9.png)
![](https://img.haomeiwen.com/i11683600/41ce80c916d6a6de.png)
Dropout
![](https://img.haomeiwen.com/i11683600/18b00606893e6e25.png)
Dropout is similar to bagging.
![](https://img.haomeiwen.com/i11683600/a59e204b77301519.png)
![](https://img.haomeiwen.com/i11683600/cbfdae8738f07967.png)
![](https://img.haomeiwen.com/i11683600/fdd95ac1e93206fd.png)
![](https://img.haomeiwen.com/i11683600/14abcefb5f7ea511.png)
pseudo code
![](https://img.haomeiwen.com/i11683600/b6645a10a3f2c3d3.png)
![](https://img.haomeiwen.com/i11683600/c34c610b5370aea1.png)
![](https://img.haomeiwen.com/i11683600/18d73fc90023ded1.png)
![](https://img.haomeiwen.com/i11683600/43288c476e90da7a.png)
![](https://img.haomeiwen.com/i11683600/e0a335b5dc3f3da0.png)
![](https://img.haomeiwen.com/i11683600/a8caed1f4a088d47.png)
![](https://img.haomeiwen.com/i11683600/01265a0b75703a09.png)
With dropout, the NN is forced to learn a more robust model.
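A minimal inverted-dropout sketch (the keep probability `p` is an assumed parameter name): units are dropped at random during training and the survivors are rescaled, so the full network can be used unchanged at test time:

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: keep each unit with probability p, rescale kept ones by 1/p."""
    if not train:
        return x, None                             # test time: no dropout, no rescaling
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask, mask

def dropout_backward(dout, mask):
    """Gradients flow only through the units that were kept."""
    return dout * mask
```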
![](https://img.haomeiwen.com/i11683600/ce8916cefb6c8e18.png)
![](https://img.haomeiwen.com/i11683600/8277b9021192a0c6.png)
![](https://img.haomeiwen.com/i11683600/875d24fe65fcef42.png)
![](https://img.haomeiwen.com/i11683600/cf101c09a1c78316.png)
The gradient is high in some regions -> it can blow up.
![](https://img.haomeiwen.com/i11683600/09c3832fa39c00ed.png)
![](https://img.haomeiwen.com/i11683600/976ef3246a08eee6.png)
Workflow
![](https://img.haomeiwen.com/i11683600/e2dc7367479d5ada.png)