In 2014, before batch normalization was invented, training deep neural networks was hard.
For example, VGG was first trained with 11 layers; extra layers, randomly initialized, were then inserted into the pretrained network so that the deeper model could still converge.
Another example: GoogLeNet attached auxiliary classifiers ("early outputs") at intermediate layers to inject gradient into the lower layers.
Truncated backpropagation through time: only backpropagate within one chunk of the sequence (similar in spirit to mini-batch training). Downside: gradients cannot flow across chunk boundaries.
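The chunked training loop above can be sketched as follows. This is a minimal illustration, not the lecture's code: the tiny RNN (`Wxh`, `Whh`, `step`) and the chunk length are assumed for the example.

```python
import numpy as np

# Hypothetical tiny RNN; the weights and step function are illustrative only.
rng = np.random.default_rng(0)
D, H = 4, 8                               # input dim, hidden dim
Wxh = rng.standard_normal((H, D)) * 0.1
Whh = rng.standard_normal((H, H)) * 0.1

def step(h, x):
    """One vanilla RNN step: h' = tanh(Wxh x + Whh h)."""
    return np.tanh(Wxh @ x + Whh @ h)

seq = rng.standard_normal((100, D))       # one long input sequence
chunk_len = 25
h = np.zeros(H)
chunks_processed = 0

for start in range(0, len(seq), chunk_len):
    chunk = seq[start:start + chunk_len]
    # Forward through the chunk, carrying the hidden state along.
    for x in chunk:
        h = step(h, x)
    # Truncated BPTT: the backward pass would run through this chunk only;
    # the carried-in hidden state is treated as a constant (in a framework
    # such as PyTorch you would call h.detach() here before the next chunk).
    chunks_processed += 1
```

The hidden state is carried forward so the model still sees long-range context at the forward pass; only the gradient is cut at chunk boundaries, which keeps memory cost bounded by the chunk length.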
A char-RNN trained on source code learned to recite the GNU license, including the FSF's old address, "675 Mass Ave" (near Central Square in Cambridge).
The results are not perfect, though.
Soft attention -> take a weighted combination of features from all image locations.
Hard attention -> force the model to select exactly one location to look at -> trickier: not differentiable -> needs reinforcement learning (covered later in the RL lecture).
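Soft attention, the differentiable case, can be sketched in a few lines. This is a minimal example under assumed names: `features` is a grid of image features flattened to one row per location, and `query` stands in for the decoder hidden state; the dot-product scoring is one common choice, not necessarily the one from the lecture.

```python
import numpy as np

def soft_attention(features, query):
    """Soft attention: a weighted combination of all image locations.

    features: (L, D) array, one D-dim feature vector per image location.
    query:    (D,) vector, e.g. the decoder hidden state (assumed setup).
    Returns the context vector and the attention weights.
    """
    scores = features @ query                         # (L,) alignment scores
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over locations
    context = weights @ features                      # (D,) weighted combination
    return context, weights

rng = np.random.default_rng(0)
feats = rng.standard_normal((49, 16))   # e.g. a 7x7 grid of 16-dim features
q = rng.standard_normal(16)
ctx, w = soft_attention(feats, q)
```

Because every location contributes with a nonzero weight, the whole operation is differentiable and can be trained with plain backpropagation; hard attention replaces the softmax-weighted sum with sampling a single location, which is why it needs RL-style training instead.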