Setting up your Machine Learning Application

Train/dev/test sets distribution

Problem
Mismatched train/test distribution
Solution
Make sure the dev and test sets come from the same distribution
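One way to get matching dev and test distributions is to draw both from a single shuffled pool. The sketch below is only an illustration: the helper name `split_dev_test`, the 50/50 split and the toy `target_pool` are assumptions, not part of the original notes.

```python
import random

def split_dev_test(examples, dev_fraction=0.5, seed=0):
    """Shuffle one pool of examples, then cut it into dev and test sets."""
    pool = list(examples)
    random.Random(seed).shuffle(pool)   # same pool, so same distribution
    cut = int(len(pool) * dev_fraction)
    return pool[:cut], pool[cut:]

# Hypothetical pool of (example, label) pairs from the target distribution.
target_pool = [("img_%d" % i, i % 2) for i in range(10)]
dev_set, test_set = split_dev_test(target_pool)
```

The training set may come from a different source; the point is only that dev and test are cut from the same pool.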

Bias/Variance

image.png

Basic Recipe for Machine Learning

image.png

Regularizing your neural network

Regularization

image.png

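Why does regularization reduce overfitting?
If λ is set very large, the weights W are pushed close to 0, which effectively removes the influence of some hidden units. The large neural network then behaves like a much smaller one.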

image.png
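[Note] When debugging gradient descent, one step is to plot the cost function J against the number of iterations; J should decrease monotonically after every iteration. If you have implemented regularization, remember that J now has a new definition. If you keep plotting the old J (the first term only), you may not see a monotonic decrease. So when debugging gradient descent, make sure you plot the new J, which includes the second (regularization) term; otherwise J may not decrease on every iteration.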
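A minimal sketch of the "new" J, assuming the usual L2 form J = data loss + (λ / 2m) · Σ‖W‖². The function name `l2_regularized_cost` and the toy weights are illustrative assumptions; the exact formula in the missing figure may differ.

```python
def l2_regularized_cost(data_loss, weights, lam, m):
    """Return data_loss + (lam / (2 * m)) * sum of squared weights.

    weights: list of weight matrices, each given as a list of rows.
    """
    penalty = sum(w * w for W in weights for row in W for w in row)
    return data_loss + lam / (2 * m) * penalty

# Toy values, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 0.0]]
W2 = [[3.0, 1.0]]
J_old = 0.4                                    # first term only
J_new = l2_regularized_cost(J_old, [W1, W2], lam=0.1, m=10)
```

When plotting the learning curve for debugging, `J_new` is the quantity to plot, not `J_old`.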

Dropout Regularization

image.png
image.png
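Implementing dropout (inverted dropout)
[Note 1] To avoid reducing the expected value of z4, we divide a3 by 0.8 (the keep probability). This compensates for the roughly 20% of units that were dropped, so the expected value of a3 is unchanged. This is the inverted dropout technique. It simplifies test time because it avoids introducing a scaling problem there.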
image.png
[Note 2] Do not use dropout at prediction time.
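The two notes above can be sketched in plain Python. This is an illustration under assumptions: `inverted_dropout` is a hypothetical helper, the activations are a flat list, and keep_prob = 0.8 matches the 0.8 used in Note 1.

```python
import random

def inverted_dropout(a, keep_prob=0.8, rng=random):
    """Zero each activation with probability 1 - keep_prob, then rescale."""
    out = []
    for value in a:
        keep = rng.random() < keep_prob
        # Dividing by keep_prob keeps the expected activation unchanged.
        out.append(value / keep_prob if keep else 0.0)
    return out

rng = random.Random(0)
a3 = [1.0] * 10000
a3_train = inverted_dropout(a3, keep_prob=0.8, rng=rng)
mean_train = sum(a3_train) / len(a3_train)   # stays close to 1.0

# Prediction time: no mask and no extra scaling.
a3_test = a3
```

Because the rescaling already happens during training, the prediction path needs no correction factor.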
Understanding Dropout
Intuition: a unit can't rely on any one feature, so it has to spread out the weights
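Other regularization methods
Data augmentation: horizontal flips, random crops, random distortions
Early stopping: it tries to reduce J and avoid overfitting at the same time, which couples the two tasks and makes things more complicated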
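A rough sketch of early stopping, assuming the dev-set error is recorded once per epoch. The helper `early_stopping_epoch`, the `patience` rule and the error values are illustrative assumptions, not taken from the notes.

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # dev error has not improved for `patience` epochs
    return best_epoch

# Made-up dev errors: a dip, a small rise, then a later improvement.
dev_errors = [0.50, 0.40, 0.35, 0.36, 0.38, 0.30]
stop_at = early_stopping_epoch(dev_errors)
```

In this toy run training halts before the last, lower error is ever reached, which illustrates the coupling: stopping to limit overfitting also stops the optimization of J.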

image.png

Setting up your optimization problem

Normalizing inputs
Use the same μ and σ (computed on the training set) to normalize the test set
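A small sketch of this rule for a single feature column. The helper names `fit_normalizer` and `normalize` and the toy data are assumptions for illustration.

```python
import math

def fit_normalizer(train_column):
    """Compute mean and standard deviation from the training data only."""
    m = len(train_column)
    mu = sum(train_column) / m
    var = sum((x - mu) ** 2 for x in train_column) / m
    return mu, math.sqrt(var)

def normalize(column, mu, sigma):
    """Shift and scale a column with the given statistics."""
    return [(x - mu) / sigma for x in column]

x_train = [2.0, 4.0, 6.0, 8.0]
x_test = [5.0, 9.0]

mu, sigma = fit_normalizer(x_train)            # statistics from train only
x_train_norm = normalize(x_train, mu, sigma)
x_test_norm = normalize(x_test, mu, sigma)     # reuse the same mu and sigma
```

Recomputing μ and σ on the test set would put train and test inputs on different scales.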

image.png
Vanishing/Exploding gradients
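If the weight matrices W are only slightly larger than 1 (that is, slightly larger than the identity matrix), the activations in a very deep network explode. If W is only slightly smaller than the identity matrix and the network is very deep, the activations shrink exponentially.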
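A toy illustration of the effect, assuming every layer is linear with W equal to a scalar times the identity, so the activation scale is just that scalar raised to the depth. The function name and the values 1.5, 0.5 and 50 layers are illustrative choices.

```python
def output_scale(weight, depth):
    """Activation scale after `depth` linear layers with W = weight * I."""
    a = 1.0
    for _ in range(depth):
        a *= weight
    return a

exploding = output_scale(1.5, 50)   # grows exponentially with depth
vanishing = output_scale(0.5, 50)   # shrinks exponentially with depth
```

The same exponential behaviour applies to the gradients during backpropagation, which is what makes training slow or unstable.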
Weight Initialization for Deep Networks
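For ReLU, set the weight variance to 2/n; for other activations such as tanh, use 1/n (Xavier initialization). Here n is the number of inputs to the layer.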
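A sketch of variance-scaled initialization, assuming the common convention of variance 2/n for ReLU and 1/n (Xavier) for tanh, where n is the number of inputs to the layer. The function `init_weights` and the layer sizes are hypothetical.

```python
import math
import random

def init_weights(n_in, n_out, activation="relu", seed=0):
    """Draw an n_out x n_in matrix of zero-mean Gaussian weights."""
    rng = random.Random(seed)
    # Assumed convention: 2/n_in for ReLU, 1/n_in otherwise (e.g. tanh).
    variance = 2.0 / n_in if activation == "relu" else 1.0 / n_in
    std = math.sqrt(variance)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

W = init_weights(n_in=400, n_out=50, activation="relu")
flat = [w for row in W for w in row]
sample_var = sum(w * w for w in flat) / len(flat)   # should be near 2/400
```

Scaling the variance by 1/n keeps the sum z = Σ wᵢxᵢ at a similar scale regardless of how many inputs feed the unit.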

image.png
Numerical approximation of gradients
The two-sided difference is more accurate than the one-sided difference
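A quick numerical comparison, assuming f(θ) = θ³ at θ = 1 with ε = 0.01. The function names and these values are illustrative choices.

```python
def one_sided(f, theta, eps=0.01):
    """Forward difference: (f(theta + eps) - f(theta)) / eps."""
    return (f(theta + eps) - f(theta)) / eps

def two_sided(f, theta, eps=0.01):
    """Central difference: (f(theta + eps) - f(theta - eps)) / (2 * eps)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def f(t):
    return t ** 3

true_grad = 3.0                                   # f'(1) = 3 * 1**2
err_one = abs(one_sided(f, 1.0) - true_grad)      # about 0.03
err_two = abs(two_sided(f, 1.0) - true_grad)      # about 0.0001
```

The one-sided error shrinks in proportion to ε, while the two-sided error shrinks in proportion to ε², which is why the two-sided form is preferred for checking gradients.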
Gradient Checking

image.png
image.png
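A minimal gradient-checking sketch, assuming the parameters are flattened into one list θ and the check compares a two-sided numerical gradient with the analytic one through the relative difference ‖dθ_approx − dθ‖ / (‖dθ_approx‖ + ‖dθ‖). The helper `grad_check` and the toy cost function are assumptions for illustration.

```python
import math

def grad_check(cost, grad, theta, eps=1e-7):
    """Return the relative difference between numeric and analytic gradients."""
    approx = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        approx.append((cost(plus) - cost(minus)) / (2 * eps))
    analytic = grad(theta)

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    diff = [a - b for a, b in zip(approx, analytic)]
    return norm(diff) / (norm(approx) + norm(analytic))

def cost(th):
    return th[0] ** 2 + 3 * th[0] * th[1]

def good_grad(th):
    return [2 * th[0] + 3 * th[1], 3 * th[0]]

def bad_grad(th):
    return [2 * th[0], 3 * th[0]]      # deliberately drops the 3 * th[1] term

theta = [1.0, 2.0]
ok = grad_check(cost, good_grad, theta)    # tiny: gradients agree
bug = grad_check(cost, bad_grad, theta)    # large: signals a bug
```

A very small relative difference suggests the backpropagation code is correct; a large one suggests checking individual components of the gradient. Gradient checking is a debugging tool only, since the loop over every parameter is too slow to run during training.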