Stochastic Gradient Descent
How do we train a neural network?
Training a neural network means adjusting its weights.
This lesson introduces the loss function, which measures how good the network's predictions are,
and an "optimizer", which tells the network how to change its weights.
The Loss Function
The loss function measures the difference between the true values and the predicted values.
A common loss function for regression problems is the mean absolute error (MAE).
For each prediction y_pred, MAE measures its disparity from the true target y_true by the absolute difference abs(y_true - y_pred).
Two other loss functions you might see for regression are the mean squared error (MSE) and the Huber loss (both available in Keras).
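As a quick illustration, here is a minimal NumPy sketch of the MAE and MSE formulas (the toy arrays are invented for this example, not taken from the lesson's data):
import numpy as np
# hypothetical true targets and predictions, just to show the formulas
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error -> 0.5
mse = np.mean((y_true - y_pred) ** 2)    # mean squared error  -> 0.375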
In other words, the loss function tells the network its objective.
The Optimizer: Stochastic Gradient Descent
The optimizer is an algorithm that adjusts the weights to minimize the loss.
Virtually all of the optimization algorithms used in deep learning belong to a family called stochastic gradient descent.
One step of training goes like this:
- Sample some training data and run it through the network to make predictions.
- Measure the loss between the predictions and the true values.
- Finally, adjust the weights in a direction that makes the loss smaller.
Then just do this over and over until the loss is as small as you like (or until it won't decrease any further).
Each iteration's sample of training data is called a minibatch (or often just "batch"), while a complete round of the training data is called an epoch. The number of epochs you train for is how many times the network will see each training example.
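The sketch below writes one such training step out by hand with Keras and a GradientTape; the tiny model, the MAE loss, and the random minibatch are all assumptions made just for illustration:
import numpy as np
import tensorflow as tf
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])      # a stand-in network
loss_fn = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.05)
X_batch = np.random.rand(32, 3).astype("float32")            # sample some training data
y_batch = np.random.rand(32, 1).astype("float32")
with tf.GradientTape() as tape:
    y_pred = model(X_batch, training=True)                   # run it through the network
    loss = loss_fn(y_batch, y_pred)                          # measure the loss
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))   # adjust the weights to shrink the loss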
Learning Rate and Batch Size
During training, the fitted line in the figure turns only gradually, because the weights are adjusted a little at a time. What controls how quickly it moves? The learning rate. A smaller learning rate means the network needs to see more minibatches before its weights converge to their best values.
The learning rate and the size of the minibatches are the two parameters that have the largest effect on how SGD training proceeds.
Their interaction is often subtle and the right choice for these parameters isn't always obvious.
Fortunately, for most work it won't be necessary to do an extensive hyperparameter search to get satisfactory results. Adam is an SGD algorithm that has an adaptive learning rate that makes it suitable for most problems without any parameter tuning (it is "self tuning", in a sense). Adam is a great general-purpose optimizer.
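If you ever do want to set the learning rate yourself, you can pass an optimizer object instead of the string "adam" when compiling; this is only a sketch (model is the network assumed to be defined elsewhere in the lesson, and 0.001 is simply Adam's usual default rate):
from tensorflow import keras
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="mae",
)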
Adding the Loss and Optimizer
Adding the loss function and the optimizer:
model.compile(
optimizer="adam",
loss="mae",
)
Sometimes we need to scale the features, because neural networks tend to perform best when their inputs are on a common scale.
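One common way to put features on a common scale is standardization; the sketch below uses scikit-learn's StandardScaler and assumes the X_train / X_valid arrays that appear in the training code further down:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit the scaler on the training features only
X_valid = scaler.transform(X_valid)       # apply the same scaling to the validation features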
After defining the model, call model.compile:
model.compile(
optimizer='adam',
loss='mae',
)
Then train the model with model.fit:
history = model.fit(
X_train, y_train,
validation_data=(X_valid, y_valid),
batch_size=256,
epochs=10,
)
Plot the loss:
import pandas as pd
# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# use Pandas native plot method
history_df['loss'].plot();
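Because validation_data was passed to model.fit, history.history also contains the validation loss under the key 'val_loss'; a small sketch that plots both curves together for comparison:
history_df.loc[:, ['loss', 'val_loss']].plot();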
3) Evaluate Training
If you trained the model longer, would you expect the loss to decrease further?
This depends on how the loss has evolved during training: if the learning curves have levelled off, there won't usually be any advantage to training for additional epochs. Conversely, if the loss appears to still be decreasing, then training for longer could be advantageous.
With the learning rate and the batch size, you have some control over:
- How long it takes to train a model
- How noisy the learning curves are
- How small the loss becomes
You probably saw that smaller batch sizes gave noisier weight updates and loss curves. This is because each batch is a small sample of data, and smaller samples tend to give noisier estimates. Smaller batches can have an "averaging" effect, though, which can be beneficial.
Smaller learning rates make the updates smaller and the training takes longer to converge. Large learning rates can speed up training, but don't "settle in" to a minimum as well. When the learning rate is too large, the training can fail completely. (Try setting the learning rate to a large value like 0.99 to see this.)
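To try that experiment, swap the 'adam' string for an SGD optimizer with an explicit (deliberately too-large) learning rate when compiling; a sketch, reusing the model from the lesson:
from tensorflow import keras
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.99),
    loss='mae',
)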
Key points:
The loss function and the optimizer:
model.compile(
optimizer='adam',
loss='mae',
)