The Regularization method to reduce over-fitting

Author: 墨道院 | Published 2018-09-18 19:41

Why we need regularization

As a deep neural network becomes more and more complex, over-fitting tends to appear, so we need some tricks to overcome it. One solution is regularization. There are several regularization methods; the standard L2 version is discussed in this essay.

How to do regularization

Regularization may sound noble and mysterious, but it is just an extra term added to the original cost function. Let's first review the cost function without regularization:

J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}

Then, let's view the cost function with regularization:

J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}

In this big equation, \lambda is called the regularization parameter; it is a hyper-parameter. Different values of \lambda generate different models.
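
To make the formula concrete, here is a minimal NumPy sketch of equation (2). The function name, and the assumption that the weights are stored in a dictionary `parameters` as W1, b1, ..., WL, bL, are illustrative choices, not something specified in this essay:

```python
import numpy as np

def compute_cost_with_regularization(AL, Y, parameters, lambd):
    """Cross-entropy cost (eq. 1) plus the L2 penalty (eq. 2).

    AL         -- output-layer activations, shape (1, m)
    Y          -- true labels, shape (1, m)
    parameters -- dict assumed to hold W1, b1, ..., WL, bL
    lambd      -- the regularization parameter lambda
    """
    m = Y.shape[1]

    # Cross-entropy part (equation 1)
    cross_entropy_cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

    # L2 part (equation 2): (lambda / 2m) * sum of squares of every weight entry
    L = len(parameters) // 2          # number of layers (W and b per layer)
    l2_cost = 0.0
    for l in range(1, L + 1):
        l2_cost += np.sum(np.square(parameters["W" + str(l)]))
    l2_cost *= lambd / (2 * m)

    return cross_entropy_cost + l2_cost
```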

The effect of regularization on gradient descent

In deep learning, gradient descent is usually used to find the optimal parameter matrix W. Let's first review the gradient-descent update for W:

w := w - \alpha\frac{\partial J(w, b)}{\partial w} \tag{3}

\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T \tag{4}

If we take derivatives of the new, regularized version of the cost function, the new partial derivative is:

\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T + \frac{\lambda}{m}w \tag{5}

Now, substituting equation 5 into equation 3, we get:

w := w(1 - \alpha\frac{\lambda}{m}) - \frac{\alpha}{m}X(A-Y)^T \tag{6}

From equation 6 we can see that the factor 1 - \alpha\frac{\lambda}{m} is less than 1, so the final value of W will be smaller than before (without regularization). The larger the value of \lambda, the smaller the final value of W.
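
This update is sometimes called "weight decay", because the weights are multiplied by a factor smaller than 1 on every step. Below is a minimal NumPy sketch of equations (5) and (6); the function name and argument layout are illustrative assumptions, not part of the original derivation:

```python
import numpy as np

def update_weights_with_regularization(w, X, A, Y, lambd, alpha):
    """One gradient-descent step on w using equations (5) and (6).

    w     -- weights, shape (n_x, 1)
    X     -- input data, shape (n_x, m)
    A     -- predictions, shape (1, m)
    Y     -- labels, shape (1, m)
    lambd -- regularization parameter lambda
    alpha -- learning rate
    """
    m = X.shape[1]

    # Equation (5): gradient of the regularized cost
    dw = np.dot(X, (A - Y).T) / m + (lambd / m) * w

    # Equations (3)/(6): the usual update; the extra lambda term shrinks w
    # by the factor (1 - alpha * lambda / m) on every step
    w = w - alpha * dw
    return w
```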

Why Regularization can reduce over-fitting

To answer this question intuitively, we start from a basic fact: a trained machine-learning model falls into one of only three cases: "high bias" (under-fitting), "just right", and "high variance" (over-fitting).

(Figure: the three machine-learning cases)

Our target is "just right", and regularization is used to reduce the third case: high variance.

According to the derivation in the last section, the bigger \lambda gets, the smaller the final W becomes. If \lambda becomes large enough, the values in W approach zero, and the whole network behaves like a very simple model, close to logistic regression, because the majority of the network weights are nearly 0. A model that simple would under-fit, so by searching for an intermediate value of \lambda we can reach the "just right" case.
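
In practice that search is usually done by trying several values of \lambda and comparing performance on a held-out dev set. The sketch below assumes hypothetical helpers train_model, predict and accuracy that stand in for whatever training loop and metric you already have:

```python
def search_lambda(X_train, Y_train, X_dev, Y_dev, candidate_lambdas):
    """Return the lambda with the best dev-set accuracy.

    `train_model`, `predict` and `accuracy` are hypothetical helpers
    standing in for an existing training loop and metric.
    """
    best_lambd, best_dev_acc = None, -1.0
    for lambd in candidate_lambdas:
        parameters = train_model(X_train, Y_train, lambd=lambd)   # hypothetical
        dev_acc = accuracy(predict(X_dev, parameters), Y_dev)     # hypothetical
        if dev_acc > best_dev_acc:
            best_lambd, best_dev_acc = lambd, dev_acc
    return best_lambd
```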
