The Regularization method to reduce over-fitting

Author: 墨道院 | Published 2018-09-18 19:41

Why we need regularization

As a deep neural network becomes more and more complex, over-fitting tends to appear, so we need some tricks to overcome it. One solution is regularization. There are several regularization methods; the standard L2 version is discussed in this essay.

How to do regularization

Regularization may sound noble and mysterious, but it is just an extra term added to the original cost function. Let's first review the cost function without regularization:

J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}

Then, let's view the cost function with regularization:

J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}

In this big equation, \lambda is called the regularization parameter; it is a hyper-parameter. Different values of \lambda generate different models.
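
To make the formula concrete, here is a minimal NumPy sketch of equation (2). The function name, and the assumption that the weights are stored in a dictionary `parameters` as W1, b1, ..., WL, bL, are illustrative choices, not something specified in this essay:

```python
import numpy as np

def compute_cost_with_regularization(AL, Y, parameters, lambd):
    """Cross-entropy cost (eq. 1) plus the L2 penalty (eq. 2).

    AL         -- output-layer activations, shape (1, m)
    Y          -- true labels, shape (1, m)
    parameters -- dict assumed to hold W1, b1, ..., WL, bL
    lambd      -- the regularization parameter lambda
    """
    m = Y.shape[1]

    # Cross-entropy part (equation 1)
    cross_entropy_cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

    # L2 part (equation 2): (lambda / 2m) * sum of squares of every weight entry
    L = len(parameters) // 2          # number of layers (W and b per layer)
    l2_cost = 0.0
    for l in range(1, L + 1):
        l2_cost += np.sum(np.square(parameters["W" + str(l)]))
    l2_cost *= lambd / (2 * m)

    return cross_entropy_cost + l2_cost
```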

The effect of regularization on gradient descent

In deep learning, gradient descent is usually used to find the optimal parameter matrix W. Let's first review the gradient-descent update for W:

w := w - \alpha\frac{\partial J(w, b)}{\partial w} \tag{3}

\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T \tag{4}

If we take derivatives of the new, regularized version of the cost function, the new partial derivative is:

\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T + \frac{\lambda}{m}w \tag{5}

Now, substituting equation 5 into equation 3, we get:

w := w(1 - \alpha\frac{\lambda}{m}) - \frac{\alpha}{m}X(A-Y)^T \tag{6}

From equation 6 we can see that the factor 1 - \alpha\frac{\lambda}{m} is less than 1, so the final value of W will be smaller than before (without regularization). The larger the value of \lambda, the smaller the final value of W.
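
This update is sometimes called "weight decay", because the weights are multiplied by a factor smaller than 1 on every step. Below is a minimal NumPy sketch of equations (5) and (6); the function name and argument layout are illustrative assumptions, not part of the original derivation:

```python
import numpy as np

def update_weights_with_regularization(w, X, A, Y, lambd, alpha):
    """One gradient-descent step on w using equations (5) and (6).

    w     -- weights, shape (n_x, 1)
    X     -- input data, shape (n_x, m)
    A     -- predictions, shape (1, m)
    Y     -- labels, shape (1, m)
    lambd -- regularization parameter lambda
    alpha -- learning rate
    """
    m = X.shape[1]

    # Equation (5): gradient of the regularized cost
    dw = np.dot(X, (A - Y).T) / m + (lambd / m) * w

    # Equations (3)/(6): the usual update; the extra lambda term shrinks w
    # by the factor (1 - alpha * lambda / m) on every step
    w = w - alpha * dw
    return w
```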

Why Regularization can reduce over-fitting

To answer this question intuitively, we start from a basic fact: a trained machine-learning model falls into one of only three cases: "high bias" (under-fitting), "just right", and "high variance" (over-fitting).

(Figure: the three machine-learning cases)

Our target is "just right", and regularization is used to reduce the third case: high variance.

According to the derivation in the last section, the bigger \lambda gets, the smaller the final W becomes. If \lambda becomes large enough, the values in W approach zero, and the whole network behaves like a very simple model, close to logistic regression, because the majority of the network weights are nearly 0. A model that simple would under-fit, so by searching for an intermediate value of \lambda we can reach the "just right" case.
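
In practice that search is usually done by trying several values of \lambda and comparing performance on a held-out dev set. The sketch below assumes hypothetical helpers train_model, predict and accuracy that stand in for whatever training loop and metric you already have:

```python
def search_lambda(X_train, Y_train, X_dev, Y_dev, candidate_lambdas):
    """Return the lambda with the best dev-set accuracy.

    `train_model`, `predict` and `accuracy` are hypothetical helpers
    standing in for an existing training loop and metric.
    """
    best_lambd, best_dev_acc = None, -1.0
    for lambd in candidate_lambdas:
        parameters = train_model(X_train, Y_train, lambd=lambd)   # hypothetical
        dev_acc = accuracy(predict(X_dev, parameters), Y_dev)     # hypothetical
        if dev_acc > best_dev_acc:
            best_lambd, best_dev_acc = lambd, dev_acc
    return best_lambd
```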
