1. Logistic Regression (Classification)
Linear regression is not well suited to some classification problems, such as classifying whether an email is spam or not, or judging whether a tumor is malignant based on its size.
So there is another algorithm, logistic regression, which takes several features xi as input, and whose output y has only two possible values: zero or one.
Hypothesis Representation
In linear regression the hypothesis output is θ'x, which can be larger than 1 or smaller than 0, so we pass it through the sigmoid function to keep the hypothesis output between 0 and 1: h(x) = g(θ'x) = 1 / (1 + e^(-θ'x)).
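As a minimal sketch (the names sigmoid, hypothesis, theta and X are illustrative, not from the original post), this is what the hypothesis looks like in Python with NumPy:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), always strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h(x) = g(theta' * x); X holds one example per row, theta is the parameter vector
    return sigmoid(X @ theta)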
Decision Boundary
The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
The decision boundary can be linear or nonlinear, sometimes even a complicated curve.
As we can see above, if we define:
h(x) = g(z) >= 0.5 -> y = 1;
h(x) = g(z) < 0.5 -> y = 0;
which means z = 0 gives the boundary (z >= 0 predicts y = 1, z < 0 predicts y = 0).
So if z = θ'x, then θ'x = 0 is the boundary that divides the space into two parts, y = 0 and y = 1. For example, θ'x = θ0*x0 + θ1*x1 + θ2*x2 gives a linear boundary.
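For illustration only (the parameter values below are made up), here is how such a linear boundary turns into 0/1 predictions:

import numpy as np

theta = np.array([-3.0, 1.0, 1.0])       # made-up parameters: the boundary is x1 + x2 = 3
X = np.array([[1.0, 1.0, 1.0],           # x0 = 1 is the intercept feature
              [1.0, 2.5, 2.5]])
z = X @ theta                            # theta' * x for each example
predictions = (z >= 0).astype(int)       # z >= 0 (h(x) >= 0.5) -> y = 1, otherwise y = 0
print(predictions)                       # prints [0 1]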
Cost Function
We cannot use the same cost function that we use for linear regression, because with the logistic (sigmoid) hypothesis the output would be wavy, causing many local optima; in other words, it would not be a convex function.
So we define the cost function of logistic regression as:
cost(h(x), y) = -log(h(x))        if y = 1
cost(h(x), y) = -log(1 - h(x))    if y = 0
and the overall cost J(θ) is the average of this over all m training examples.
We can rewrite the cost equation into the form:
cost(h(x),y) = -ylog(h(x)) - (1-y)log(1-h(x))
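A vectorized Python sketch of this cost, reusing the sigmoid helper sketched earlier (J averages the per-example cost over the m training examples):

import numpy as np

def cost(theta, X, y):
    # J(theta) = (1/m) * sum( -y*log(h) - (1-y)*log(1-h) )
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m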
Gradient Descent
The update rule has the same form as gradient descent for linear regression:
θj := θj - (α/m) * Σ (h(x^(i)) - y^(i)) * xj^(i)    (update all θj simultaneously)
The only difference is that h(x) is now the sigmoid hypothesis instead of θ'x.
A vectorized implementation is:
θ := θ - (α/m) * X' * (g(Xθ) - y)
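A short Python sketch of this update loop, again reusing the sigmoid helper; alpha and the iteration count are illustrative choices, not values from the original post:

def gradient_descent(theta, X, y, alpha=0.1, iterations=1000):
    # Repeatedly applies theta := theta - (alpha/m) * X' * (g(X*theta) - y)
    m = len(y)
    for _ in range(iterations):
        gradient = X.T @ (sigmoid(X @ theta) - y) / m
        theta = theta - alpha * gradient
    return theta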
Advanced Optimization
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.
2. Multi-class Classification: One-vs-all
If we have more than two categories, then instead of y = {0,1} we expand our definition so that y = {0,1,...,n}. We divide the problem into n+1 binary classification problems (+1 because the index starts at 0).
To summarize:
Train a logistic regression classifier hθ(x) for each class i to predict the probability that y = i.
To make a prediction on a new x, pick the class that maximizes hθ(x).
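A minimal one-vs-all sketch in Python, reusing the sigmoid and gradient_descent helpers above; the function names and hyper-parameters are illustrative:

import numpy as np

def one_vs_all(X, y, num_classes, alpha=0.1, iterations=1000):
    # Train one binary logistic classifier per class: class i vs. everything else
    n = X.shape[1]
    all_theta = np.zeros((num_classes, n))
    for i in range(num_classes):
        binary_y = (y == i).astype(float)       # 1 for class i, 0 for every other class
        all_theta[i] = gradient_descent(np.zeros(n), X, binary_y, alpha, iterations)
    return all_theta

def predict_one_vs_all(all_theta, X):
    # Pick the class whose classifier outputs the highest probability h_theta(x)
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)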
3. Problem: Over-fitting
The hypothesis function may fit the examples in the training set very well, yet fail to predict unseen data well.
(figure: three fits with different numbers of features)
As shown in the picture above, the first curve uses few features, so it does not fit the data well; this is called "under-fitting" or "high bias". The second curve fits about right. The last curve fits every example in the training set, but it looks like an unreasonable, over-complicated drawing and may not predict unseen data well; this is called "over-fitting" or "high variance".
What are the causes of over-fitting?
1). too many features
2). too complicated a hypothesis function
How to solve it?
1). reduce the number of features
2). regularization
. Keep all the features, but reduce the magnitude of the parameters θj.
. Regularization works well when we have a lot of slightly useful features.
Cost Function
We add a regularization term to the cost function. For linear regression the regularized cost is:
J(θ) = (1/(2m)) * [ Σ (h(x^(i)) - y^(i))^2 + λ * Σ θj^2 ]
where the second sum runs over j = 1..n (θ0 is not regularized), and λ is the regularization parameter: it controls the trade-off between fitting the training set well and keeping the parameters small.
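A minimal Python sketch of this regularized cost (lam stands for λ; the function name is illustrative):

import numpy as np

def regularized_linear_cost(theta, X, y, lam):
    # J(theta) = (1/(2m)) * [ sum of squared errors + lam * sum of theta_j^2 for j >= 1 ]
    m = len(y)
    error = X @ theta - y
    return (error @ error + lam * (theta[1:] @ theta[1:])) / (2 * m)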
Regularized Linear Regression
Regularization changes the form of both gradient descent and the normal equation.
Gradient Descent
The regularized updates separate out θ0, which is not penalized:
θ0 := θ0 - (α/m) * Σ (h(x^(i)) - y^(i)) * x0^(i)
θj := θj - α * [ (1/m) * Σ (h(x^(i)) - y^(i)) * xj^(i) + (λ/m) * θj ]    (for j = 1..n)
Normal Equation
The regularized normal equation is:
θ = (X'X + λ*L)^(-1) * X'y
where L is the (n+1)×(n+1) identity matrix with the top-left entry (the one for θ0) set to 0. Recall that if m < n, then X'X is non-invertible. However, when we add the term λ*L (with λ > 0), X'X + λ*L becomes invertible.
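A Python sketch of this regularized normal equation (lam stands for λ; solving the linear system avoids forming an explicit inverse):

import numpy as np

def regularized_normal_equation(X, y, lam):
    # theta = (X'X + lam*L)^(-1) * X'y, where L is the identity with its top-left entry zeroed
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                     # do not penalize the intercept parameter theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)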
Regularized Logistic Regression
We can regularize logistic regression in a similar way that we regularize linear regression.
The regularized cost function is:
J(θ) = -(1/m) * Σ [ y^(i)*log(h(x^(i))) + (1 - y^(i))*log(1 - h(x^(i))) ] + (λ/(2m)) * Σ θj^2
so the gradient descent update changes as follows:
θ0 := θ0 - (α/m) * Σ (h(x^(i)) - y^(i)) * x0^(i)
θj := θj - α * [ (1/m) * Σ (h(x^(i)) - y^(i)) * xj^(i) + (λ/m) * θj ]    (for j = 1..n)
The form is identical to regularized linear regression, but here h(x) is the sigmoid hypothesis.
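A hedged Python sketch of a single regularized gradient step for logistic regression, reusing the sigmoid helper from earlier (lam stands for λ):

def regularized_logistic_step(theta, X, y, alpha, lam):
    # theta_0 is not regularized; theta_j (j >= 1) gets the extra (lam/m)*theta_j shrinkage term
    m = len(y)
    gradient = X.T @ (sigmoid(X @ theta) - y) / m
    gradient[1:] += (lam / m) * theta[1:]
    return theta - alpha * gradient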