6. Classification

Author: 玄语梨落 | Published 2020-08-17 13:57

Classification

Logistic Regression: 0\le h_\theta(x)\le 1

Hypothesis Representation

Want 0\le h_\theta(x)\le 1
Hypothesis: h_\theta(x)=g(\theta^Tx), where g(z)=\frac{1}{1+e^{-z}}
The sigmoid function g(z)=\frac{1}{1+e^{-z}} is also called the logistic function.
h_\theta(x)=P(y=1|x;\theta): the probability that y=1 given x, parameterized by \theta.
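
A minimal Octave sketch of this hypothesis (the variable names g, h, theta, x are illustrative):

g = @(z) 1 ./ (1 + exp(-z));      % sigmoid/logistic function, works elementwise
h = @(theta, x) g(theta' * x);    % hypothesis h_theta(x) = g(theta^T x)

theta = [-3; 1; 1];               % example parameters
x = [1; 2; 2];                    % one example, with x0 = 1
p = h(theta, x)                   % P(y = 1 | x; theta) = g(1), roughly 0.73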

Decision Boundary

h_\theta(x)\ge 0.5 means z\ge 0, i.e. \theta^Tx\ge 0, so we predict y=1 exactly when \theta^Tx\ge 0.
The curve on which h_\theta(x)=0.5 is called the decision boundary.
The decision boundary is a property of the hypothesis, not a property of the data set.
We use the data set to fit \theta; each \theta then defines a decision boundary.
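
In code, the prediction rule therefore only needs the sign of \theta^Tx; a minimal sketch with an illustrative \theta:

theta = [-3; 1; 1];               % decision boundary: -3 + x1 + x2 = 0
x = [1; 4; 0];                    % x1 + x2 = 4 >= 3, so this point lies on the y = 1 side
prediction = (theta' * x >= 0)    % 1 here, i.e. predict y = 1 (h_theta(x) >= 0.5)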

Cost Function

How do we fit the parameters \theta for logistic regression?
Linear Regression:
J(\theta)=\frac{1}{m}\sum_{i=1}^m\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2 \newline Cost(h_\theta(x),y)=\frac{1}{2}(h_\theta(x)-y)^2 \newline J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_\theta(x^{(i)}),y^{(i)})
When h_\theta is the sigmoid hypothesis, this squared-error cost is a non-convex function of \theta, so gradient descent is not guaranteed to find the global minimum. Instead, logistic regression uses:
Cost(h_\theta(x),y)=\left\{ \begin{aligned} &-\log(h_\theta(x)) && \text{if } y=1 \\ &-\log(1-h_\theta(x)) && \text{if } y=0 \end{aligned}\right.

The topic of convexity analysis is beyond the scope of this course.
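
Intuitively, this cost heavily penalizes confident wrong answers: if y=1 but h_\theta(x)\to 0, then -\log(h_\theta(x))\to\infty, while a confident correct prediction h_\theta(x)=1 costs 0; the y=0 case mirrors this with -\log(1-h_\theta(x)). The resulting J(\theta) is also convex, which is what makes gradient descent reliable here.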

Simplified cost function and gradient descent

Cost(h_\theta(x),y)=\left\{ \begin{aligned} &-\log(h_\theta(x)) && \text{if } y=1 \\ &-\log(1-h_\theta(x)) && \text{if } y=0 \end{aligned}\right. \newline Cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))

This compressed form follows from the principle of maximum likelihood estimation.
Cost function:
J(\theta)=-\frac{1}{m}[\sum_{i=1}^my^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]
Want \min_\theta J(\theta):
Repeat {
\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}
}
Although this update rule looks identical to the one for linear regression, h_\theta(x) is the sigmoid of \theta^Tx rather than \theta^Tx itself, so the two algorithms are not the same thing.
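
A minimal vectorized Octave sketch of this cost and update rule (the toy data X, y, the learning rate, and the iteration count are illustrative):

X = [1 1; 1 2; 1 3; 1 4];                 % m x (n+1) design matrix, first column is x0 = 1
y = [0; 0; 1; 1];                         % labels in {0, 1}
[m, n] = size(X);
g = @(z) 1 ./ (1 + exp(-z));              % sigmoid

theta = zeros(n, 1);
alpha = 0.1;                              % learning rate (illustrative)
for iter = 1:1000
  h = g(X * theta);                       % h_theta(x^(i)) for every example
  theta = theta - alpha * (1 / m) * X' * (h - y);   % simultaneous update of all theta_j
end

h = g(X * theta);
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h))   % J(theta) after training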

Advanced Optimization

Given \theta, we have code that can compute

  • J(\theta)
  • \frac{\partial}{\partial \theta_j}J(\theta)

Optimization algorithms:

  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS

Advantages:

  • No need to manually pick \alpha
  • Often faster than gradient descent.

Disadvantages:

  • More complex

An example:

  • the cost function for a toy problem, J(\theta)=(\theta_1-5)^2+(\theta_2-5)^2:
function [jVal, gradient] = costFunction(theta)
  % Cost J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2, minimized at theta = [5; 5]
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;

  % Gradient: partial derivatives of J with respect to theta_1 and theta_2
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

code in Octave:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)

The function 'fminunc' is not gradient descent, but it plays the same role: given the cost value and gradient, it searches for the \theta that minimizes J(\theta).
'fminunc' requires \theta to have at least two dimensions (at least two parameters). To get more information, use 'help fminunc'.
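
To apply the same recipe to logistic regression instead of the toy quadratic, the cost function just has to return J(\theta) and its gradient. A hedged sketch (logisticCost and the small data set are illustrative, not part of the course code):

% In its own file, logisticCost.m:
function [jVal, gradient] = logisticCost(theta, X, y)
  % J(theta) and its gradient for logistic regression on X (m x (n+1)), y (m x 1)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));
  jVal = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));
  gradient = (1 / m) * X' * (h - y);
end

% Then, at the Octave prompt:
X = [1 1; 1 2; 1 3; 1 4];
y = [0; 0; 1; 1];
options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) logisticCost(t, X, y), initialTheta, options)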

Multiclass classification

one-versus-all classification (also called one-versus-rest)

Train a logistic regression classifier h_\theta^{(i)}(x) for each class i to predict the probability that y=i.
On a new input x, to make a prediction, pick the class i that maximizes h_\theta^{(i)}(x).
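
A minimal Octave sketch of this prediction step (the matrix Theta, holding one row of fitted parameters per class, and the input x are illustrative):

g = @(z) 1 ./ (1 + exp(-z));          % sigmoid
Theta = [ 1  -2   0;                  % parameters of h_theta^(1)
         -1   0   2;                  % parameters of h_theta^(2)
          0   1  -1];                 % parameters of h_theta^(3)
x = [1; 0.5; 1.5];                    % new input, with x0 = 1
probs = g(Theta * x);                 % h_theta^(i)(x) for every class i
[maxProb, prediction] = max(probs)    % pick the class i that maximizes h_theta^(i)(x)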
