6. Classification

Author: 玄语梨落 | Published 2020-08-17 13:57

    Classification

    Logistic Regression: 0\le h_\theta(x)\le 1

    Hypothesis Representation

    Want 0\le h_\theta(x)\le 1
    Hypothesis: h_\theta(x)=g(\theta^Tx), where g(z)=\frac{1}{1+e^{-z}}
    The sigmoid function g(z)=\frac{1}{1+e^{-z}} is also called the logistic function.
    h_\theta(x)=P(y=1|x;\theta): the probability that y=1 given x, parameterized by \theta
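    As a small illustration (not part of the original notes), the sigmoid hypothesis is easy to write as a vectorized Octave function; the name 'sigmoid' here is just a placeholder for this sketch.

    % Minimal sketch: vectorized sigmoid g(z) = 1/(1 + e^(-z)).
    function g = sigmoid(z)
      g = 1 ./ (1 + exp(-z));    % elementwise, so z may be a scalar, vector or matrix
    end

    % The hypothesis is then h_theta(x) = sigmoid(theta' * x) for one example,
    % or sigmoid(X * theta) for a design matrix X with one example per row.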

    Decision Boundary

    h_\theta(x)\ge 0.5 means z\ge 0, i.e. \theta^Tx\ge 0.
    The line where h_\theta(x)=0.5 is called the decision boundary.
    The decision boundary is a property of the hypothesis, not a property of the data set.
    We use the data set to fit \theta; each \theta defines a decision boundary.
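    As a concrete, made-up example: with two features and \theta=[-3;1;1], we have \theta^Tx=-3+x_1+x_2, so the decision boundary is the line x_1+x_2=3. The Octave snippet below, using these illustrative values only, classifies a single point against that boundary.

    % Illustrative only: theta = [-3; 1; 1] gives the boundary x1 + x2 = 3.
    theta = [-3; 1; 1];
    x = [1; 2; 2];                     % one example, with intercept term x0 = 1
    h = 1 / (1 + exp(-theta' * x));    % g(theta' * x) = g(1), about 0.73
    prediction = (h >= 0.5)            % predicts y = 1 because theta' * x >= 0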

    Cost Function

    How to fit the parameters \theta for logistic regression.
    Linear Regression:
    J(\theta)=\frac{1}{m}\sum_{i=1}^m\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2 \newline Cost(h_\theta(x^{(i)}),y^{(i)})=\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2 \newline J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_\theta(x^{(i)}),y^{(i)})
    When used for logistic regression, this squared-error cost is non-convex, so logistic regression uses the following cost instead:
    Cost(h_\theta(x),y)=\left\{ \begin{aligned} &-\log(h_\theta(x))(y=1) \\ &-\log(1-h_\theta(x))(y=0) \end{aligned}\right.
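    A quick numeric check (not from the notes) of why this cost makes sense: when y=1 the penalty grows without bound as h_\theta(x) approaches 0 and goes to zero as h_\theta(x) approaches 1, and symmetrically for y=0.

    % Illustrative values only.
    h = [0.99 0.5 0.01];
    cost_if_y1 = -log(h)        % approx [0.01  0.69  4.61]: confident wrong answers cost a lot
    cost_if_y0 = -log(1 - h)    % approx [4.61  0.69  0.01]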

    The topic of convexity analysis is beyond the scope of this course.

    Simplified cost function and gradient descent

    Cost(h_\theta(x),y)=\left\{ \begin{aligned} &-\log(h_\theta(x))(y=1) \\ &-\log(1-h_\theta(x))(y=0) \end{aligned}\right. \newline Cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))

    This form follows from the principle of maximum likelihood estimation.
    Cost function:
    J(\theta)=-\frac{1}{m}[\sum_{i=1}^my^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]
    Want \min_\theta J(\theta):
    Repeat {
    \theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}
    }
    Although this update rule looks identical to the one for linear regression, h_\theta(x) is different in logistic regression, so the two algorithms are not the same thing. A vectorized sketch of the loop is given below.
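    The following vectorized Octave sketch shows one way this loop could look; the design matrix X (with a leading column of ones), the label vector y, the learning rate alpha, and num_iters are assumed inputs that the notes themselves do not define.

    % Minimal sketch: batch gradient descent for logistic regression.
    % X is m x (n+1), y is m x 1, theta is (n+1) x 1.
    m = length(y);
    for iter = 1:num_iters
      h = 1 ./ (1 + exp(-X * theta));                        % h_theta(x) for every example
      J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));    % J(theta), as defined above
      grad = (1/m) * X' * (h - y);                           % vector of partial derivatives
      theta = theta - alpha * grad;                          % simultaneous update of all theta_j
    end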

    Advanced Optimization

    Given \theta, we have code that can compute

    • J(\theta)
    • \frac{\partial}{\partial \theta_j}J(\theta)

    Optimization algorithms:

    • Gradient descent
    • Conjugate gradient
    • BFGS
    • L-BFGS

    Advantages:

    • No need to manually pick \alpha
    • Often faster than gradient descent.

    Disadvantages:

    • More complex

    An example:

    • the function
    function [jVal, gradient] = costFunction(theta)
      % Example cost J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2, minimized at theta = [5; 5]
      jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;

      % Gradient: partial derivatives of J with respect to theta1 and theta2
      gradient = zeros(2, 1);
      gradient(1) = 2 * (theta(1) - 5);
      gradient(2) = 2 * (theta(2) - 5);
    end

    code in Octave:

    options = optimset('GradObj', 'on', 'MaxIter', 100);   % supply the gradient, run at most 100 iterations
    initialTheta = zeros(2, 1);
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
    

    The function 'fminunc' is not gradient descent, but it plays a similar role: it minimizes J(\theta) using the cost value and gradient we supply.
    \theta must have at least 2 parameters (i.e. be at least 2-dimensional) when using 'fminunc'. For more information, run 'help fminunc'.

    Multiclass classification

    One-versus-all classification (also called one-versus-rest)

    Train a logistic regression classifier h_\theta^{(i)}(x) for each class i to predict the probability that y=i.
    On a new input x, to make a prediction, pick the class i that maximizes h_\theta^{(i)}(x).
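    A minimal sketch of the prediction step in Octave, assuming a matrix all_theta with one row of fitted parameters per class (these names are not defined in the notes):

    % Illustrative only: all_theta is num_classes x (n+1), X is m x (n+1) with a leading column of ones.
    probs = 1 ./ (1 + exp(-X * all_theta'));     % h_theta^(i)(x) for every class i and example
    [maxProb, predictions] = max(probs, [], 2);  % pick the class i with the largest h_theta^(i)(x)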
