Machine Learning: Logistic Regression

Author: 风吹往事散 | Published 2018-09-01 16:44

    1. Binary Classification with Logistic Regression

    1.1 Batch Gradient Ascent

    from numpy import *
    import matplotlib.pyplot as plt

    def loadDataSet():
        # Load the sample data; prepend a constant 1.0 so the intercept
        # term w_0 is learned along with the other weights.
        dataMat = []; labelMat = []
        with open("testSet.txt") as fr:
            for line in fr:
                lineArr = line.strip().split()
                dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
                labelMat.append(int(lineArr[2]))
        return dataMat, labelMat


    def sigmoid(inX):
        # numpy's exp works elementwise, so inX may be a scalar or a matrix
        return 1.0 / (1 + exp(-inX))


    def gradAscent(dataMatIn, classLabels):
        # Batch gradient ascent: every update uses the entire data set.
        dataMatrix = mat(dataMatIn)                 # m x n feature matrix
        labelMat = mat(classLabels).transpose()     # m x 1 label column
        m, n = shape(dataMatrix)
        alpha = 0.001                               # step size
        maxCycle = 500                              # number of iterations
        weights = ones((n, 1))
        for k in range(maxCycle):
            h = sigmoid(dataMatrix * weights)       # m x 1 vector of predictions
            error = labelMat - h                    # prediction error per sample
            weights = weights + alpha * dataMatrix.transpose() * error
        return weights
    
    
    def plotBestFit(wei):
        # Plot the two classes and the fitted decision boundary.
        weights = array(wei).flatten()              # accepts a matrix or a 1-D array
        dataMat, labelMat = loadDataSet()
        dataArr = array(dataMat)
        n = shape(dataArr)[0]
        xcord1 = []; ycord1 = []
        xcord2 = []; ycord2 = []
        for i in range(n):
            if int(labelMat[i]) == 1:
                xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
            else:
                xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
        ax.scatter(xcord2, ycord2, s=30, c='green')
        x = arange(-3.0, 3.0, 0.1)
        # boundary: w_0 + w_1*x + w_2*y = 0  =>  y = (-w_0 - w_1*x)/w_2
        y = (-weights[0] - weights[1]*x) / weights[2]
        ax.plot(x, y)
        plt.xlabel('X1'); plt.ylabel('X2')
        plt.show()

    dataArr, labelMat = loadDataSet()
    weights = gradAscent(dataArr, labelMat)
    print(weights)
    plotBestFit(weights)
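    The update rule in gradAscent can be read as ordinary gradient ascent on the log-likelihood of the training data (a standard derivation). For labels y_i \in \{0, 1\}:

    \ell(w) = \sum_i \left[ y_i \log\sigma(x_i^T w) + (1-y_i)\log(1-\sigma(x_i^T w)) \right]

    \nabla\ell(w) = X^T(y - \sigma(Xw))

    which is exactly the alpha*dataMatrix.transpose()*error step in the code.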
    

    A column of constant 1.0 is prepended to the data to account for the zero-order (intercept) term of the decision boundary, since the assumed separating line is

    z = w_0 + w_1 x_1 + w_2 x_2

    \sigma(z) = \frac{1}{1+e^{-z}}

    When z > 0, the sigmoid output is greater than 0.5 (approaching 1 for large z) and the sample is classified as 1; when z < 0 it is below 0.5 (approaching 0) and the sample is classified as 0.

    The three entries of the final weights vector therefore correspond to w_0, w_1, w_2. Since x_2 is the y-axis coordinate, the separating line takes exactly the form used in the plotting code. Figure 1 shows the classification result (batch gradient ascent, 300 iterations):
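    Setting z = 0 (the decision threshold) and solving for x_2 makes that line explicit:

    0 = w_0 + w_1 x_1 + w_2 x_2 \quad\Rightarrow\quad x_2 = \frac{-w_0 - w_1 x_1}{w_2}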

    [Figure_1.png: classification result of batch gradient ascent]
    1.2 Stochastic Gradient Ascent
    def stocGraAscent0(dataMatrix, classLabels):
        # Stochastic gradient ascent: update the weights one sample at a
        # time, so h and error are scalars rather than vectors.
        dataMatrix = array(dataMatrix)
        m, n = shape(dataMatrix)
        alpha = 0.01
        weights = ones(n)
        for i in range(m):
            h = sigmoid(sum(dataMatrix[i]*weights))
            error = classLabels[i] - h
            weights = weights + alpha*dataMatrix[i]*error
        return weights
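
    To visualize the result, the same plotting helper can be reused (a minimal usage sketch):

    dataArr, labelMat = loadDataSet()
    weights = stocGraAscent0(dataArr, labelMat)
    plotBestFit(weights)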
    

    The result of stochastic gradient ascent is shown in Figure 2. It may not look as good as the batch result, but with so few samples the method cannot show its strengths (roughly a third of the samples are misclassified). As the sample size grows, however, its advantage in computational cost and speed becomes clear: each stochastic update touches a single sample and costs O(n), while each batch iteration multiplies the full m x n matrix.

    [Figure_2.png: classification result of stochastic gradient ascent]
    1.3 Improved Stochastic Gradient Ascent

    Plain stochastic gradient ascent uses a fixed step size, and its randomness is not fully exploited, so the weights oscillate within a range instead of settling. The method can be improved by adapting the step size and by sampling without replacement within each pass, as in the code below. Figure 3 shows the result of the improved method after only 20 passes, which demonstrates its advantage.

    def stocGraAscent1(dataMatrix, classLabels, numIter=150):
        # Improved stochastic gradient ascent: the step size decays over
        # time, and each pass visits every sample once in random order.
        dataMatrix = array(dataMatrix)
        m, n = shape(dataMatrix)
        weights = ones(n)
        for j in range(numIter):
            dataIndex = list(range(m))          # list, so entries can be deleted
            for i in range(m):
                alpha = 4/(1.0+j+i) + 0.01      # decays but never reaches 0
                ranIndex = int(random.uniform(0, len(dataIndex)))
                sample = dataIndex[ranIndex]    # pick a not-yet-visited sample
                h = sigmoid(sum(dataMatrix[sample]*weights))
                error = classLabels[sample] - h
                weights = weights + alpha*error*dataMatrix[sample]
                del(dataIndex[ranIndex])        # sample without replacement
        return weights
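
    Reproducing the 20-pass experiment described above (a usage sketch; numIter=20 overrides the default of 150):

    dataArr, labelMat = loadDataSet()
    weights = stocGraAscent1(dataArr, labelMat, numIter=20)
    plotBestFit(weights)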
    
    [Figure_3.png: improved stochastic gradient ascent after 20 passes]

    2.1 Predicting the Mortality Rate of Horses with Colic

    def classifyVector(inX, weights):
        # Classify one feature vector: probability above 0.5 means class 1.
        prob = sigmoid(sum(inX*weights))
        if prob > 0.5: return 1.0
        else: return 0.0


    def colicTest():
        # Train on the training file, then measure the error rate on the
        # test file. Each line holds 21 features followed by the label.
        trainingSet = []; trainingLabels = []
        with open('horseColicTraining.txt') as frTrain:
            for line in frTrain:
                currLine = line.strip().split('\t')
                lineArr = []
                for i in range(21):
                    lineArr.append(float(currLine[i]))
                trainingSet.append(lineArr)
                trainingLabels.append(float(currLine[21]))
        trainWeights = stocGraAscent1(trainingSet, trainingLabels, 500)
        errorCount = 0; numTestVec = 0.0
        with open('horseColicTest.txt') as frTest:
            for line in frTest:
                numTestVec += 1.0
                currLine = line.strip().split('\t')
                lineArr = []
                for i in range(21):
                    lineArr.append(float(currLine[i]))
                if int(classifyVector(array(lineArr), trainWeights)) != int(float(currLine[21])):
                    errorCount += 1
        errorRate = float(errorCount)/numTestVec
        print("The error rate of this test is: %f" % errorRate)
        return errorRate


    def multiTest():
        # Average the error rate over several independent runs, since the
        # stochastic training gives a different result each time.
        numTests = 10; errorSum = 0.0
        for k in range(numTests):
            errorSum += colicTest()
        print("After %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)))
    

    The final prediction results:

    The error rate of this test is: 0.373134
    The error rate of this test is: 0.328358
    The error rate of this test is: 0.253731
    The error rate of this test is: 0.343284
    The error rate of this test is: 0.417910
    The error rate of this test is: 0.313433
    The error rate of this test is: 0.507463
    The error rate of this test is: 0.298507
    The error rate of this test is: 0.223881
    The error rate of this test is: 0.522388
    After 10 iterations the average error rate is: 0.358209
    
    Process finished with exit code 0
    

    Since about 30% of the values in the data set are missing, this result is acceptable. By tuning the number of training iterations, the error rate can be brought down to around 20%.
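    A minimal sketch of that tuning, assuming colicTest is modified to take the pass count as a parameter (say colicTest(numIter), forwarding it to stocGraAscent1 in place of the hard-coded 500; this parameter is not in the original code):

    # hypothetical sweep over training passes, averaged over 10 runs each
    for numIter in (100, 300, 500, 1000):
        avg = sum(colicTest(numIter) for _ in range(10)) / 10
        print("numIter=%d, average error rate: %f" % (numIter, avg))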
