KNN

作者: 我是嘻哈大哥 | 来源:发表于2018-05-15 07:23 被阅读10次

    一、算法概述

    1.1 KNN(K-Nearest Neighbor)工作原理
    存在一个样本数据集合,也称为训练样本集,并且样本集中每个数据都存在标签,即我们知道样本集中每一数据与所属分类对应的关系。输入没有标签的数据后,将新数据中的每个特征与样本集中数据对应的特征进行比较,提取出样本集中特征最相似数据(最近邻)的分类标签。一般来说,我们只选择样本数据集中前k个最相似的数据,这就是k近邻算法中k的出处,通常k是不大于20的整数。最后选择k个最相似数据中出现次数最多的分类作为新数据的分类。
    说明:KNN没有显示的训练过程,它是“懒惰学习”的代表,它在训练阶段只是把数据保存下来,训练时间开销为0,等收到测试样本后进行处理。
    举例:以电影分类作为例子,电影题材可分为爱情片,动作片等,那么爱情片有哪些特征?动作片有哪些特征呢?也就是说给定一部电影,怎么进行分类?这里假定将电影分为爱情片和动作片两类,如果一部电影中接吻镜头很多,打斗镜头较少,显然是属于爱情片,反之为动作片。有人曾根据电影中打斗动作和接吻动作数量进行评估,数据如下:



     给定一部电影数据(18,90)打斗镜头18个,接吻镜头90个,如何知道它是什么类型的呢?KNN是这样做的,首先计算未知电影与样本集中其他电影的距离(这里使用曼哈顿距离),数据如下:



    现在我们按照距离的递增顺序排序,可以找到k个距离最近的电影,加入k=3,那么来看排序的前3个电影的类别,爱情片,爱情片,动作片,下面来进行投票,这部未知的电影爱情片2票,动作片1票,那么我们就认为这部电影属于爱情片。
    1.2 KNN算法优缺点
      优点:精度高,对异常值不敏感、无数据输入假定
      缺点:计算复杂度高、空间复杂度高
    1.3 KNN算法python代码实现

      实现步骤:
        (1)计算距离
        (2)选择距离最小的k个点
        (3)排序
    二、案列(使用KNN实现手写数字识别)

    from numpy import *
    import operator
    from os import listdir
    
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        diffMat = tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances**0.5
        sortedDistIndicies = distances.argsort()     
        classCount={}          
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
        
    def img2vector(filename):
        returnVect = zeros((1,1024))
        fr = open(filename)
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0,32*i+j] = int(lineStr[j])
        return returnVect
    
    def handwritingClassTest():
        hwLabels = []
        trainingFileList = listdir('trainingDigits')           #load the training set
        m = len(trainingFileList)
        trainingMat = zeros((m,1024))
        for i in range(m):
            fileNameStr = trainingFileList[i]
            fileStr = fileNameStr.split('.')[0]     #take off .txt
            classNumStr = int(fileStr.split('_')[0])
            hwLabels.append(classNumStr)
            trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)
        testFileList = listdir('testDigits')        #iterate through the test set
        errorCount = 0.0
        mTest = len(testFileList)
        for i in range(mTest):
            fileNameStr = testFileList[i]
            fileStr = fileNameStr.split('.')[0]     #take off .txt
            classNumStr = int(fileStr.split('_')[0])
            vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
            classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
            print ("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
            if (classifierResult != classNumStr): errorCount += 1.0
        print ("\nthe total number of errors is: %d" % errorCount)
        print ("\nthe total error rate is: %f" % (errorCount/float(mTest)))
    if __name__=='__main__':
        handwritingClassTest()
    

    运行结果:

    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the classifier came back with: 9, the real answer is: 9
    the total number of errors is: 11
    the total error rate is: 0.011628
    

    相关文章

      网友评论

          本文标题:KNN

          本文链接:https://www.haomeiwen.com/subject/vtkjdftx.html