A First Look at the kNN Algorithm

Author: 触琴触景即伤情 | Published 2019-01-16 13:36

    The kNN Algorithm

    Basic Principles

    The nearest neighbors method (the kNN algorithm, also known as the k-nearest neighbors algorithm) is a non-parametric statistical method used for classification and regression.

    • In k-NN classification, the output is a class membership. An object is classified by a "majority vote" of its neighbors: the most common class among its k nearest neighbors (k a positive integer, typically small) determines the class assigned to the object. If k = 1, the object is simply assigned the class of its single nearest neighbor.

    Simply put, we place the sample to be predicted in the feature space (a two-dimensional feature space is used below for ease of illustration), find the k points closest to it, and then let the labels of those k points vote to predict the label of the sample.

    For ease of presentation we use the Euclidean distance, i.e.
    distance = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
    For now we also ignore distance weighting to simplify understanding, and we do not distinguish between a training set and a test set.
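
    To make the distance computation concrete, here is a minimal sketch of the formula in NumPy (p1 and p2 are made-up illustrative points, not from the data set below):

    import numpy as np
    
    # two illustrative points in a 2-D feature space
    p1 = np.array([0.0, 0.0])
    p2 = np.array([3.0, 4.0])
    
    # Euclidean distance: square root of the summed squared differences
    distance = np.sqrt(np.sum((p1 - p2) ** 2))
    print(distance)  # 5.0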

    A Simple Implementation

    As the process below shows, kNN can be described as an algorithm that requires no training step.

    The k-nearest neighbors algorithm is quite special in that it can be considered an algorithm without a model.

    To stay consistent with other algorithms, the training data set itself can be regarded as the model.

    First, create a synthetic data set and a sample x to be predicted:

    import numpy as np
    import matplotlib.pyplot as plt
    
    # features of the ten training samples (two features each)
    raw_data_X = [[3.393, 2.331],
                  [3.110, 1.782],
                  [1.343, 3.368],
                  [3.582, 4.679],
                  [2.280, 2.967],
                  [7.423, 4.697],
                  [5.745, 3.534],
                  [9.172, 2.511],
                  [7.792, 3.424],
                  [7.940, 0.792]]
    
    # labels of the training samples
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    
    X_train = np.array(raw_data_X)
    y_train = np.array(raw_data_y)
    
    # the sample to be predicted
    x = np.array([8.094, 3.366])
    
    # plot label-0 samples in green, label-1 samples in red, and x in blue
    plt.scatter(X_train[y_train==0,0], X_train[y_train==0,1], color='g')
    plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1], color='r')
    plt.scatter(x[0], x[1], color='b')
    plt.show()
    

    Plot the data set and the sample to be predicted on the axes.

    Here we assume that label 0 (the green points) represents early-stage tumors and label 1 (the red points) represents late-stage tumors; the blue point is the sample to be predicted. You can see that the points nearest to it are all red, so unfortunately the patient is probably late-stage.

    [Figure output_2_0.png: scatter plot of the green (label 0) and red (label 1) training points, with the blue sample to be predicted]

    Writing Code to Make a Prediction

    kNN_simple_implement.py

    import numpy as np
    from math import sqrt
    from collections import Counter
    
    def kNN_classify(k, X_train, y_train, x):
        assert 1 <= k <= X_train.shape[0], "k must be valid"
        assert X_train.shape[0] == y_train.shape[0], ("the size"
            + " of X_train must be equal to the size of y_train")
        assert X_train.shape[1] == x.shape[0], ("the feature "
            + "number of x must be equal to X_train")
        # Euclidean distance from x to every training sample
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
        # indices of the training samples sorted by distance, nearest first
        nearest = np.argsort(distances)
    
        # labels of the k nearest samples, then a majority vote
        topK_y = [y_train[i] for i in nearest[:k]]
        votes = Counter(topK_y)
    
        return votes.most_common(1)[0][0]
    

    Note that the key code is quite short: just five simple lines.

        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
        nearest = np.argsort(distances)
    
        topK_y = [y_train[i] for i in nearest[:k]]
        votes = Counter(topK_y)
    
        votes.most_common(1)[0][0]
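
    If argsort and Counter are unfamiliar, here is a small sketch of what the two calls do (the toy numbers are made up):

    import numpy as np
    from collections import Counter
    
    # argsort returns the indices that would sort the array, smallest first
    d = np.array([4.2, 1.1, 3.5])
    print(np.argsort(d))         # [1 2 0] -> index 1 holds the smallest value
    
    # Counter tallies the labels; most_common(1) gives the winning (label, count) pair
    votes = Counter([1, 1, 0])
    print(votes.most_common(1))  # [(1, 2)] -> label 1 wins with two votes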
    

    Load the code into Jupyter:

    %run kNN_simple_implement.py
    

    Make a prediction:

    predict_y = kNN_classify(k=3, X_train=X_train, y_train=y_train, x=x)
    predict_y
    
        1
    

    The predicted label matches our guess: it is 1, which represents late-stage. I used k = 3 here; in fact the nearest five samples should all be late-stage, so the prediction is accurate (see the check below).
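
    As a quick check of that claim, we can rerun the same kNN_classify with k = 5; since all five nearest neighbors are late-stage samples, the result is unchanged:

    kNN_classify(k=5, X_train=X_train, y_train=y_train, x=x)
    
        1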

    Using kNN from scikit-learn

    Now let's look at how the kNN algorithm is used in scikit-learn, where the samples to be predicted are placed in a matrix so that predictions can be made in batch.

    from sklearn.neighbors import KNeighborsClassifier
    
    kNN_classifier = KNeighborsClassifier(n_neighbors=6)
    
    kNN_classifier.fit(X_train, y_train)
    
    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_jobs=1, n_neighbors=6, p=2,
               weights='uniform')
    
    # put the data to be predicted into a matrix
    X_predict = x.reshape(1, -1)
    y_predict = kNN_classifier.predict(X_predict)
    y_predict[0]
    
    1
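
    Since predict works on a matrix, several samples can be classified in one call. A minimal sketch with a second, made-up query point added:

    # batch prediction: each row is one sample to classify
    X_batch = np.array([[8.094, 3.366],
                        [2.500, 3.000]])
    kNN_classifier.predict(X_batch)
    
    array([1, 0])
    
    (KNeighborsClassifier also accepts weights='distance' to weight the votes by inverse distance, i.e. the distance weighting we ignored earlier.)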
    

    Reorganizing the kNN Code

    As shown above, our prediction matches the one from sklearn. However, our code is just a single function and is not organized in an object-oriented fashion. Below, we rewrite it following sklearn's style.

    kNN.py

    import numpy as np
    from math import sqrt
    from collections import Counter
    
    class kNNClassifier():
        
        def __init__(self, k):
            assert k >= 1, "k must be valid"
            self.k = k
            self._X_train = None
            self._y_train = None
    
        def fit(self, X_train, y_train):
            assert X_train.shape[0] == y_train.shape[0], ("the "
                + "size of X_train must be equal to the size of "
                + "y_train")
            assert self.k <= X_train.shape[0], ("the size of "
                + "X_train must be at least k")
            # kNN has no real training step: just remember the data
            self._X_train = X_train
            self._y_train = y_train
            return self
    
        def predict(self, X_predict):
            assert (self._X_train is not None and 
                self._y_train is not None), ("must fit "
                + "before predict")
            assert X_predict.shape[1] == self._X_train.shape[1],\
                ("the feature number of X_predict must be equal"
                + " to X_train")
            # predict each row of the matrix individually
            y_predict = [self._predict(x) for x in X_predict]
            return np.array(y_predict)
    
        def _predict(self, x):
            # Euclidean distance from x to every training sample
            distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
            nearest = np.argsort(distances)
    
            # majority vote among the k nearest labels
            topK_y = [self._y_train[i] for i in nearest[:self.k]]
            votes = Counter(topK_y)
    
            return votes.most_common(1)[0][0]
            
        def __repr__(self):
            return "kNN(k=%d)" % self.k
    
    
    %run kNN.py
    
    kNN_classifier_re = kNNClassifier(6)
    kNN_classifier_re.fit(X_train, y_train)
    
    kNN(k=6)
    
    y_predict_re = kNN_classifier_re.predict(X_predict)
    y_predict_re[0]
    
    1
    

    As you can see, the result is the same. In many cases we can learn programming ideas by imitating how other people's frameworks encapsulate their code.
