美文网首页
KNN(K-Nearest Neighbor)算法

KNN(K-Nearest Neighbor)算法

作者: Muggle01 | 来源:发表于2018-09-01 19:03 被阅读0次

    算法背景

    K最近邻(K-Nearest Neighbor,KNN)算法,是著名的模式识别统计学方法,在机器学习分类算法中占有相当大的地位。它是一个理论上比较成熟的方法。既是最简单的机器学习算法之一,也是基于实例的学习方法中最基本的,又是最好的文本分类算法之一。

    KNN工作原理

    存在一个样本数据集合,也称为训练样本集,并且样本集中每个数据都存在标签,即我们知道样本集中每一数据与所属分类对应的关系。输入没有标签的数据后,将新数据中的每个特征与样本集中数据对应的特征进行比较,提取出样本集中特征最相似数据(最近邻)的分类标签。一般来说,我们只选择样本数据集中前k个最相似的数据,这就是k近邻算法中k的出处,通常k是不大于20的整数。最后选择k个最相似数据中出现次数最多的分类作为新数据的分类。

    说明:KNN没有显示的训练过程,它是“懒惰学习”的代表,它在训练阶段只是把数据保存下来,训练时间开销为0,等收到测试样本后进行处理。

    KNN算法的优缺点

    KNN算法的优点:

    1)简单、有效。
    2)重新训练的代价较低(类别体系的变化和训练集的变化,在Web环境和电子商务应用中是很常见的)。
    3)计算时间和空间线性于训练集的规模(在一些场合不算太大)。
    4)由于KNN方法主要靠周围有限的邻近的样本,而不是靠判别类域的方法来确定所属类别的,因此对于类域的交叉或重叠较多的待分样本集来说,KNN方法较其他方法更为适合。
    5)该算法比较适用于样本容量比较大的类域的自动分类,而那些样本容量较小的类域采用这种算法比较容易产生误分。

    KNN算法缺点:

    1)KNN算法是懒散学习方法(lazy learning,基本上不学习),一些积极学习的算法要快很多。
    2)类别评分不是规格化的(不像概率评分)。
    3)输出的可解释性不强,例如决策树的可解释性较强。
    4)该算法在分类时有个主要的不足是,当样本不平衡时,如一个类的样本容量很大,而其他类样本容量很小时,有可能导致当输入一个新样本时,该样本的K个邻居中大容量类的样本占多数。该算法只计算“最近的”邻居样本,某一类的样本数量很大,那么或者这类样本并不接近目标样本,或者这类样本很靠近目标样本。无论怎样,数量并不能影响运行结果。可以采用权值的方法(和该样本距离小的邻居权值大)来改进。
    5)计算量较大。目前常用的解决方法是事先对已知样本点进行剪辑,事先去除对分类作用不大的样本。

    Python实现

    采用均值方差归一化方法,数据进行归一化之后,考虑距离加权,计算鸢尾花的分类。

    源程序

    
    #!/usr/bin/env python 
    # -*- coding: UTF-8 -*-
    
    import numpy as np
    from math import sqrt
    from collections import Counter
    from sklearn import datasets
    from sklearn.metrics import accuracy_score
    
    class KNN:
    
        def __init__(self, k, p):
            """初始化kNN分类器"""
            assert k >= 1, "k must be valid"
            self.k = k
            self.p = p
            self._X_train = None
            self._y_train = None
            self.y_possible = None
        
        def data_split (self, X, y, test_ratio=0.2, seed=None):
            """根据test_ratio分隔数据集"""
            assert X.shape[0] == y.shape[0],\
            "the size of X must equal to the size of y"
            assert 0.0 <=test_ratio <=1.0, \
            "test_ratio must be valid"
    
            if seed:
                np.random.seed(seed)
    
            #将原始数据打乱
            shuffle_indexes = np.random.permutation(len(X)) 
            test_size = int(len(X)*test_ratio)
            test_indexes = shuffle_indexes[:test_size]
            train_indexes = shuffle_indexes[test_size:]
    
            X_train = X[train_indexes]
            y_train = y[train_indexes]
    
            X_test = X[test_indexes]
            y_test = y[test_indexes]
    
            #y的所有可能取值
            y_list = y_train.tolist() #将训练y值矩阵转换为列表
            y_possible_list = sorted(set(y_list),key=y_list.index)  #去除列表中重复数据,数据排列不变
            self.y_possible = np.array(y_possible_list) #将列表转换为矩阵
    
    
            # 数据归一化
            X_train, X_test= self._scaling(X_train, X_train), self._scaling(X_train, X_test)
    
            return X_train, y_train, X_test, y_test
    
        def _scaling(self, X_train, original_X):
    #        """最值归一化"""
    #        scaling_X = np.ones(shape=original_X.shape)
    #        for i in range(original_X.shape[1]):
    #            scaling_X[:,i] = (original_X[:,i] - np.min(X_train[:,i]))/(np.max(X_train[:,i])-np.min(X_train[:,i]))
    
            """均值方差归一化"""
            scaling_X = np.zeros(shape=original_X.shape)
            for i in range(original_X.shape[1]):
                scaling_X[:,i] = (original_X[:,i] - np.mean(X_train[:,i]))/(np.std(X_train[:,i]))
            return scaling_X
    
        def fit(self, X_train, y_train):
            """根据训练数据集X_train和y_train训练kNN分类器"""
            assert X_train.shape[0] == y_train.shape[0], \
                "the size of X_train must be equal to the size of y_train"
            assert self.k <= X_train.shape[0], \
                "the size of X_train must be at least k."
    
            self._X_train = X_train
            self._y_train = y_train
            return self
    
        def predict(self, X_predict):
            """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
            assert self._X_train is not None and self._y_train is not None, \
                    "must fit before predict!"
            assert X_predict.shape[1] == self._X_train.shape[1], \
                    "the feature number of X_predict must be equal to X_train"
    
            y_predict = [self._predict(x) for x in X_predict]
            return np.array(y_predict)
    
        def _predict(self, x):
            """给定单个待预测数据x,返回x的预测结果值"""
            assert x.shape[0] == self._X_train.shape[1], \
                "the feature number of x must be equal to X_train"
    
            distances = [pow(np.sum((abs(x_train - x)) ** self.p), 1/self.p) 
                        for x_train in self._X_train]
            nearest = np.argsort(distances)
            topK_y = [self._y_train[i] for i in nearest[:self.k]]
            topK_distance = [distances[i] for i in nearest[:self.k]]
            votes_count = np.zeros(shape=(self.y_possible.shape))
    
            for j in range(self.k):
                vote_id = np.argwhere(self.y_possible == topK_y[j])
                votes_count[vote_id] += 1/(topK_distance[j]+1e-8)     #防止分母为零
                    
            predict_value_id = np.argmax(votes_count)
            predict_value = self.y_possible[predict_value_id]
    
            return predict_value
    
        def score(self, X_test, y_test):
            """根据测试数据集 X_test 和 y_test 确定当前模型的准确度"""
    
            y_predict = self.predict(X_test)
            return accuracy_score(y_test, y_predict)
    
        def __repr__(self):
            return "KNN(k=%d)" % self.k
    
    
    #KNN算法用于鸢尾花种类判定
    
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    
    iris_knn = KNN(3,7)
    X_train, y_train, X_test, y_test = iris_knn.data_split (X, y)
    iris_knn.fit(X_train, y_train)
    score = iris_knn.score(X_test, y_test)
    print(score)
    
    

    调用sklearn

    #!/usr/bin/env python 
    # -*- coding: UTF-8 -*-
    import numpy as np
    from sklearn import datasets
    from sklearn.metrics import accuracy_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    
    
    if __name__=='__main__':
        iris = datasets.load_iris()
        print(iris.DESCR) #显示数据具体信息
        X = iris.data
        y = iris.target
    
        #数据分割
        X_train, X_test, y_train, y_test = train_test_split (X, y, test_size=0.2, random_state=None)
    
        #数据归一化
        StandardScaler = StandardScaler()
        StandardScaler.fit(X_train)
        X_train = StandardScaler.transform(X_train)
        X_test = StandardScaler.transform(X_test)
    
        #KNN算法处理
        knn_clf = KNeighborsClassifier(n_neighbors=3)
        knn_clf.fit(X_train, y_train)
        print(knn_clf.score(X_test, y_test))
    
        param_grid = [ 
         { 
             'weights':['uniform'],
             'n_neighbors':[i for i in range(1,6)]
          },
         { 
             'weights':['distance'],
              'n_neighbors':[i for i in range(1,6)],
              'p':[i for i in range(1,4)]
          }
        ]
    
        #优化超参数
    
        grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
        grid_search.fit(X_train, y_train)
    
        print(grid_search.best_score_)
        print(grid_search.best_params_)
    
        knn_clf = grid_search.best_estimator_
        print(knn_clf.score(X_test, y_test))
    

    python tips

    将原始数据打乱:
    shuffle_indexes = np.random.permutation(len(X))
    test_size = int(len(X)*test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]

    列表和矩阵的互相转换:
    y_list = y_train.tolist() #将训练y值矩阵转换为列表
    self.y_possible = np.array(y_possible_list) #将列表转换为矩阵

    相关文章

      网友评论

          本文标题:KNN(K-Nearest Neighbor)算法

          本文链接:https://www.haomeiwen.com/subject/jxduwftx.html