I worked through the kNN part of the CS231n assignment 1, following the notebook's guidance. The kNN algorithm itself is very simple; the exercise is mainly meant to familiarize the learner with:
- the use of some numpy APIs
- a feel for how numpy matrix operations speed up an algorithm
- using cross-validation to choose a hyperparameter (here, the k of the kNN algorithm)
Although the algorithm is simple, I still ran into quite a few problems and spent three or four hours in total, mostly because I was unfamiliar with the numpy APIs and kept alternating between reading documentation and writing small demo snippets to verify my understanding. Implementing the method def compute_distances_no_loops(self, X) also took a while, because I didn't know the matrix identity behind it. Below I record the process, following the order of the notebook.
1. compute_distances_two_loops
First, compute the L2 (Euclidean) distance between every test sample and every training sample with the most straightforward approach: two nested for loops. For two D-dimensional points $p$ and $q$, the L2 distance is defined as

$$d(p, q) = \sqrt{\sum_{k=1}^{D} (p_k - q_k)^2}$$
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
        for j in xrange(num_train):
            #################################################################
            # TODO:                                                         #
            # Compute the l2 distance between the ith test point and the    #
            # jth training point, and store the result in dists[i, j]. You  #
            # should not use a loop over dimension.                         #
            #################################################################
            dists[i, j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
            #################################################################
            #                       END OF YOUR CODE                        #
            #################################################################
    return dists
This implementation is very slow; as the timing comparison later in the exercise shows, it takes roughly 30 seconds on my machine.
2. predict_labels
Next, predict a label for each test sample by implementing the following function:
def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
        # A list of length k storing the labels of the k nearest neighbors to
        # the ith test point.
        #####################################################################
        # TODO:                                                             #
        # Use the distance matrix to find the k nearest neighbors of the    #
        # ith testing point, and use self.y_train to find the labels of     #
        # these neighbors. Store these labels in closest_y.                 #
        # Hint: Look up the function numpy.argsort.                         #
        #####################################################################
        sorted_index = np.argsort(dists[i])
        closest_y = self.y_train[sorted_index[:k]]
        #####################################################################
        # TODO:                                                             #
        # Now that you have found the labels of the k nearest neighbors,    #
        # you need to find the most common label in the list closest_y of   #
        # labels. Store this label in y_pred[i]. Break ties by choosing     #
        # the smaller label.                                                #
        #####################################################################
        # NOTE: do not name the comprehension variable i; in Python 2 the
        # comprehension variable leaks out and clobbers the i of the
        # enclosing loop (see below).
        timeLabel = sorted([(np.sum(closest_y == y_), y_) for y_ in set(closest_y)])[-1]
        y_pred[i] = timeLabel[1]
        #####################################################################
        #                           END OF YOUR CODE                        #
        #####################################################################
    return y_pred
This function is the core of the kNN algorithm: for each test sample, take the distances to all training samples computed above, pick the k training samples with the smallest distances, and count the labels among those k neighbors; the most frequent label becomes the prediction for that test sample. One subtlety about tie-breaking is worth noting; see the sketch below.
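The TODO asks for ties to be broken by choosing the smaller label, but sorted(...)[-1] over (count, label) pairs picks the larger label when counts tie. Since CIFAR-10 labels are small non-negative integers, a variant that matches the spec (my alternative, not the notebook's) uses np.bincount, whose argmax returns the first, i.e. smallest, label among the most frequent:

# Drop-in replacement for the vote inside the loop over test points;
# closest_y holds the labels of the k nearest neighbors (non-negative ints).
# np.bincount counts occurrences of each label, and np.argmax returns the
# smallest label among those with the maximal count.
y_pred[i] = np.argmax(np.bincount(closest_y))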
Implementing this took quite a while, mainly because I misjudged the scoping of the variable in a Python list comprehension. My first attempt reused the name i inside the comprehension:
timeLabel = sorted([(np.sum(closest_y == i), i) for i in set(closest_y)])[-1]
In Python 2, the comprehension variable leaks into the enclosing scope, so this i clobbered the i of the outer loop:
for i in xrange(num_test):
and the computed accuracy came out consistently wrong. It took half a day of digging to find the cause +_+
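A minimal demonstration of the leak (Python 2 semantics; in Python 3 a list comprehension has its own scope and this prints 0, 1, 2):

for i in range(3):
    _ = [i for i in [10, 20, 30]]
    print i  # Python 2 prints 30 every time: the comprehension rebinds i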
3. Evaluating accuracy
Combining the two steps above and comparing the predictions against the true labels of the test set gives the accuracy of the kNN model:
dists = classifier.compute_distances_two_loops(X_test)
# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)
# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)
The accuracy obtained here is about 27%:
Got 137 / 500 correct => accuracy: 0.274000
Running again with k = 5 raises the accuracy to about 28%:
Got 143 / 500 correct => accuracy: 0.286000
4. Speeding up the distance computation
The exercise next implements two more versions of the distance computation, to demonstrate how much the implementation strategy affects running time.
- compute_distances_one_loop
Implemented with a single for loop:
def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and all        #
        # training points, and store the result in dists[i, :].             #
        #####################################################################
        # X[i] has shape (D,); broadcasting subtracts it from every row of
        # self.X_train, which has shape (num_train, D).
        dists[i, :] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
        #####################################################################
        #                         END OF YOUR CODE                          #
        #####################################################################
    return dists
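The key here is numpy broadcasting: X[i] has shape (D,), and subtracting it from self.X_train of shape (num_train, D) subtracts it from every row at once. A toy demo (made-up shapes, not from the notebook):

import numpy as np

train = np.arange(6.0).reshape(3, 2)   # 3 "training" points in 2-D
x = np.array([1.0, 1.0])               # one "test" point, shape (2,)
diff = x - train                       # broadcast to shape (3, 2)
print np.sqrt(np.sum(np.square(diff), axis=1))
# [ 1.          2.23606798  5.        ]  -- distance to each training row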
- compute_distances_no_loops
Fully vectorized, using numpy matrix operations:
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training     #
    # points without using any explicit loops, and store the result in     #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.               #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication   #
    # and two broadcast sums.                                              #
    #########################################################################
    # Reference: http://blog.csdn.net/zhyh1435589631/article/details/54236643
    dists = np.sqrt(self.getNormMatrix(X, num_train).T
                    + self.getNormMatrix(self.X_train, num_test)
                    - 2 * np.dot(X, self.X_train.T))
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

def getNormMatrix(self, x, lines_num):
    """
    Return a (lines_num, x.shape[0]) matrix in which every row is the vector
    of squared L2 norms of the rows of x.
    """
    return np.ones((lines_num, 1)) * np.sum(np.square(x), axis=1)
This uses a matrix identity to compute the distance matrix between the test set and the training set in one shot. The derivation goes as follows. Let $P$ be the $(M, D)$ matrix of test points and $C$ the $(N, D)$ matrix of training points, writing $P_i$ and $C_j$ for their rows.

Expanding the squared distance between row $P_i$ and row $C_j$ and regrouping terms:

$$d_{ij}^2 = \|P_i - C_j\|^2 = \|P_i\|^2 - 2\, P_i C_j^T + \|C_j\|^2$$

Generalizing to the whole matrix, with the squared row norms of $P$ broadcast across columns and those of $C$ broadcast across rows:

$$D_{\text{sq}} = \begin{pmatrix} \|P_1\|^2 \\ \vdots \\ \|P_M\|^2 \end{pmatrix} \mathbf{1}_N^T + \mathbf{1}_M \begin{pmatrix} \|C_1\|^2 & \cdots & \|C_N\|^2 \end{pmatrix} - 2\, P C^T$$

and the distance matrix is the elementwise square root of $D_{\text{sq}}$.
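An equivalent way to realize the two broadcast sums is np.newaxis instead of multiplying by a matrix of ones; a quick sanity check on toy data (my sketch, not part of the assignment):

import numpy as np

P = np.random.randn(5, 4)   # "test" points,     shape (M, D)
C = np.random.randn(7, 4)   # "training" points, shape (N, D)

# ||P_i||^2 as an (M, 1) column plus ||C_j||^2 as an (N,) row, minus 2 P C^T.
dists = np.sqrt(np.sum(P ** 2, axis=1)[:, np.newaxis]
                + np.sum(C ** 2, axis=1)
                - 2 * P.dot(C.T))

# Cross-check against a direct (M, N, D) computation.
naive = np.sqrt(np.sum((P[:, np.newaxis, :] - C[np.newaxis, :, :]) ** 2, axis=2))
print np.allclose(dists, naive)  # True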
Finally, timing the three distance computations gives:
Two loop version took 29.605000 seconds
One loop version took 67.478000 seconds
No loop version took 0.277000 seconds
Interestingly, the one-loop version is even slower than the two-loop version here, presumably because every iteration materializes a large (num_train, D) intermediate array. The fully vectorized version, however, is faster by about two orders of magnitude, which shows just how efficient numpy's matrix operations are.
5. Cross-validation
Next, use cross-validation to choose the value of the hyperparameter k.
The idea of cross-validation is to split the training set into n parts (here, 5), each called a fold, then iterate over the n folds, each time taking one fold as the validation set and the remaining n - 1 folds together as the training set, training on the latter and measuring accuracy on the former.
Pick a set of candidate k values, run this procedure for each of them, and finally choose the most suitable k based on the accuracies.
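Before the notebook code, a quick look at numpy.array_split, which the hint below recommends for the splitting; unlike numpy.split, it also accepts a count that does not divide the array evenly:

import numpy as np

print np.array_split(np.arange(10), 3)
# [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]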
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []
################################################################################
# TODO: #
# Split up the training data into folds. After splitting, X_train_folds and #
# y_train_folds should each be lists of length num_folds, where #
# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #
# Hint: Look up the numpy array_split function. #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
# END OF YOUR CODE #
################################################################################
# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}
################################################################################
# TODO: #
# Perform k-fold cross validation to find the best value of k. For each #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all #
# values of k in the k_to_accuracies dictionary. #
################################################################################
for k in k_choices:
    for f in range(num_folds):
        # Stack all folds except fold f into one training set.
        X_train_tmp = np.array(X_train_folds[:f] + X_train_folds[f + 1:])
        y_train_tmp = np.array(y_train_folds[:f] + y_train_folds[f + 1:])
        X_train_tmp = X_train_tmp.reshape(-1, X_train_tmp.shape[2])
        y_train_tmp = y_train_tmp.reshape(-1)
        # Fold f is the validation set.
        X_va = np.array(X_train_folds[f])
        y_va = np.array(y_train_folds[f])
        classifier.train(X_train_tmp, y_train_tmp)
        dists = classifier.compute_distances_no_loops(X_va)
        y_test_pred = classifier.predict_labels(dists, k)
        # Compute the fraction of correctly predicted examples
        num_correct = np.sum(y_test_pred == y_va)
        accuracy = float(num_correct) / y_va.shape[0]
        k_to_accuracies.setdefault(k, []).append(accuracy)
################################################################################
# END OF YOUR CODE #
################################################################################
# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print 'k = %d, accuracy = %f' % (k, accuracy)
The output is:
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.252000
k = 3, accuracy = 0.281000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.290000
k = 3, accuracy = 0.281000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.285000
k = 5, accuracy = 0.290000
k = 5, accuracy = 0.303000
k = 5, accuracy = 0.284000
k = 8, accuracy = 0.270000
k = 8, accuracy = 0.310000
k = 8, accuracy = 0.281000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.291000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.298000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.289000
k = 10, accuracy = 0.288000
k = 12, accuracy = 0.268000
k = 12, accuracy = 0.302000
k = 12, accuracy = 0.287000
k = 12, accuracy = 0.280000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.269000
k = 15, accuracy = 0.299000
k = 15, accuracy = 0.294000
k = 15, accuracy = 0.291000
k = 15, accuracy = 0.283000
k = 20, accuracy = 0.265000
k = 20, accuracy = 0.291000
k = 20, accuracy = 0.290000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.282000
k = 50, accuracy = 0.274000
k = 50, accuracy = 0.289000
k = 50, accuracy = 0.276000
k = 50, accuracy = 0.264000
k = 50, accuracy = 0.273000
k = 100, accuracy = 0.265000
k = 100, accuracy = 0.274000
k = 100, accuracy = 0.265000
k = 100, accuracy = 0.259000
k = 100, accuracy = 0.265000
From these results, k = 8 achieves the best accuracy.
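A quick way to confirm this choice (my sketch, not part of the notebook) is to average each k's five fold accuracies and take the best:

# Average per-fold accuracies for each candidate k and pick the maximum.
mean_acc = {k: np.mean(accs) for k, accs in k_to_accuracies.items()}
best_k = max(mean_acc, key=mean_acc.get)
print 'best k = %d, mean accuracy = %f' % (best_k, mean_acc[best_k])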
Then, with k = 8, train once more on the entire training set, and finally measure accuracy on the test set:
best_k = 8
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)
# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)
This gives a final accuracy of:
Got 145 / 500 correct => accuracy: 0.290000
Even after this tuning, kNN reaches only about 29% accuracy, which shows that kNN is not well suited to image classification tasks.