Improving Dating-Site Matching with the k-Nearest Neighbors Algorithm

Author: 梦vctor | Published: 2018-10-17 15:53


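First, get the data. Helen's dating data is stored in the text file datingTestSet.txt, with one sample per line and 1000 lines in total.
Download: http://www.manning.com/MachineLearninginAction
then click Source Code on the left.
Alternatively: http://www.ituring.com.cn/book/1021 and click "随书下载" (book downloads) on the right.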

Preparing the data: parsing data from a text file

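# Parse the text records into a NumPy matrix
from numpy import zeros

def file2matrix(filename):
    # Read all lines; the with-block closes the file afterwards
    with open(filename) as fr:
        arrayOLines=fr.readlines()
    # Number of lines in the file
    numberOfLines=len(arrayOLines)
    returnMat=zeros((numberOfLines,3))      # NumPy matrix to return
    classLabelVector=[]                     # class labels to return
    index=0
    for line in arrayOLines:
        # Strip the trailing newline and surrounding whitespace
        line = line.strip()
        # Split the line on tabs into a list of fields
        listFromLine=line.split('\t')
        # The first three fields are the features
        returnMat[index,:]=listFromLine[0:3]
        # The last field is the class label
        classLabelVector.append(int(listFromLine[-1]))
        index+=1
    return returnMat,classLabelVector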
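
As an alternative to parsing line by line, NumPy's loadtxt can read a purely numeric tab-separated file in one call. The following is a minimal sketch, not the book's code: load_dating_data is a hypothetical helper name, the sample rows are made up for illustration, and it assumes every column (including the label column) is numeric.

```python
import io
import numpy as np

def load_dating_data(source):
    """Read tab-separated rows of three features plus an integer label.

    `source` may be a filename or an open text file (hypothetical helper).
    """
    table = np.loadtxt(source, delimiter='\t', ndmin=2)
    features = table[:, 0:3]                      # first three columns
    labels = table[:, -1].astype(int).tolist()    # last column as Python ints
    return features, labels

# In-memory stand-in for the real file; the values are illustrative only
sample = io.StringIO("40920\t8.3\t0.95\t3\n14488\t7.1\t1.67\t2\n")
features, labels = load_dating_data(sample)
print(features.shape, labels)
```

The explicit loop in file2matrix is easier to adapt when the label column needs custom handling, which is why the book-style version is kept as the main listing.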

Debug:
ValueError: invalid literal for int() with base 10: 'largeDoses'
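The last column of datingTestSet.txt holds text labels such as 'largeDoses', which int() cannot convert, so the file name in the book's example should be changed from datingTestSet.txt to datingTestSet2.txt.
Change datingDataMat,datingLabels=KNN.file2matrix('datingTestSet.txt') to datingDataMat,datingLabels=KNN.file2matrix('datingTestSet2.txt')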
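
If switching files is not an option, the label field could instead be converted with a small lookup. This is only a sketch under assumptions: parse_label and LABEL_CODES are hypothetical names, and of the three string labels only 'largeDoses' appears in the error message above; 'didntLike' and 'smallDoses' are assumed spellings that should be checked against the actual file.

```python
# Assumed mapping from text labels to the integer classes 1, 2, 3.
# Only 'largeDoses' is confirmed by the error message; the other keys are assumptions.
LABEL_CODES = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}

def parse_label(token):
    """Return an integer class for either an integer or a text label."""
    token = token.strip()
    if token in LABEL_CODES:
        return LABEL_CODES[token]
    return int(token)   # still raises ValueError for unrecognized text

print(parse_label('largeDoses'), parse_label('2'))
```

Inside file2matrix, the call int(listFromLine[-1]) would then become parse_label(listFromLine[-1]).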

Analyzing the data: creating scatter plots with Matplotlib

>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig=plt.figure()
>>> ax=fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
<matplotlib.collections.PathCollection object at 0x0000029E3B9C57F0>
>>> plt.show()

Output:

[Figure: scatter plot of datingDataMat[:,1] against datingDataMat[:,2]]

#coding: utf-8
from numpy import *                     # NumPy; array is used below
from KNN import *                       # the module that loads the data (file2matrix)
import matplotlib                       # plotting library
import matplotlib.pyplot as plt         # pyplot under its usual alias
from mpl_toolkits.mplot3d import Axes3D  # 3D plotting support
plt.rcParams['font.sans-serif'] = ['FangSong'] # use the FangSong font so Chinese text renders
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign from rendering as a box


datingDataMat,datingLabels=file2matrix('datingTestSet2.txt')
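fig1=plt.figure()   # create the figure

#------------------- 2D -----------
# First subplot: scatter plot of feature 2 against feature 3, without class colors
ax=fig1.add_subplot(2,2,1)  # first cell of a 2x2 grid (cells are numbered left to right, top to bottom)
ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
plt.xlabel('x: percentage of time spent playing video games')
plt.ylabel('y: liters of ice cream consumed per week')
plt.title('Figure 1 (2 && 3)')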

# Empty coordinate lists for the three classes
type1_x=[]
type1_y=[]
type2_x=[]
type2_y=[]
type3_x=[]
type3_y=[]


# Second subplot: scatter plot of feature 2 against feature 3, colored by class
ax=fig1.add_subplot(2,2,2)  # second cell of the 2x2 grid

# Collect the points of each class
for i in range(len(datingLabels)):
    if datingLabels[i]==1:  # did not like
        type1_x.append(datingDataMat[i][1])
        type1_y.append(datingDataMat[i][2])

    if datingLabels[i]==2:  # average appeal
        type2_x.append(datingDataMat[i][1])
        type2_y.append(datingDataMat[i][2])

    if datingLabels[i]==3:  # very appealing
        type3_x.append(datingDataMat[i][1])
        type3_y.append(datingDataMat[i][2])

type1=ax.scatter(type1_x,type1_y,s=20,c='red')
type2=ax.scatter(type2_x,type2_y,s=40,c='green')
type3=ax.scatter(type3_x,type3_y,s=50,c='blue')
ax.legend((type1,type2,type3),('Did not like','Average appeal','Very appealing'),loc=2)   # legend location codes: 1 upper right, 2 upper left, 3 lower left, 4 lower right (counterclockwise)
#ax.scatter(datingDataMat[:,0],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
plt.xlabel('x: percentage of time spent playing video games')
plt.ylabel('y: liters of ice cream consumed per week')
plt.title('Figure 2 (2 && 3)')

# Third subplot: scatter plot of feature 1 against feature 2
ax=fig1.add_subplot(2,2,3)  # third cell of the 2x2 grid
# Clear the lists so points from the previous subplot are not plotted again
type1_x,type1_y=[],[]
type2_x,type2_y=[],[]
type3_x,type3_y=[],[]
# Collect the points of each class
for i in range(len(datingLabels)):
    if datingLabels[i]==1:  # did not like
        type1_x.append(datingDataMat[i][0])
        type1_y.append(datingDataMat[i][1])

    if datingLabels[i]==2:  # average appeal
        type2_x.append(datingDataMat[i][0])
        type2_y.append(datingDataMat[i][1])

    if datingLabels[i]==3:  # very appealing
        type3_x.append(datingDataMat[i][0])
        type3_y.append(datingDataMat[i][1])

type1=ax.scatter(type1_x,type1_y,s=20,c='red')
type2=ax.scatter(type2_x,type2_y,s=40,c='green')
type3=ax.scatter(type3_x,type3_y,s=50,c='blue')
ax.legend((type1,type2,type3),('Did not like','Average appeal','Very appealing'),loc=2)   # legend location codes: 1 upper right, 2 upper left, 3 lower left, 4 lower right (counterclockwise)
#ax.scatter(datingDataMat[:,0],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
plt.xlabel('x: frequent flier miles earned per year')
plt.ylabel('y: percentage of time spent playing video games')
plt.title('Figure 3 (1 && 2)')

# Fourth subplot: scatter plot of feature 1 against feature 3
ax=fig1.add_subplot(2,2,4)  # fourth cell of the 2x2 grid
# Clear the lists so points from the previous subplot are not plotted again
type1_x,type1_y=[],[]
type2_x,type2_y=[],[]
type3_x,type3_y=[],[]
# Collect the points of each class
for i in range(len(datingLabels)):
    if datingLabels[i]==1:  # did not like
        type1_x.append(datingDataMat[i][0])
        type1_y.append(datingDataMat[i][2])

    if datingLabels[i]==2:  # average appeal
        type2_x.append(datingDataMat[i][0])
        type2_y.append(datingDataMat[i][2])

    if datingLabels[i]==3:  # very appealing
        type3_x.append(datingDataMat[i][0])
        type3_y.append(datingDataMat[i][2])

type1=ax.scatter(type1_x,type1_y,s=20,c='red')
type2=ax.scatter(type2_x,type2_y,s=40,c='green')
type3=ax.scatter(type3_x,type3_y,s=50,c='blue')
ax.legend((type1,type2,type3),('Did not like','Average appeal','Very appealing'),loc=2)   # legend location codes: 1 upper right, 2 upper left, 3 lower left, 4 lower right (counterclockwise)
#ax.scatter(datingDataMat[:,0],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
plt.xlabel('x: frequent flier miles earned per year')
plt.ylabel('y: liters of ice cream consumed per week')
plt.title('Figure 4 (1 && 3)')

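#------------ 3D -----------
fig2=plt.figure()   # create the figure
ax=fig2.add_subplot(111,projection='3d')   # 3D axes attached to the figure
# One axis per feature; marker size and color encode the class label
ax.scatter(datingDataMat[:,0],datingDataMat[:,1],datingDataMat[:,2],s=15.0*array(datingLabels),c=15.0*array(datingLabels))
ax.set_xlabel('x: frequent flier miles earned per year')
ax.set_ylabel('y: percentage of time spent playing video games')
ax.set_zlabel('z: liters of ice cream consumed per week')

plt.title('Figure 5 (1 && 2 && 3)')
plt.show()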

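Debug: garbled Chinese characters in the axis labels
When the labels contain Chinese text, add this code:
plt.rcParams['font.sans-serif'] = ['FangSong'] # use the FangSong font so Chinese text renders
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign from rendering as a box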
Output:

[Figure: the four 2D scatter plots and the 3D scatter plot produced by the script]
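
The per-class loops in the plotting script can also be written with NumPy boolean masks, which avoids shared lists that have to be emptied between subplots. A sketch under assumptions: split_by_label is a hypothetical helper, and the toy array stands in for datingDataMat.

```python
import numpy as np

def split_by_label(data, labels, x_col, y_col):
    """Map each class label to the (x, y) coordinate arrays of its rows."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    groups = {}
    for label in np.unique(labels):
        mask = labels == label            # True for the rows of this class
        groups[int(label)] = (data[mask, x_col], data[mask, y_col])
    return groups

# Toy stand-in for datingDataMat and datingLabels (illustrative values)
toy_data = np.array([[10.0, 1.0, 0.1],
                     [20.0, 2.0, 0.2],
                     [30.0, 3.0, 0.3],
                     [40.0, 4.0, 0.4]])
toy_labels = [1, 2, 1, 3]
groups = split_by_label(toy_data, toy_labels, 1, 2)
print(groups[1][0])
```

Each (x, y) pair can then be passed to ax.scatter with its own size and color, one call per class, exactly as the script does with the type1, type2 and type3 lists.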

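Preparing the data: normalizing values

Normalization keeps a feature with a large numeric range from dominating the distance calculation, so that every feature carries equal weight. Each value is rescaled to the range 0 to 1: newValue = (oldValue - min) / (max - min).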

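# Normalize the feature values
from numpy import zeros, shape, tile

def autoNorm(dataSet):
    # Per-column minimums: min(0) takes the minimum down each column, not along each row
    minVals=dataSet.min(0)
    # Per-column maximums
    maxVals=dataSet.max(0)
    ranges=maxVals-minVals
    normDataSet=zeros(shape(dataSet))
    m=dataSet.shape[0]
    # Subtract the column minimum from every row, then divide by the column range
    normDataSet=dataSet-tile(minVals,(m,1))
    normDataSet=normDataSet/tile(ranges,(m,1))
    return normDataSet,ranges,minVals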

# Check the function's output
normMat,ranges,minVals=KNN.autoNorm(datingDataMat)
print(normMat)
print(ranges)
print(minVals)

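Debug: <class 'range'>
This appears when autoNorm returns the built-in range rather than the computed array; change the return value from range to ranges.

Output: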

[[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 [0.28542943 0.06892523 0.47449629]
 ...
 [0.29115949 0.50910294 0.51079493]
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]
[9.1273000e+04 2.0919349e+01 1.6943610e+00]
[0.       0.       0.001156]
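
The same min-max rescaling can be expressed with NumPy broadcasting instead of tile. This is a sketch, not the code in KNN.py: auto_norm_broadcast is a hypothetical name, and the guard for constant columns is an addition that the original function does not have.

```python
import numpy as np

def auto_norm_broadcast(data_set):
    """Rescale every column to [0, 1] via (value - min) / (max - min)."""
    data_set = np.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)               # per-column minimum
    ranges = data_set.max(axis=0) - min_vals      # per-column max - min
    # Addition not in the original: avoid dividing by zero for constant columns
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    norm = (data_set - min_vals) / safe_ranges    # broadcasts across rows
    return norm, ranges, min_vals

toy = np.array([[0.0, 10.0], [50.0, 20.0], [100.0, 30.0]])
norm, ranges, min_vals = auto_norm_broadcast(toy)
print(norm)
```

Broadcasting applies the row vectors min_vals and safe_ranges to every row of the matrix, which is what tile(minVals,(m,1)) builds explicitly.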

Testing the algorithm

# Test code for the dating-site classifier
def datingClassTest():
    hoRatio = 0.50      # hold out 50% of the data as the test set
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')       # load the data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))
    print(errorCount)

# Run the test (the function prints its own results and returns nothing)
KNN.datingClassTest()

Output:

the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
......
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the total error rate is: 0.066000
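
datingClassTest calls classify0, which is not listed in this post. The following is a minimal sketch of a k-nearest-neighbors classifier with the same argument order (query vector, training matrix, labels, k), assuming Euclidean distance and a simple majority vote. The name knn_classify is hypothetical, and this is not necessarily the code in KNN.py.

```python
from collections import Counter
import numpy as np

def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training rows closest to in_x."""
    data_set = np.asarray(data_set, dtype=float)
    diffs = data_set - np.asarray(in_x, dtype=float)   # broadcast over rows
    distances = np.sqrt((diffs ** 2).sum(axis=1))      # Euclidean distance per row
    nearest = distances.argsort()[:k]                  # indices of the k closest rows
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (illustrative values, not the dating data)
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_labels = ['A', 'A', 'B', 'B']
print(knn_classify([0.1, 0.2], train, train_labels, 3))   # prints B
```

For the query [0.1, 0.2], the three nearest rows are the two 'B' points and one 'A' point, so the vote goes to 'B'.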

Using the algorithm: building a complete, usable system

# Prediction function for the dating site
def classifyPerson():
    resultList=['not at all','in small doses','in large doses']
    percentTats=float(input("percentage of time spent playing video games?"))
    ffMiles=float(input("frequent flier miles earned per year?"))
    iceCream=float(input("liters of ice cream consumed per year?"))
    datingDataMat,datingLabels=file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals=autoNorm(datingDataMat)
    inArr=array([ffMiles,percentTats,iceCream])
    classifierResult=classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
    print("You will probably like this person: ",resultList[classifierResult -1] )

# Run the prediction (the function prints its own result and returns nothing)
KNN.classifyPerson()

Output:

percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person:  in small doses
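
Before classification, the new person's values are rescaled with the minimums and ranges of the training data, which is what (inArr-minVals)/ranges does. As a worked sketch in plain Python, using the ranges and minVals printed by autoNorm earlier in this post and the inputs from the session above (normalize_query is a hypothetical helper name):

```python
# Values copied from the autoNorm output shown earlier in this post
ranges = [9.1273000e+04, 2.0919349e+01, 1.6943610e+00]
min_vals = [0.0, 0.0, 0.001156]

def normalize_query(query, min_vals, ranges):
    """Apply (value - min) / range to each feature of a single sample."""
    return [(q - lo) / span for q, lo, span in zip(query, min_vals, ranges)]

# Feature order matches the data file: flier miles, game time percentage, ice cream
query = [10000.0, 10.0, 0.5]
normalized = normalize_query(query, min_vals, ranges)
print([round(v, 4) for v in normalized])
```

Note that classifyPerson prompts for the game percentage first but builds inArr as [ffMiles, percentTats, iceCream], so the vector matches the column order of the training matrix.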
