Machine Learning in Action: Grouping Unlabeled Data with the k-Means Clustering Algorithm


Author: mov觉得高数好难 | Published: 2017-08-12 20:30

    Clustering is a form of unsupervised learning that groups similar objects into the same cluster. It is a bit like fully automatic classification.
    Cluster identification gives meaning to the clustering result: given some data, clustering puts similar items together, and cluster identification tells us what each of those clusters actually represents.

    K-means clustering
    Pros: easy to implement
    Cons: may converge to a local minimum; converges slowly on large data sets

    Pseudocode for the k-means algorithm:

    Create k points as starting centroids (often chosen at random)
    While any point's cluster assignment changes
      For every point in the data set
        For every centroid
          Compute the distance between the centroid and the point
        Assign the point to the cluster with the nearest centroid
      For every cluster, compute the mean of all its points and use that mean as the new centroid
    

    The code implementation is given below:

    # kMeans.py

    from numpy import *

    def loadDataSet(fileName):
        dataMat = []
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # list() is needed in Python 3, where map returns an iterator
            dataMat.append(fltLine)
        return dataMat

    def distEclud(vecA, vecB):
        return sqrt(sum(power(vecA - vecB, 2)))  # Euclidean distance between two vectors

    def randCent(dataSet, k):
        n = shape(dataSet)[1]
        centroids = mat(zeros((k,n)))
        for j in range(n):
            minJ = min(dataSet[:,j])
            rangeJ = float(max(dataSet[:,j]) - minJ)
            centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))  # random values between each column's min and max
        return centroids
    

    Now let's test this code:

    >>> import kMeans
    >>> from numpy import *
    >>> datMat = mat(kMeans.loadDataSet('testSet.txt'))  # build a data matrix
    >>> min(datMat[:,0])
    matrix([[-5.379713]])
    >>> min(datMat[:,1])
    matrix([[-4.232586]])
    >>> max(datMat[:,1])
    matrix([[ 5.1904]])
    >>> max(datMat[:,0])
    matrix([[ 4.838138]])
    >>> kMeans.randCent(datMat,2)
    matrix([[-4.4796995 , -0.70940004],
            [ 0.81371123,  2.24200682]])
    

    The random centroids above will differ from run to run. Finally, test the distance function:

    >>> kMeans.distEclud(datMat[0],datMat[1])
    5.184632816681332
    

    Once all the support functions work correctly, we are ready to try the full k-means algorithm.

    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m,2)))  # one row per point: assigned cluster index and squared error (distance to its centroid)
        centroids = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:  # iterate until no assignment changes
            clusterChanged = False
            for i in range(m):  # find the nearest centroid for each point
                minDist = inf; minIndex = -1
                for j in range(k):
                    distJI = distMeas(centroids[j,:],dataSet[i,:])
                    if distJI < minDist:
                        minDist = distJI; minIndex = j
                if clusterAssment[i,0] != minIndex: clusterChanged = True
                clusterAssment[i,:] = minIndex,minDist**2
            print(centroids)
            for cent in range(k):  # recalculate centroids
                ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]  # get all the points in this cluster
                centroids[cent,:] = mean(ptsInClust, axis=0)  # column-wise mean (axis=0) of the cluster's points
        return centroids, clusterAssment
    

    Now let's see how it runs:

    In [4]: import kMeans
       ...: datMat = mat(kMeans.loadDataSet('testSet.txt'))
       ...: myCentroids, clustAssing = kMeans.kMeans(datMat,4)
       ...: 
    [[-1.07551094 -1.37091103]
     [ 3.16596464 -2.76092215]
     [ 4.21540817 -2.58163381]
     [-3.9285101   2.94186313]]
    [[-1.93673761 -1.58827286]
     [ 2.61992629 -2.31851518]
     [ 3.92807891  1.38550982]
     [-1.74200079  3.04124525]]
    [[-3.38237045 -2.9473363 ]
     [ 2.72031426 -2.83200232]
     [ 2.91014783  2.71954072]
     [-1.94392522  2.96291883]]
    [[-3.38237045 -2.9473363 ]
     [ 2.80293085 -2.7315146 ]
     [ 2.6265299   3.10868015]
     [-2.46154315  2.78737555]]
    
    

    The four centroids are printed at each iteration above; as you can see, the algorithm converged after three iterations.
    SSE (sum of squared error) measures clustering quality: the smaller the SSE, the closer the points lie to their centroids, and the better the clustering.
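
    Since kMeans stores each point's squared distance to its centroid in the second column of clusterAssment, the total SSE can be read off directly. A quick check, reusing the clustAssing matrix returned by the run above:

    # total SSE: sum the squared errors kept in column 1 of the assignment matrix
    totalSSE = sum(clustAssing[:,1])
    print(totalSSE)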
    To overcome the tendency of the k-means algorithm to converge to a local minimum, the bisecting k-means algorithm has been proposed. Its pseudocode is as follows:

    Treat all the points as one cluster
    While the number of clusters is less than k
      For every cluster
        Compute the total error
        Run k-means (with k=2) on the cluster
        Compute the total error after splitting the cluster in two
      Choose the cluster whose split yields the lowest error and perform the split
    

    Another approach is to split the cluster with the largest SSE, repeating until the number of clusters reaches the user-specified value; a sketch of that selection step is given after the listing below. Neither approach is hard to implement, so let's see the full bisecting k-means in action:

    def biKmeans(dataSet, k, distMeas=distEclud):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m,2)))
        centroid0 = mean(dataSet, axis=0).tolist()[0]  # axis=0 gives column means; axis=1 would give row means
        centList = [centroid0]
        for j in range(m):  # compute the initial error of every point
            clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2
        while (len(centList) < k):
            lowestSSE = inf
            for i in range(len(centList)):
                ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]  # .A converts the matrix to an array
                centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
                sseSplit = sum(splitClustAss[:,1])
                sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
                print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
                if (sseSplit + sseNotSplit) < lowestSSE:  # keep the split with the lowest total SSE
                    bestCentToSplit = i
                    bestNewCents = centroidMat
                    bestClustAss = splitClustAss.copy()
                    lowestSSE = sseSplit + sseNotSplit
            bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList)  # relabel the two new clusters
            bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit
            print('the bestCentToSplit is: ', bestCentToSplit)
            print('the len of bestClustAss is: ', len(bestClustAss))
            centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]
            centList.append(bestNewCents[1,:].tolist()[0])  # update the centroid list
            clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:] = bestClustAss
        return mat(centList), clusterAssment
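
    Note that the listing above keeps whichever split minimizes the total SSE. For the largest-SSE variant mentioned before the listing, a minimal sketch of the selection step might look like this (largestSSECluster is a hypothetical helper, not part of the book's code; it assumes the same clusterAssment layout and numpy wildcard import as kMeans.py):

    # Hypothetical helper: pick the cluster with the largest summed squared
    # error as the next one to split.
    def largestSSECluster(clusterAssment, numClusters):
        bestCluster = -1; largestSSE = -inf
        for i in range(numClusters):
            # squared errors (column 1) of the points assigned to cluster i
            sse = sum(clusterAssment[nonzero(clusterAssment[:,0].A == i)[0],1])
            if sse > largestSSE:
                largestSSE = sse; bestCluster = i
        return bestCluster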
    

    Let's look at an actual run:

    In [15]: import kMeans
        ...: datMat3 = mat(kMeans.loadDataSet('testSet2.txt'))
        ...: centList,myNewAssments = kMeans.biKmeans(datMat3,3)
        ...: 
    [[-3.03755889 -1.89572181]
     [-3.22628749 -2.41623849]]
    [[-0.10841893  1.56828455]
     [-0.84797625 -3.576032  ]]
    [[-0.03082922  3.12682161]
     [-0.43154563 -2.87788837]]
    [[-0.00675605  3.22710297]
     [-0.45965615 -2.7782156 ]]
    sseSplit, and notSplit:  453.033489581 0.0
    the bestCentToSplit is:  0
    the len of bestClustAss is:  60
    [[-4.17486974  4.05506292]
     [ 3.04575421  4.02649076]]
    [[-2.94737575  3.3263781 ]
     [ 2.93386365  3.12782785]]
    sseSplit, and notSplit:  77.5922493178 29.1572494441
    [[-1.43611292 -1.57953903]
     [ 0.46179716 -0.89904296]]
    [[-0.93324907 -2.59690843]
     [ 0.645394   -3.20126567]]
    [[-1.07894467 -2.43015258]
     [ 0.46927663 -3.30031012]]
    [[-1.12616164 -2.30193564]
     [ 0.35496167 -3.36033556]]
    sseSplit, and notSplit:  12.7532631369 423.876240137
    the bestCentToSplit is:  0
    the len of bestClustAss is:  40
    

    Now look at the resulting centroids:

    In [17]: centList
    Out[17]: 
    matrix([[-2.94737575,  3.3263781 ],
            [-0.45965615, -2.7782156 ],
            [ 2.93386365,  3.12782785]])
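
    To eyeball the result, you can scatter-plot the points by cluster and overlay the centroids. A minimal sketch, assuming matplotlib is installed and reusing datMat3, centList, and myNewAssments from the session above:

    import matplotlib.pyplot as plt

    # plot each cluster with its own marker, then mark the centroids with crosses
    markers = ['s', 'o', '^']
    for i in range(3):
        ptsInClust = datMat3[nonzero(myNewAssments[:,0].A == i)[0],:]
        plt.scatter(ptsInClust[:,0].A.flatten(), ptsInClust[:,1].A.flatten(),
                    marker=markers[i], s=30)
    plt.scatter(centList[:,0].A.flatten(), centList[:,1].A.flatten(),
                marker='+', s=200, c='k')  # centroids
    plt.show()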
    
    The Yahoo developer API example is temporarily inaccessible; I'll update this post in a few days.
    
