K-Means Algorithm Summary and Python Implementation

Author: 榴莲酥君 | Published 2017-04-13 11:35

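In clustering, we are given a training set $\{x^{(1)}, \dots, x^{(m)}\}$ and want to group these inputs into a number of clusters. (Jianshu's Markdown does not render math formulas; see my other blog for a rendered version.) Here each $x^{(i)} \in \mathbb{R}^n$, but no sample comes with a label $y^{(i)}$, so this is an unsupervised learning problem.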

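The main idea of K-Means is to find k centroids and assign each sample to the cluster of its nearest centroid, so that all samples are partitioned into k clusters (see Wikipedia for a detailed introduction to K-Means). The basic algorithm is as follows: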

Pseudocode of K-Means

Step 1: randomly initialize the positions of the k centroids.

Step 2 is an iterative loop:

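First, for each sample $x^{(i)}$, find the centroid nearest to it and assign the sample to that centroid's cluster; call it cluster $j$.
Then, once all samples have been assigned, recompute each centroid as the mean of the samples currently in its cluster.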
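
In symbols, the two sub-steps can be written roughly as follows. The notation is an assumption added for illustration: $c^{(i)}$ denotes the index of the cluster that sample $x^{(i)}$ is assigned to, $\mu_j$ denotes the centroid of cluster $j$, and $1\{\cdot\}$ is the indicator function.

$$
c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2, \qquad
\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}
$$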

Repeat the two sub-steps above until convergence, that is, until the centroids no longer change between iterations.
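
Before the full listing, the loop described above can be sketched compactly with NumPy broadcasting. This is only a minimal sketch under stated assumptions, not the implementation used in this post: the name `kmeans_sketch`, the parameters `n_iter` and `seed`, and the choice to leave an empty cluster's centroid where it was are all hypothetical.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd iteration for an (m, n) float array X (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct samples as the initial centroids.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2, first sub-step: squared distance from every sample
        # to every centroid, shape (m, k), then the nearest centroid index.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2, second sub-step: move each centroid to the mean of its samples.
        # Assumption of this sketch: an empty cluster keeps its old centroid.
        new_mu = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels


# Two well-separated pairs of points should end up in two different clusters.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
mu_demo, labels_demo = kmeans_sketch(X_demo, k=2)
```

Returning a label array instead of a dict of point lists is a design choice of the sketch; it keeps both sub-steps vectorized.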

A simple Python implementation of K-Means is as follows:

__author__ = 'bin'
# reference: https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

import numpy as np
import random
import matplotlib.pyplot as plt


# Lloyd's algorithm
# inner loop step 1
def cluster_points(X, mu):
    clusters = {}  # maps centroid index -> list of points assigned to it

    for x in X:
        # enumerate(mu) yields (index, centroid) pairs; min() picks the pair whose
        # centroid is nearest to x, and [0] keeps only its index (an int).
        bestmukey = min(enumerate(mu), key=lambda t: np.linalg.norm(x - t[1]))[0]

        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters


# inner loop step 2, (update the mu)
def reevaluate_centers(mu, clusters):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        print(len(clusters[k]))  # cluster size, printed for debugging
        newmu.append(np.mean(clusters[k], axis=0))

    return newmu


def has_converged(mu, oldmu):
    # A tuple is a sequence of immutable Python objects.
    # tuple is using (), list is using [], dict is using {}
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))


def find_centers(X, K):
    # Initialize to K random centers
    # random.sample needs a sequence in Python 3, so pass the rows as a list
    oldmu = random.sample(list(X), K)
    mu = random.sample(list(X), K)

    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers (update the centers)
        mu = reevaluate_centers(oldmu, clusters)
    return (mu, clusters)


# The initial configuration of points for the algorithm is created as follows:
def init_board(N):
    # random.uniform:
    # Draw samples from a uniform distribution
    X = np.array([(random.uniform(-1, 1), random.uniform(-1, 1)) for i in range(N)])

    return X


# The following routine constructs a specified number of Gaussian distributed clusters with random variances:
def init_board_gauss(N, k):
    n = float(N) / k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
            # Continue drawing points from the distribution in the range [-1,1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X


if __name__ == "__main__":
    X = init_board(100)
    K = 4
    mu, clusters = find_centers(X, K)

    x = []
    y = []
    for i in range(K):
        lx = []
        ly = []
        for l0 in clusters[i]:
            lx.append(l0[0])
            ly.append(l0[1])
        x.append(lx)
        y.append(ly)

    for i in range(K):
        plt.plot(x[i], y[i], 'o')
        plt.plot(mu[i][0], mu[i][1], 's', markersize=10)

    plt.show()

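The program sets $k=4$. As the plot shows, once the algorithm converges, the samples drawn from a uniform distribution have been partitioned into four clusters.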
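
One way to check the result numerically, rather than only by eye, is to compute the within-cluster sum of squared distances. The sketch below is an illustration, not part of the original program; `within_cluster_ss` is a hypothetical helper, and it assumes that the keys of `clusters` are valid indices into `mu`, which holds as long as no cluster ends up empty.

```python
import numpy as np


def within_cluster_ss(mu, clusters):
    """Sum of squared distances from each point to the center of its cluster."""
    total = 0.0
    for j, points in clusters.items():
        # Assumption: key j of `clusters` indexes the matching center in `mu`.
        diffs = np.asarray(points) - np.asarray(mu[j])
        total += float((diffs ** 2).sum())
    return total
```

Calling `within_cluster_ss(mu, clusters)` on the values returned by `find_centers` gives a single number; for the same data and the same k, a lower value means the points sit closer to their centers.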