Clustering Implementation

Author: 单调不减 | Published 2019-05-23 23:23

    1. k-means

    The idea of k-means is simple enough to state in one sentence: randomly pick K centers, assign each sample to the cluster of its nearest center, recompute the K cluster centers, and repeat until the decrease in the error falls below some threshold.
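
    As a concrete sketch of that one-sentence description, here is a minimal NumPy implementation of Lloyd's algorithm (the function and variable names are mine, purely for illustration):

    import numpy as np
    
    def simple_kmeans(X, k, n_iter=100, tol=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly pick K samples as the initial centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign every sample to its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each center as the mean of its cluster.
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Stop once the centers barely move.
            if np.linalg.norm(new_centers - centers) < tol:
                break
            centers = new_centers
        return labels, centers
    
    # (A real implementation, such as sklearn's KMeans, also handles empty clusters
    # and restarts from several initializations.)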

    What is often overlooked, however, is the range of data for which k-means is appropriate, i.e. when it performs well and when it does not. The effectiveness of any machine-learning algorithm depends on the data; no algorithm works well everywhere. So what kind of data suits k-means? In other words, what conditions should the data satisfy for k-means to produce a good partition into clusters?

    When we use the within-cluster sum of squared errors as the objective, the implicit assumption is that the clusters are convex and isotropic (informally, spherical), and that they have comparable sizes (numbers of samples) and densities (variances).
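
    That objective is the within-cluster sum of squares, which scikit-learn exposes as the fitted model's inertia_ attribute; a quick sketch of the equivalence (the toy data is arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=3, random_state=0).fit(X)
    
    # Within-cluster sum of squared distances, computed by hand.
    wcss = sum(((X[km.labels_ == j] - c) ** 2).sum()
               for j, c in enumerate(km.cluster_centers_))
    print(wcss, km.inertia_)  # the two numbers agree up to floating-point error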

    Let's see how k-means behaves when the data violates these assumptions:
    https://scikit-learn.org/stable/modules/clustering.html

    import numpy as np
    import matplotlib.pyplot as plt
    
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    
    plt.figure(figsize=(12, 12))
    
    n_samples = 1500
    random_state = 177
    # make_blobs defaults to centers=3, i.e. three blobs are generated
    X, y = make_blobs(n_samples=n_samples, random_state=random_state)
    
    # Incorrect number of clusters
    y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)
    
    plt.subplot(221)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.title("Incorrect Number of Blobs")
    
    # Anisotropicly distributed data
    transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
    X_aniso = np.dot(X, transformation)
    y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)
    
    plt.subplot(222)
    plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
    plt.title("Anisotropicly Distributed Blobs")
    
    # Different variance
    X_varied, y_varied = make_blobs(n_samples=n_samples,
                                    cluster_std=[1.0, 2.5, 0.5],
                                    random_state=random_state)
    y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)
    
    plt.subplot(223)
    plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
    plt.title("Unequal Variance")
    
    # Unevenly sized blobs
    # np.vstack stacks arrays vertically (row-wise) into a single new array
    X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
    y_pred = KMeans(n_clusters=3,
                    random_state=random_state).fit_predict(X_filtered)
    
    plt.subplot(224)
    plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
    plt.title("Unevenly Sized Blobs")
    
    plt.show()
    

    Figure 1 shows the result when K is chosen incorrectly; a simple heuristic for choosing K is sketched after these figure notes.

    Figure 2 shows irregularly shaped (non-spherical) clusters; here the k-means result is clearly unsatisfactory.

    Figure 3 shows clusters with unequal variances. A high-variance cluster has a harder time "holding on to" its samples: the partition shows that the low-variance clusters "steal" samples from the high-variance one, which degrades the result.

    Figure 4 shows clusters with very different numbers of samples. The centers tend to drift toward the larger clusters, so some samples of a large cluster are assigned to the wrong cluster (in the figure one blob in fact ends up with two centers, which splits off a large share of its samples).
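
    On the question raised under Figure 1 of how to choose K, one common heuristic (not part of the original example) is to sweep candidate values and look for an "elbow" where the inertia stops improving quickly:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    
    X, _ = make_blobs(n_samples=1500, random_state=177)  # same blobs as above
    
    # Inertia always decreases as K grows; look for the point where the
    # improvement levels off (the "elbow").
    ks = range(1, 9)
    inertias = [KMeans(n_clusters=k, random_state=177).fit(X).inertia_ for k in ks]
    
    plt.plot(list(ks), inertias, 'o-')
    plt.xlabel('K')
    plt.ylabel('inertia')
    plt.show()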

    Beyond these four cases, k-means has another limitation: the curse of dimensionality. High dimensionality is a hard problem for almost every machine-learning algorithm, and all the more so for k-means, which defines similarity through distances. When clustering high-dimensional data we therefore usually apply a suitable dimensionality-reduction method first (such as PCA) and then cluster the reduced data.
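
    That "reduce first, then cluster" recipe can be written as a small scikit-learn Pipeline; a minimal sketch (the number of components kept here is an arbitrary illustrative choice):

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    
    X = load_digits().data
    
    # Project onto a handful of principal components, then cluster the projection.
    model = Pipeline([('pca', PCA(n_components=10)),
                      ('kmeans', KMeans(n_clusters=10, random_state=0))])
    labels = model.fit_predict(X)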

    Below is a clustering task on the handwritten-digits dataset, where we first reduce the dimensionality with PCA and then cluster with k-means:
    https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

    from time import time
    import numpy as np
    import matplotlib.pyplot as plt
    
    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import scale
    
    np.random.seed(42)
    
    digits = load_digits()
    data = scale(digits.data)
    
    n_samples, n_features = data.shape
    n_digits = len(np.unique(digits.target))
    labels = digits.target
    
    sample_size = 300
    
    print("n_digits: %d, \t n_samples %d, \t n_features %d"
          % (n_digits, n_samples, n_features))
    
    
    print(82 * '_')
    print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')
    
    
    def bench_k_means(estimator, name, data):
        t0 = time()
        estimator.fit(data)
        print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
              % (name, (time() - t0), estimator.inertia_,
                 metrics.homogeneity_score(labels, estimator.labels_),
                 metrics.completeness_score(labels, estimator.labels_),
                 metrics.v_measure_score(labels, estimator.labels_),
                 metrics.adjusted_rand_score(labels, estimator.labels_),
                 metrics.adjusted_mutual_info_score(labels,  estimator.labels_,
                                                    average_method='arithmetic'),
                 metrics.silhouette_score(data, estimator.labels_,
                                          metric='euclidean',
                                          sample_size=sample_size)))
    
    bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
                  name="k-means++", data=data)
    
    bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
                  name="random", data=data)
    
    # in this case the seeding of the centers is deterministic, hence we run the
    # kmeans algorithm only once with n_init=1
    pca = PCA(n_components=n_digits).fit(data)
    bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
                  name="PCA-based",
                  data=data)
    print(82 * '_')
    
    # #############################################################################
    # Visualize the results on PCA-reduced data
    
    reduced_data = PCA(n_components=2).fit_transform(data)
    kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
    kmeans.fit(reduced_data)
    
    # Step size of the mesh. Decrease to increase the quality of the VQ.
    h = .02     # point in the mesh [x_min, x_max]x[y_min, y_max].
    
    # Plot the decision boundary. For that, we will assign a color to each point in the mesh.
    x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
    y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # Obtain labels for each point in mesh. Use last trained model.
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
    
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')
    
    plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
    # Plot the centroids as a white X
    centroids = kmeans.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=10)
    plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
              'Centroids are marked with white cross')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()
    
    out:
    n_digits: 10,    n_samples 1797,     n_features 64
    __________________________________________________________________________________
    init        time    inertia homo    compl   v-meas  ARI AMI silhouette
    k-means++   0.55s   69432   0.602   0.650   0.625   0.465   0.621   0.146
    random      0.40s   69694   0.669   0.710   0.689   0.553   0.686   0.147
    PCA-based   0.09s   70804   0.671   0.698   0.684   0.561   0.681   0.118
    __________________________________________________________________________________
    

    2. Gaussian mixture model

    On the iris dataset we plot the predicted labels from GMMs trained with various covariance types, show the held-out test data as well, and compare the diagonal, spherical, tied and full covariance types in increasing order of performance.

    Although full covariance is expected to perform best in general, it is prone to overfitting small datasets and does not generalize well to held-out test data (diag means each component has its own diagonal covariance matrix; spherical means each component has its own single variance; tied means all components share one full covariance matrix; full means each component has its own full covariance matrix).
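
    One way to make the four covariance_type options concrete is to look at the shape of the fitted covariances_ array for each; a small sketch on random data (not part of the original example):

    import numpy as np
    from sklearn.mixture import GaussianMixture
    
    X = np.random.RandomState(0).randn(500, 4)  # 500 samples, 4 features
    
    for cov_type in ['spherical', 'diag', 'tied', 'full']:
        gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                              random_state=0).fit(X)
        # spherical -> (3,)       one variance per component
        # diag      -> (3, 4)     one diagonal per component
        # tied      -> (4, 4)     a single full matrix shared by all components
        # full      -> (3, 4, 4)  one full matrix per component
        print(cov_type, gmm.covariances_.shape)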

    https://scikit-learn.org/stable/modules/mixture.html#mixture

    import matplotlib as mpl
    import matplotlib.pyplot as plt
    
    import numpy as np
    
    from sklearn import datasets
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import StratifiedKFold
    
    print(__doc__)
    
    colors = ['navy', 'turquoise', 'darkorange']
    
    def make_ellipses(gmm, ax):
        # Draw one covariance ellipse per mixture component.
        for n, color in enumerate(colors):
            if gmm.covariance_type == 'full':
                covariances = gmm.covariances_[n][:2, :2]
            elif gmm.covariance_type == 'tied':
                covariances = gmm.covariances_[:2, :2]
            elif gmm.covariance_type == 'diag':
                covariances = np.diag(gmm.covariances_[n][:2])
            elif gmm.covariance_type == 'spherical':
                covariances = np.eye(gmm.means_.shape[1]) * gmm.covariances_[n]
            v, w = np.linalg.eigh(covariances)
            u = w[0] / np.linalg.norm(w[0])
            angle = np.arctan2(u[1], u[0])
            angle = 180 * angle / np.pi  # convert to degrees
            v = 2. * np.sqrt(2.) * np.sqrt(v)
            ell = mpl.patches.Ellipse(gmm.means_[n, :2], v[0], v[1],
                                      angle=180 + angle, color=color)
            ell.set_clip_box(ax.bbox)
            ell.set_alpha(0.5)
            ax.add_artist(ell)
            ax.set_aspect('equal', 'datalim')
    
    iris = datasets.load_iris()
    
    # Break up the dataset into non-overlapping training (75%) and testing
    # (25%) sets.
    skf = StratifiedKFold(n_splits=4)
    # Only take the first fold.
    train_index, test_index = next(iter(skf.split(iris.data, iris.target)))
    
    X_train = iris.data[train_index]
    y_train = iris.target[train_index]
    X_test = iris.data[test_index]
    y_test = iris.target[test_index]
    
    n_classes = len(np.unique(y_train))
    
    # Try GMMs using different types of covariances.
    estimators = {cov_type: GaussianMixture(n_components=n_classes,
                  covariance_type=cov_type, max_iter=20, random_state=0)
                  for cov_type in ['spherical', 'diag', 'tied', 'full']}
    
    n_estimators = len(estimators)
    
    plt.figure(figsize=(3 * n_estimators // 2, 6))
    plt.subplots_adjust(bottom=.01, top=0.95, hspace=.15, wspace=.05,
                        left=.01, right=.99)
    
    for index, (name, estimator) in enumerate(estimators.items()):
        # Since we have class labels for the training data, we can
        # initialize the GMM parameters in a supervised manner.
        estimator.means_init = np.array([X_train[y_train == i].mean(axis=0)
                                         for i in range(n_classes)])
    
        # Train the other parameters using the EM algorithm.
        estimator.fit(X_train)
    
        h = plt.subplot(2, n_estimators // 2, index + 1)
        make_ellipses(estimator, h)
    
        for n, color in enumerate(colors):
            data = iris.data[iris.target == n]
            plt.scatter(data[:, 0], data[:, 1], s=0.8, color=color,
                        label=iris.target_names[n])
        # Plot the test data with crosses
        for n, color in enumerate(colors):
            data = X_test[y_test == n]
            plt.scatter(data[:, 0], data[:, 1], marker='x', color=color)
    
        y_train_pred = estimator.predict(X_train)
        train_accuracy = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
        plt.text(0.05, 0.9, 'Train accuracy: %.1f' % train_accuracy,
                 transform=h.transAxes)
    
        y_test_pred = estimator.predict(X_test)
        test_accuracy = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
        plt.text(0.05, 0.8, 'Test accuracy: %.1f' % test_accuracy,
                 transform=h.transAxes)
    
        plt.xticks(())
        plt.yticks(())
        plt.title(name)
    
    plt.legend(scatterpoints=1, loc='lower right', prop=dict(size=12))
    
    plt.show()
    
    

    3. DBSCAN

    DBSCAN treats clusters as high-density regions separated by regions of low density. Unlike k-means, DBSCAN does not need the number of clusters to be specified; it infers the number of clusters from the data, and it can find clusters of arbitrary shape.

    DBSCAN first requires two parameters to be set:

    • epsilon: the radius of the neighborhood around a point
    • minPts: the minimum number of points that neighborhood must contain

    Based on these two parameters and the resulting epsilon-neighborhoods, the points of a sample fall into three categories (a small sketch that recovers all three from a fitted model follows this list):

    • Core point: a point p satisfying NBHD(p, epsilon) >= minPts
    • Border point: a point with NBHD(p, epsilon) < minPts that is nevertheless reachable from some core point (directly reachable or density-reachable)
    • Outlier (noise): a point that is neither a core point nor a border point, and so belongs to no cluster
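
    All three categories can be read off a fitted scikit-learn DBSCAN object; a small sketch (the toy data here is just an illustrative stand-in):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
    db = DBSCAN(eps=0.5, min_samples=10).fit(X)
    
    core_mask = np.zeros(len(X), dtype=bool)
    core_mask[db.core_sample_indices_] = True   # core points
    noise_mask = db.labels_ == -1               # outliers are labelled -1
    border_mask = ~core_mask & ~noise_mask      # in a cluster but not core
    
    print(core_mask.sum(), border_mask.sum(), noise_mask.sum())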

    Let's now see how DBSCAN behaves on an irregularly shaped dataset, in contrast with k-means:

    https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py

    import numpy as np
    
    from sklearn.cluster import DBSCAN
    from sklearn import metrics
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler
    
    
    # #############################################################################
    # Generate sample data
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.3,
                                random_state=12)
    
    X = StandardScaler().fit_transform(X)
    
    transformation = [[0.80834549, -0.43667341], [-0.40887718, 0.85253229]]
    X = np.dot(X, transformation)
    # #############################################################################
    # Compute DBSCAN
    db = DBSCAN(eps=0.3, min_samples=10).fit(X)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_
    
    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)
    
    print('Estimated number of clusters: %d' % n_clusters_)
    print('Estimated number of noise points: %d' % n_noise_)
    print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
    print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
    print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
    print("Adjusted Rand Index: %0.3f"
          % metrics.adjusted_rand_score(labels_true, labels))
    print("Adjusted Mutual Information: %0.3f"
          % metrics.adjusted_mutual_info_score(labels_true, labels,
                                               average_method='arithmetic'))
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(X, labels))
    
    # #############################################################################
    # Plot result
    import matplotlib.pyplot as plt
    
    # Black removed and is used for noise instead.
    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each)
              for each in np.linspace(0, 1, len(unique_labels))]
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]
    
        class_member_mask = (labels == k)
    
        xy = X[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=14)
    
        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=6)
    
    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.show()
    
