1. K-means
The idea behind k-means is simple and can be stated in one sentence: randomly pick K initial centers, assign every sample to the cluster of its nearest center, recompute the K cluster centers as the means of their members, and repeat until the decrease in the objective falls below some threshold.
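To make that loop concrete, here is a minimal NumPy sketch of the iteration (illustrative only, not the scikit-learn implementation; it assumes a data matrix X of shape (n_samples, n_features) and that no cluster ever becomes empty):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick K distinct samples as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample joins the cluster of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its members
        # (for simplicity this assumes no cluster is left empty).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centers barely move.
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels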
What is often overlooked, however, is the range of data on which k-means works well, i.e. when it performs well and when it does not. The performance of every machine learning algorithm depends on the data; no algorithm is universally superior. So what kind of data suits k-means? Put differently, what conditions should the data satisfy for k-means to produce a good partition?
When we use the within-cluster sum of squared errors as the optimization objective, the implicit assumption is that clusters are convex and isotropic (informally, sphere-shaped blobs), and that the clusters are comparable in size (number of samples) and density (variance).
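For reference, that objective, which scikit-learn exposes as the inertia_ attribute, can be written out in a couple of lines (a sketch, reusing X, centers, and labels as in the iteration above):

def inertia(X, centers, labels):
    # Within-cluster sum of squared distances to each sample's assigned center.
    return ((X - centers[labels]) ** 2).sum()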
Below we look at how k-means behaves when the data violate these assumptions (the example is adapted from the scikit-learn documentation):
(https://scikit-learn.org/stable/modules/clustering.html)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 177

# By default make_blobs uses centers=3, i.e. three clusters are generated.
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Incorrect number of clusters
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropically distributed data
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)
plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropically Distributed Blobs")

# Different variance
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)
plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Unevenly sized blobs
# np.vstack stacks arrays vertically (row-wise) into a single array.
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3,
                random_state=random_state).fit_predict(X_filtered)
plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()
The first subplot shows the result of a poorly chosen K; one common heuristic for choosing K is sketched right after these notes.
The second subplot shows clusters with irregular (non-spherical) shapes; here the k-means partition is clearly unsatisfactory.
The third subplot shows clusters with different variances. A high-variance cluster has a harder time "holding on to" its samples: the partition shows the low-variance clusters "stealing" samples from the high-variance one, which degrades the result.
The fourth subplot shows clusters with different numbers of samples. The centers tend to drift toward the direction with more samples, so some points of the large cluster are assigned to the wrong cluster (in the figure, one true cluster actually ends up containing two centers, which splits off a large fraction of its samples).
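On choosing K: a common, if rough, heuristic is the elbow method, which plots the inertia against K and looks for the bend where adding more clusters stops paying off. A minimal sketch, reusing X and the imports from the code above:

inertias = [KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_
            for k in range(1, 9)]
plt.plot(range(1, 9), inertias, 'o-')
plt.xlabel('K')
plt.ylabel('inertia')
plt.show()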
Besides the four cases above, k-means has another limitation: the curse of dimensionality. High dimensionality is a thorny problem for almost every machine learning algorithm, and all the more so for k-means, whose notion of similarity is defined through distances. When clustering high-dimensional data, it is therefore common to first apply a suitable dimensionality-reduction method (such as PCA) and then cluster the reduced data.
Below is a clustering task on the handwritten digits dataset, where we first reduce the dimensionality with PCA and then cluster with k-means:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py
from time import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300

print("n_digits: %d, \t n_samples %d, \t n_features %d"
      % (n_digits, n_samples, n_features))

print(82 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')


def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_,
                                                average_method='arithmetic'),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)

# In this case the seeding of the centers is deterministic, hence we run the
# k-means algorithm only once with n_init=1.
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based",
              data=data)
print(82 * '_')

# #############################################################################
# Visualize the results on PCA-reduced data
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02  # point in the mesh [x_min, x_max] x [y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in the mesh. Use the last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot.
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X.
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Output:
n_digits: 10, 	 n_samples 1797, 	 n_features 64
__________________________________________________________________________________
init        time    inertia  homo   compl  v-meas  ARI    AMI    silhouette
k-means++   0.55s   69432    0.602  0.650  0.625   0.465  0.621  0.146
random      0.40s   69694    0.669  0.710  0.689   0.553  0.686  0.147
PCA-based   0.09s   70804    0.671  0.698  0.684   0.561  0.681  0.118
__________________________________________________________________________________
2. Gaussian mixture model
Here we train GMMs with each covariance type on the iris dataset, plot the predicted labels for the training data, show the held-out test data as well, and compare the diagonal, spherical, tied, and full covariance types in increasing order of performance.
Although one would expect full covariance to perform best in general, it is prone to overfitting small datasets and does not generalize well to held-out test data. (Here 'diag' means each component has its own diagonal covariance matrix; 'spherical' means each component has its own single variance; 'tied' means all components share the same general covariance matrix; 'full' means each component has its own general covariance matrix.)
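A quick way to see what these four options mean in scikit-learn is to inspect the shape of the fitted covariances_ attribute. The snippet below is a sketch on random data (not part of the original example), using 3 components in 4 dimensions:

import numpy as np
from sklearn.mixture import GaussianMixture

X_demo = np.random.RandomState(0).randn(200, 4)
for cov_type in ['spherical', 'diag', 'tied', 'full']:
    gm = GaussianMixture(n_components=3, covariance_type=cov_type,
                         random_state=0).fit(X_demo)
    # spherical -> (3,), diag -> (3, 4), tied -> (4, 4), full -> (3, 4, 4)
    print(cov_type, np.shape(gm.covariances_))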
https://scikit-learn.org/stable/modules/mixture.html#mixture
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold

colors = ['navy', 'turquoise', 'darkorange']


def make_ellipses(gmm, ax):
    for n, color in enumerate(colors):
        if gmm.covariance_type == 'full':
            covariances = gmm.covariances_[n][:2, :2]
        elif gmm.covariance_type == 'tied':
            covariances = gmm.covariances_[:2, :2]
        elif gmm.covariance_type == 'diag':
            covariances = np.diag(gmm.covariances_[n][:2])
        elif gmm.covariance_type == 'spherical':
            covariances = np.eye(gmm.means_.shape[1]) * gmm.covariances_[n]
        v, w = np.linalg.eigh(covariances)
        u = w[0] / np.linalg.norm(w[0])
        angle = np.arctan2(u[1], u[0])
        angle = 180 * angle / np.pi  # convert to degrees
        v = 2. * np.sqrt(2.) * np.sqrt(v)
        ell = mpl.patches.Ellipse(gmm.means_[n, :2], v[0], v[1],
                                  180 + angle, color=color)
        ell.set_clip_box(ax.bbox)
        ell.set_alpha(0.5)
        ax.add_artist(ell)
    ax.set_aspect('equal', 'datalim')

iris = datasets.load_iris()

# Break up the dataset into non-overlapping training (75%) and testing
# (25%) sets.
skf = StratifiedKFold(n_splits=4)
# Only take the first fold.
train_index, test_index = next(iter(skf.split(iris.data, iris.target)))

X_train = iris.data[train_index]
y_train = iris.target[train_index]
X_test = iris.data[test_index]
y_test = iris.target[test_index]

n_classes = len(np.unique(y_train))

# Try GMMs using different types of covariances.
estimators = {cov_type: GaussianMixture(n_components=n_classes,
              covariance_type=cov_type, max_iter=20, random_state=0)
              for cov_type in ['spherical', 'diag', 'tied', 'full']}

n_estimators = len(estimators)

plt.figure(figsize=(3 * n_estimators // 2, 6))
plt.subplots_adjust(bottom=.01, top=0.95, hspace=.15, wspace=.05,
                    left=.01, right=.99)

for index, (name, estimator) in enumerate(estimators.items()):
    # Since we have class labels for the training data, we can
    # initialize the GMM parameters in a supervised manner.
    estimator.means_init = np.array([X_train[y_train == i].mean(axis=0)
                                     for i in range(n_classes)])

    # Train the other parameters using the EM algorithm.
    estimator.fit(X_train)

    h = plt.subplot(2, n_estimators // 2, index + 1)
    make_ellipses(estimator, h)

    for n, color in enumerate(colors):
        data = iris.data[iris.target == n]
        plt.scatter(data[:, 0], data[:, 1], s=0.8, color=color,
                    label=iris.target_names[n])
    # Plot the test data with crosses.
    for n, color in enumerate(colors):
        data = X_test[y_test == n]
        plt.scatter(data[:, 0], data[:, 1], marker='x', color=color)

    y_train_pred = estimator.predict(X_train)
    train_accuracy = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
    plt.text(0.05, 0.9, 'Train accuracy: %.1f' % train_accuracy,
             transform=h.transAxes)

    y_test_pred = estimator.predict(X_test)
    test_accuracy = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
    plt.text(0.05, 0.8, 'Test accuracy: %.1f' % test_accuracy,
             transform=h.transAxes)

    plt.xticks(())
    plt.yticks(())
    plt.title(name)

plt.legend(scatterpoints=1, loc='lower right', prop=dict(size=12))

plt.show()
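As an aside, just as the choice of K matters for k-means, the number of mixture components can be selected with an information criterion; GaussianMixture exposes a bic() method for this. A hedged sketch, reusing X_train from above (the range of candidate counts and the 'full' covariance are arbitrary choices for illustration):

bic_scores = {n: GaussianMixture(n_components=n, covariance_type='full',
                                 random_state=0).fit(X_train).bic(X_train)
              for n in range(1, 7)}
print(min(bic_scores, key=bic_scores.get))  # component count with the lowest BIC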
3. DBSCAN
The DBSCAN algorithm views clusters as high-density regions separated by regions of low density. Unlike k-means, DBSCAN does not need the number of clusters specified up front: it infers the number of clusters from the data, and it can find clusters of arbitrary shape.
DBSCAN first requires two parameters:
- epsilon: the radius of the neighborhood around a point
- minPts: the minimum number of points a neighborhood must contain
Based on these two parameters and the notion of the epsilon-neighborhood NBHD(p, epsilon), the points in a sample fall into three classes (a small classification sketch follows this list):
- Core point: a point p with |NBHD(p, epsilon)| >= minPts
- Border point: a point p with |NBHD(p, epsilon)| < minPts that nevertheless lies in the neighborhood of some core point (it is directly density-reachable, or density-reachable through a chain of core points)
- Outlier: a point that is neither a core point nor a border point; it belongs to no cluster
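To make the three classes concrete, here is a small brute-force sketch (illustrative, not the scikit-learn implementation; it assumes Euclidean distance and, like DBSCAN, counts a point as its own neighbor):

import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = d <= eps  # epsilon-neighborhoods (self included)
    is_core = neighbors.sum(axis=1) >= min_pts
    # Border: not core, but inside the neighborhood of at least one core point.
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    is_outlier = ~is_core & ~is_border
    return is_core, is_border, is_outlier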
Now let us look at how DBSCAN performs on an irregularly shaped dataset, in contrast with k-means (a k-means run on the same data is sketched after the DBSCAN code):
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.3,
                            random_state=12)

X = StandardScaler().fit_transform(X)
# Stretch the blobs so the clusters are no longer spherical.
transformation = [[0.80834549, -0.43667341], [-0.40887718, 0.85253229]]
X = np.dot(X, transformation)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels,
                                           average_method='arithmetic'))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core samples are drawn with large markers...
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    # ...and non-core (border) samples of the same cluster with small ones.
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
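For the contrast promised above, one can run k-means on the same stretched data; a sketch (the straight k-means boundaries tend to cut across the elongated blobs, whereas DBSCAN follows the density):

from sklearn.cluster import KMeans

km_labels = KMeans(n_clusters=3, random_state=12).fit_predict(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=km_labels, s=6)
plt.title('K-means on the same data (for contrast)')
plt.show()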