Hierarchical Agglomerative Clustering (HAC)
One variant is complete-link clustering: unlike single-link clustering, the distance between two clusters is taken as the distance between their farthest points rather than their closest points.
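A minimal sketch of the two distance definitions (an addition, not in the original notes), using two made-up toy clusters cluster_a and cluster_b:
import numpy as np
from scipy.spatial.distance import cdist
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])
pairwise = cdist(cluster_a, cluster_b)   # all point-to-point distances
single_link = pairwise.min()             # closest pair of points: 2.0
complete_link = pairwise.max()           # farthest pair of points: 5.0
print(single_link, complete_link)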
Advantages
- The resulting hierarchy helps with visualizing the data
- Well suited to certain datasets and domains
Disadvantages
- Sensitive to outliers and noise
- Computationally heavy, at least O(n^2)
Using sklearn
- Agglomerative clustering example
from sklearn import datasets, cluster
# first 10 samples of the iris dataset
X = datasets.load_iris().data[:10]
# Ward-linkage agglomerative clustering into 3 clusters
clust = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = clust.fit_predict(X)
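As an added aside (not in the original notes), the fitted estimator also exposes the assignments and the merge tree it built:
print(labels)            # cluster index for each of the 10 samples
print(clust.labels_)     # same assignment, stored on the fitted estimator
print(clust.children_)   # merge tree; each row is one merge of two nodes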
- Drawing a dendrogram
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn import datasets
import matplotlib.pyplot as plt
# first 10 samples of the iris dataset
X = datasets.load_iris().data[:10]
# Ward-linkage matrix describing the hierarchy
linkage_matrix = ward(X)
dendrogram(linkage_matrix)
plt.show()
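The same linkage matrix can also be cut into flat clusters without drawing anything; the snippet below uses scipy.cluster.hierarchy.fcluster and is an added sketch, not part of the original notes:
from scipy.cluster.hierarchy import fcluster
# cut the tree so that at most 3 flat clusters remain
flat_labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print(flat_labels)   # cluster id (1-3) for each of the 10 samples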
- The three linkage options for hierarchical clustering
from sklearn import datasets
iris = datasets.load_iris()
from sklearn.cluster import AgglomerativeClustering
# linkage="ward" is the default
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(iris.data)
complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
complete_pred = complete.fit_predict(iris.data)
avg = AgglomerativeClustering(n_clusters=3, linkage="average")
avg_pred = avg.fit_predict(iris.data)
# compare each clustering against the true species labels
from sklearn.metrics import adjusted_rand_score
ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
avg_ar_score = adjusted_rand_score(iris.target, avg_pred)
Scores:
Ward: 0.731198556771
Complete: 0.642251251836
Average: 0.759198707107
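The values under "Scores" above come from the three variables just computed; the print statements below are an added sketch (only the numbers appear in the original notes). adjusted_rand_score is 1.0 for a perfect match with the true labels and close to 0 for random labelling.
print("Ward:", ward_ar_score)
print("Complete:", complete_ar_score)
print("Average:", avg_ar_score)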
# normalize the data
from sklearn import preprocessing
normalized_X = preprocessing.normalize(iris.data)
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(normalized_X)
complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
complete_pred = complete.fit_predict(normalized_X)
avg = AgglomerativeClustering(n_clusters=3, linkage="average")
avg_pred = avg.fit_predict(normalized_X)
ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
avg_ar_score = adjusted_rand_score(iris.target, avg_pred)
Scores:
Ward: 0.885697031028
Complete: 0.644447235392
Average: 0.558371443754
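Note that preprocessing.normalize rescales each sample (row) to unit L2 norm rather than standardizing each feature; the comparison with StandardScaler below is an added sketch, not part of the original notes:
import numpy as np
from sklearn.preprocessing import StandardScaler
# normalize: every row of normalized_X has unit L2 norm
print(np.linalg.norm(normalized_X, axis=1)[:3])
# StandardScaler instead gives each feature (column) mean 0 and std 1
standardized_X = StandardScaler().fit_transform(iris.data)
print(standardized_X.mean(axis=0).round(2))
print(standardized_X.std(axis=0).round(2))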