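Example 1. Data on 13 chemical constituents were collected for 178 Italian wines. Perform a cluster analysis of the wines.
Data: the wine dataset from the rattle package
The code is as follows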
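library(rattle)
data(wine,package="rattle")
head(wine)
df=scale(wine[-1]) #standardize the 13 variables (the Type column is dropped)
d=dist(df) #compute the Euclidean distance matrix
fit.ward=hclust(d,method="ward.D") #hierarchical clustering; "ward.D" is the current name of the old "ward" method
plot(fit.ward) #dendrogram
clusters=cutree(fit.ward,k=3) #cut the dendrogram into three clusters
table(clusters) #hierarchical cluster sizes
aggregate(df,by=list(cluster=clusters),median) #per-cluster medians of the standardized variables
rect.hclust(fit.ward,k=3) #overlay the three-cluster solution on the dendrogram
set.seed(1234)
fit.km=kmeans(df,3,nstart=20) #k-means with 3 clusters and 20 random starts
fit.km$size #k-means cluster sizes
fit.km$centers
fit.km$cluster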
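R versions after 3.0.3 accept two Ward variants in hclust(): "ward.D", which is the old "ward" option under a new name, and "ward.D2", which squares the dissimilarities before updating clusters and is the variant that implements Ward's original criterion. As a minimal sketch, assuming the wine data can be loaded from rattle exactly as in the listing above, the two variants can be compared on the same distance matrix. The names fit.d, fit.d2 and cmp are illustrative and not part of the original example.

```r
# Sketch: compare the two Ward variants on the standardized wine data.
# Assumes the rattle package is installed and provides the wine dataset.
library(rattle)
data(wine, package = "rattle")

df <- scale(wine[-1])          # standardize, dropping the Type column
d  <- dist(df)                 # Euclidean distances

fit.d  <- hclust(d, method = "ward.D")   # same algorithm as the old "ward"
fit.d2 <- hclust(d, method = "ward.D2")  # squares distances before updating

# Cross-tabulate the three-cluster assignments of the two variants.
cmp <- table(ward.D  = cutree(fit.d,  k = 3),
             ward.D2 = cutree(fit.d2, k = 3))
print(cmp)
```

Cluster numbers are arbitrary labels, so agreement between the two variants shows up as one dominant cell per row rather than as a filled diagonal.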
Output
> library(rattle)
package ‘rattle’ was built under R version 3.4.4
> data(wine,package="rattle")
> head(wine)
  Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
1    1   14.23  1.71 2.43       15.6       127    2.80       3.06          0.28            2.29  5.64 1.04     3.92    1065
2    1   13.20  1.78 2.14       11.2       100    2.65       2.76          0.26            1.28  4.38 1.05     3.40    1050
3    1   13.16  2.36 2.67       18.6       101    2.80       3.24          0.30            2.81  5.68 1.03     3.17    1185
4    1   14.37  1.95 2.50       16.8       113    3.85       3.49          0.24            2.18  7.80 0.86     3.45    1480
5    1   13.24  2.59 2.87       21.0       118    2.80       2.69          0.39            1.82  4.32 1.04     2.93     735
6    1   14.20  1.76 2.45       15.2       112    3.27       3.39          0.34            1.97  6.75 1.05     2.85    1450
> df=scale(wine[-1]) #standardize the data
> d=dist(df) #compute the distance matrix
> fit.ward=hclust(d,method="ward") #hierarchical clustering
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
> plot(fit.ward) #dendrogram
> clusters=cutree(fit.ward,k=3) #cut the dendrogram into three clusters
> table(clusters) #hierarchical cluster sizes
clusters
 1  2  3
65 59 54
> aggregate(df,by=list(cluster=clusters),median) #descriptive statistics
  cluster    Alcohol      Malic         Ash Alcalinity  Magnesium    Phenols Flavanoids Nonflavanoids Proanthocyanins
1       1  0.8984457 -0.5248627  0.30430096 -0.7470867  0.4381890  0.8067217  0.8917481    -0.5773564      0.57499088
2       2 -0.9615576 -0.6054251 -0.49761194  0.1512342 -0.8220960 -0.1519728  0.0107426    -0.1755994     -0.05398515
3       3  0.1162589  0.8178444  0.03092156  0.4506745 -0.1569456 -1.0307762 -1.3057599     0.9091445     -0.89261984
        Color        Hue   Dilution    Proline
1  0.07846751  0.4924084  0.7722845  0.9942817
2 -0.97403427  0.3611585  0.3638283 -0.8475291
3  0.88078444 -1.1700906 -1.3122506 -0.3870764
> rect.hclust(fit.ward,k=3) #overlay the three-cluster solution on the dendrogram
> set.seed(1234)
> fit.km=kmeans(df,3,nstart=20)
> fit.km$size # k-means cluster sizes
[1] 62 65 51
> fit.km$centers
     Alcohol      Malic        Ash Alcalinity   Magnesium     Phenols  Flavanoids Nonflavanoids Proanthocyanins      Color
1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724  0.97506900   -0.56050853      0.57865427  0.1705823
2 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891  0.02075402   -0.03343924      0.05810161 -0.8993770
3  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548 -1.21182921    0.72402116     -0.77751312  0.9388902
         Hue   Dilution    Proline
1  0.4726504  0.7770551  1.1220202
2  0.4605046  0.2700025 -0.7517257
3 -1.1615122 -1.2887761 -0.4059428
> fit.km$cluster
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
[62] 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 1
[123] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
The final plot is shown below
Rplot03.png
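- With hierarchical clustering (Ward's method), the dendrogram can be cut into three clusters of sizes 65, 59, and 54.
- With k-means and k = 3, the cluster sizes are 62, 65, and 51.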
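Cluster sizes alone do not show whether the two methods group the same wines. The first column of wine, Type, is a label that was left out of the clustering, so one way to check is to cross-tabulate each solution against Type and against the other solution. The following is a minimal sketch that rebuilds the objects from the example so it can run on its own. It assumes rattle provides the wine data as above, and the names ct.type.hc, ct.type.km and ct.hc.km are illustrative additions.

```r
# Sketch: compare the hierarchical and k-means solutions with each other
# and with the Type column that was excluded from clustering.
# Assumes the rattle package is installed and provides the wine dataset.
library(rattle)
data(wine, package = "rattle")

df <- scale(wine[-1])                              # standardized variables
fit.ward <- hclust(dist(df), method = "ward.D")    # "ward.D" = old "ward"
clusters <- cutree(fit.ward, k = 3)                # hierarchical assignments

set.seed(1234)
fit.km <- kmeans(df, 3, nstart = 20)               # k-means assignments

# Rows: known Type; columns: cluster label from each method.
ct.type.hc <- table(Type = wine$Type, hclust = clusters)
ct.type.km <- table(Type = wine$Type, kmeans = fit.km$cluster)

# Rows: hierarchical cluster; columns: k-means cluster.
ct.hc.km <- table(hclust = clusters, kmeans = fit.km$cluster)

print(ct.type.hc)
print(ct.type.km)
print(ct.hc.km)
```

Because cluster numbers are arbitrary, good agreement appears as one large count in each row, whichever column it falls in. Small off-pattern counts mark the wines on which the methods, or the methods and Type, disagree.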