Biostatistics Notes Written for My Girlfriend, Part 19

Author: 城管大队哈队长 | Published 2019-06-19 21:41

Clustering

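Clustering means assigning samples to groups so that samples within the same group differ as little as possible, while samples in different groups differ as much as possible (a pretty vague definition, I know ╮（╯＿╰）╭). I'm not sure what an exam could actually ask about clustering.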

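The slides and the homework seem to cover hierarchical clustering, which R provides through the hclust function. One thing to keep in mind is that hclust supports several agglomeration methods; if a question asks for one, just use whatever the question specifies.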

method  
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).

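The input to hclust is the set of pairwise distances between samples, which the dist function computes. dist can use different distance measures, such as Euclidean distance or Manhattan distance.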

method  
the distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given.
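
As a small illustration of how the choice of distance measure changes the result, here is a sketch with two made-up points, (0, 0) and (3, 4). The points and variable names are only for illustration and do not come from the homework.

```r
# Two hypothetical points, one per row
pts <- rbind(a = c(0, 0), b = c(3, 4))

d_euclidean <- dist(pts, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
d_manhattan <- dist(pts, method = "manhattan")  # |3| + |4| = 7
d_maximum   <- dist(pts, method = "maximum")    # max(|3|, |4|) = 4

print(c(euclidean = as.numeric(d_euclidean),
        manhattan = as.numeric(d_manhattan),
        maximum   = as.numeric(d_maximum)))
```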

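Plotting is then just a call to plot.

An example from the homework: use the petal attribute values of the widely used iris data to perform a simple hierarchical clustering.

# Prepare the data: column 5 of iris is the species, so keep only columns 1 to 4
# (these are the four measurement columns, sepal as well as petal)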
> dat <- iris[,1:4]
> head(dat)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

# Compute the distances with dist
> dat_dist <- dist(dat)

# Pass the distances to hclust, then plot the dendrogram
> plot(hclust(dat_dist))
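
If a question asks for a particular distance or linkage, or for explicit group labels, something like the following sketch should work. The choice of Manhattan distance, average linkage and k = 3 is only an assumption for illustration, not part of the homework answer; cutree is the base R function that cuts the tree into a given number of groups.

```r
dat <- iris[, 1:4]

# Manhattan distance with average linkage (UPGMA), instead of the defaults
# (Euclidean distance and complete linkage)
hc <- hclust(dist(dat, method = "manhattan"), method = "average")
plot(hc, labels = FALSE)

# Cut the tree into 3 groups and cross-tabulate them with the known species
groups <- cutree(hc, k = 3)
print(table(groups, iris$Species))
```

The table shows how well the clusters line up with the three species, which is a quick sanity check for a clustering result.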

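Clustering is covered in Chapter 16 of R in Action (《R语言实战》), 2nd edition.

That chapter also mentions scaling the data, which is worth a brief look, since I think a past exam question covered it. The code comes from p. 343.

# Scaling is done within each variable, independently of the other variables. Each column is one variable, so the margin passed to apply is 2

> data <- data.frame(x1 = 100:105, x2 = 0:5)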
> data
   x1 x2
1 100  0
2 101  1
3 102  2
4 103  3
5 104  4
6 105  5

# Standardize each variable to mean 0 and standard deviation 1
> apply(data, 2, function(x){(x-mean(x))/sd(x)})
             x1         x2
[1,] -1.3363062 -1.3363062
[2,] -0.8017837 -0.8017837
[3,] -0.2672612 -0.2672612
[4,]  0.2672612  0.2672612
[5,]  0.8017837  0.8017837
[6,]  1.3363062  1.3363062

# Divide each variable by its maximum value
> apply(data, 2, function(x){x/max(x)})
            x1  x2
[1,] 0.9523810 0.0
[2,] 0.9619048 0.2
[3,] 0.9714286 0.4
[4,] 0.9809524 0.6
[5,] 0.9904762 0.8
[6,] 1.0000000 1.0

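# Subtract the mean from each variable and divide by its median absolute deviation (R's mad() is the median, not the mean, absolute deviation, multiplied by 1.4826 by default)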
> apply(data, 2, function(x){(x - mean(x)) / mad(x)})
             x1         x2
[1,] -1.1241513 -1.1241513
[2,] -0.6744908 -0.6744908
[3,] -0.2248303 -0.2248303
[4,]  0.2248303  0.2248303
[5,]  0.6744908  0.6744908
[6,]  1.1241513  1.1241513

# The first method can also be done with scale
> scale(data)
             x1         x2
[1,] -1.3363062 -1.3363062
[2,] -0.8017837 -0.8017837
[3,] -0.2672612 -0.2672612
[4,]  0.2672612  0.2672612
[5,]  0.8017837  0.8017837
[6,]  1.3363062  1.3363062
attr(,"scaled:center")
   x1    x2 
102.5   2.5 
attr(,"scaled:scale")
      x1       x2 
1.870829 1.870829
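
To double-check the two points above, here is a short sketch: it confirms that the hand-written standardization matches scale (ignoring the extra attributes scale attaches), and it recomputes mad by hand as 1.4826 times the median absolute deviation from the median. The variable names by_hand, same_as_scale and mad_by_hand are made up for this check.

```r
data <- data.frame(x1 = 100:105, x2 = 0:5)

# Standardize by hand and compare with scale(), ignoring attributes
by_hand <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
same_as_scale <- isTRUE(all.equal(by_hand, scale(data), check.attributes = FALSE))
print(same_as_scale)

# mad() is the median absolute deviation times 1.4826 (its default constant)
x <- data$x2
mad_by_hand <- 1.4826 * median(abs(x - median(x)))
print(c(mad = mad(x), by_hand = mad_by_hand))
```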

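Principal component analysis

For PCA, I'll use one example to walk through the questions we might be asked (I don't really understand PCA that well myself, so I'm following the homework answers).

Running PCA on the iris data

Run the principal component analysis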

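# Column 5 of iris is the species name, so drop it before running PCA
# Remember cor = TRUE: princomp then works from the correlation matrix,
# which is equivalent to standardizing each variable first, so variables on larger scales do not dominate
> iris_pca <- princomp(iris[,1:4], cor = TRUE)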
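
A quick way to see what cor = TRUE does is to compare the component variances with the eigenvalues of the correlation matrix. This is only a sketch to build intuition; the names X, pca_cor, pca_cov and eig are introduced here for illustration.

```r
X <- iris[, 1:4]
pca_cor <- princomp(X, cor = TRUE)

# With cor = TRUE the component variances are the eigenvalues
# of the correlation matrix of the variables
eig <- eigen(cor(X))$values
print(rbind(princomp = unname(pca_cor$sdev^2), eigen = eig))

# For comparison: without cor = TRUE the covariance matrix is used,
# so the proportions of variance depend on the units of each variable
pca_cov <- princomp(X)
print(round(pca_cov$sdev^2 / sum(pca_cov$sdev^2), 3))
```

Because each standardized variable has variance 1, the eigenvalues of the correlation matrix add up to the number of variables, here 4.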

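How much variance each principal component explains

# Summary of the principal components
# The "Proportion of Variance" row gives the share of the variance that each component explains
# The "Cumulative Proportion" row gives the share explained by the first few components together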

> summary(iris_pca)
Importance of components:
                          Comp.1    Comp.2     Comp.3      Comp.4
Standard deviation     1.7083611 0.9560494 0.38308860 0.143926497
Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
Cumulative Proportion  0.7296245 0.9581321 0.99482129 1.000000000
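
The two proportion rows in this summary can be reproduced from the standard deviations alone, which may help if an exam question only gives the first row. A minimal sketch (the names variances, proportion and cumulative are made up here):

```r
iris_pca <- princomp(iris[, 1:4], cor = TRUE)

# Variance of each component is its standard deviation squared
variances  <- iris_pca$sdev^2
proportion <- variances / sum(variances)  # "Proportion of Variance"
cumulative <- cumsum(proportion)          # "Cumulative Proportion"

print(round(rbind(proportion, cumulative), 4))
```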

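Which variables are explained by PC1

# Look at the loadings: every variable has a sizeable loading on PC1, so all of them should be explained by PC1
# On PC2 the loadings of Petal.Length and Petal.Width are left blank: they exist but are small (the print method hides absolute values below 0.1), so PC2 should not explain these two variables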

> iris_pca$loadings

Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.521  0.377  0.720  0.261
Sepal.Width  -0.269  0.923 -0.244 -0.124
Petal.Length  0.580        -0.142 -0.801
Petal.Width   0.565        -0.634  0.524

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00
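
The blank cells come from the print method for loadings, which by default hides entries with absolute value below 0.1. A small sketch to show the hidden values, using the cutoff argument of that print method (the name small is made up here):

```r
iris_pca <- princomp(iris[, 1:4], cor = TRUE)

# cutoff = 0 prints every loading, including the small ones
print(iris_pca$loadings, cutoff = 0)

# The two loadings that were hidden in the Comp.2 column
small <- abs(unclass(iris_pca$loadings)[c("Petal.Length", "Petal.Width"), "Comp.2"])
print(small)
```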

The data after dimensionality reduction

> head(iris_pca$scores)
        Comp.1     Comp.2      Comp.3      Comp.4
[1,] -2.264703  0.4800266  0.12770602  0.02416820
[2,] -2.080961 -0.6741336  0.23460885  0.10300677
[3,] -2.364229 -0.3419080 -0.04420148  0.02837705
[4,] -2.299384 -0.5973945 -0.09129011 -0.06595556
[5,] -2.389842  0.6468354 -0.01573820 -0.03592281
[6,] -2.075631  1.4891775 -0.02696829  0.00660818

# Keep only the data projected onto PC1 and PC2
> head(iris_pca$scores[,1:2])
        Comp.1     Comp.2
[1,] -2.264703  0.4800266
[2,] -2.080961 -0.6741336
[3,] -2.364229 -0.3419080
[4,] -2.299384 -0.5973945
[5,] -2.389842  0.6468354
[6,] -2.075631  1.4891775
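
To see where these scores come from, the sketch below rebuilds them by hand: standardize each variable, then multiply by the loadings. One detail worth flagging is that princomp standardizes with the divisor-n standard deviation rather than the n - 1 version that sd returns, so the code rescales sd accordingly. The names X, sd_n, Z and manual are introduced only for this check.

```r
X <- as.matrix(iris[, 1:4])
iris_pca <- princomp(X, cor = TRUE)

# princomp uses the divisor-n standard deviation, so rescale sd()
n    <- nrow(X)
sd_n <- apply(X, 2, sd) * sqrt((n - 1) / n)

# Center and scale each column, then project onto the loadings
Z      <- sweep(sweep(X, 2, colMeans(X)), 2, sd_n, "/")
manual <- Z %*% unclass(iris_pca$loadings)

print(all.equal(manual, iris_pca$scores, check.attributes = FALSE))

# Scatter plot of the first two components, colored by species
plot(iris_pca$scores[, 1:2], col = iris$Species)
```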

Write down PC1 (as a vector)

# Again, read it off the loadings
> iris_pca$loadings

Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.521  0.377  0.720  0.261
Sepal.Width  -0.269  0.923 -0.244 -0.124
Petal.Length  0.580        -0.142 -0.801
Petal.Width   0.565        -0.634  0.524

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00

PC1: (0.521, -0.269, 0.580, 0.565)
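
The same vector can be pulled out of the fitted object directly instead of being copied from the printed table. A minimal sketch (pc1 is a name made up here); note that the overall sign of a principal component is arbitrary, so another machine could in principle report every entry with the opposite sign.

```r
iris_pca <- princomp(iris[, 1:4], cor = TRUE)

# First column of the loadings matrix = PC1 as a vector
pc1 <- iris_pca$loadings[, 1]
print(round(pc1, 3))

# Each loading vector has unit length
print(sum(pc1^2))
```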

Article link: https://www.haomeiwen.com/subject/yjilqctx.html