A line from Sherlock Holmes goes, "Brainy is the new sexy." After working through the derivation of PCA, I feel statistics hit me with a dimensionality-reduction strike, so let me adapt the line: statistics is the new sexy!
Understanding PCA
PCA exists to display high-dimensional data more effectively: through a linear transformation it extracts the principal components that retain the most information, spreading the samples out as much as possible along the new axes for visualization.
- PCA's underlying assumption: the larger the variance, the more information a direction carries.
- Take bioinformatics data as an example: most of the time we want to see how the samples group, so the points plotted on the axes are samples, and what we reduce away is the gene-level dimensions. From n genes we obtain n components, and by the maximum-variance criterion we keep the top k (0 < k < n) components.
- In essence, we compute n eigenvectors, which give the matrix n directions to project along, and keep k of the projected results.
PCA steps:
1) Start from the m-row, n-column raw matrix (samples as rows, genes as columns).
2) Subtract from every element of matrix X the mean of its column (centering).
The point is to make the offsets in every dimension relative to 0 (we must be clear about, and distinguish, the individuals (samples) and the observations (genes) in the data).
3) Compute the covariance matrix.
The covariance matrix is symmetric: the diagonal holds the variance of each variable, and the off-diagonal entries are the covariances between pairs of variables, which express their linear correlation.
4) The goal is to drive the off-diagonal entries of the covariance matrix to 0, i.e. to diagonalize it.
The covariance matrix is diagonalizable; in the diagonalized matrix the off-diagonal entries are 0 while the diagonal holds the eigenvalues, so the total variance is preserved but the transformed variables are mutually uncorrelated. Eigendecomposition of the covariance matrix yields P.
5) Sort the rows of P by decreasing eigenvalue. Since Y = PX, multiplying the transposed eigenvector matrix by the transposed centered matrix gives the new sample matrix; its first two rows are PC1 and PC2.
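Before the full decathlon walk-through below, the five steps can be sketched end-to-end on a small hypothetical matrix (toy numbers, not real data):

```r
# Hypothetical toy data: 4 samples (rows) x 3 variables (columns)
X <- matrix(c(2.5, 2.4, 0.5,
              0.7, 1.5, 1.1,
              2.2, 2.9, 3.0,
              1.9, 2.2, 2.7), nrow = 4, byrow = TRUE)

Xc <- scale(X, center = TRUE, scale = FALSE)  # step 2: subtract column means
C  <- cov(Xc)                                 # step 3: covariance matrix
e  <- eigen(C)                                # step 4: eigendecomposition
Y  <- Xc %*% e$vectors                        # step 5: project; column 1 is PC1

# The projected variables are uncorrelated: cov(Y) is (numerically) diagonal,
# with the eigenvalues on the diagonal
round(cov(Y), 10)
```

Because eigen() already returns eigenvalues in decreasing order, the first column of Y is PC1 and the second is PC2.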
Let's briefly reproduce where the new sample matrix comes from using R base functions, to understand PCA better.
1. The raw matrix:
> rm(list=ls())
> library("FactoMineR")
> library("factoextra")
> data("decathlon2")
> decathlon2.active <- decathlon2[1:23, 1:10]
> decathlon2.active
X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75
CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26
ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17 45.67
McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38 44.41
MARTINEAU 11.64 6.81 14.57 1.95 50.14 14.93 47.60
HERNU 11.37 7.56 14.41 1.86 51.10 15.06 44.99
BARRAS 11.33 6.97 14.09 1.95 49.48 14.48 42.10
NOOL 11.33 7.27 12.68 1.98 49.20 15.29 37.92
BOURGUIGNON 11.36 6.80 13.46 1.86 51.16 15.67 40.49
Sebrle 10.85 7.84 16.36 2.12 48.36 14.05 48.72
Clay 10.44 7.96 15.23 2.06 49.19 14.13 50.11
Karpov 10.50 7.81 15.93 2.09 46.81 13.97 51.65
Macey 10.89 7.47 15.73 2.15 48.97 14.56 48.34
Warners 10.62 7.74 14.48 1.97 47.97 14.01 43.73
Zsivoczky 10.91 7.14 15.31 2.12 49.40 14.95 45.62
Hernu 10.97 7.19 14.65 2.03 48.73 14.25 44.72
Bernard 10.69 7.48 14.80 2.12 49.13 14.17 44.75
Schwarzl 10.98 7.49 14.01 1.94 49.76 14.25 42.43
Pogorelov 10.95 7.31 15.10 2.06 50.79 14.21 44.60
Schoenbeck 10.90 7.30 14.77 1.88 50.30 14.34 44.41
Barras 11.14 6.99 14.91 1.94 49.41 14.37 44.83
Pole.vault Javeline X1500m
SEBRLE 5.02 63.19 291.70
CLAY 4.92 60.15 301.50
BERNARD 5.32 62.77 280.10
YURKOV 4.72 63.44 276.40
ZSIVOCZKY 4.42 55.37 268.00
McMULLEN 4.42 56.37 285.10
MARTINEAU 4.92 52.33 262.10
HERNU 4.82 57.19 285.10
BARRAS 4.72 55.40 282.00
NOOL 4.62 57.44 266.60
BOURGUIGNON 5.02 54.68 291.70
Sebrle 5.00 70.52 280.01
Clay 4.90 69.71 282.00
Karpov 4.60 55.54 278.11
Macey 4.40 58.46 265.42
Warners 4.90 55.39 278.05
Zsivoczky 4.70 63.45 269.54
Hernu 4.80 57.76 264.35
Bernard 4.40 55.27 276.31
Schwarzl 5.10 56.32 273.56
Pogorelov 5.00 53.45 287.63
Schoenbeck 5.00 60.89 278.82
Barras 4.60 64.55 267.09
2. The result of centering the raw matrix (column-wise):
> #######R_base: covariance | eigenvalues | eigenvectors
> #tmp<-rowMeans(t(decathlon2.active))
> #center_deca<-t(decathlon2.active)-tmp
> center_d<-scale(decathlon2.active,center=TRUE,scale=FALSE)
> center_d
X100m Long.jump Shot.put High.jump X400m
SEBRLE 0.04043478 0.23043478 0.21 0.062608696 0.37695652
CLAY -0.23956522 0.05043478 -0.36 -0.147391304 -0.06304348
BERNARD 0.02043478 -0.11956522 -0.37 -0.087391304 -0.50304348
YURKOV 0.34043478 -0.25956522 0.57 0.092608696 0.98695652
ZSIVOCZKY 0.13043478 -0.04956522 -1.14 0.002608696 -0.81304348
McMULLEN -0.16956522 -0.03956522 -0.86 0.122608696 0.47695652
MARTINEAU 0.64043478 -0.53956522 -0.05 -0.057391304 0.70695652
HERNU 0.37043478 0.21043478 -0.21 -0.147391304 1.66695652
BARRAS 0.33043478 -0.37956522 -0.53 -0.057391304 0.04695652
NOOL 0.33043478 -0.07956522 -1.94 -0.027391304 -0.23304348
BOURGUIGNON 0.36043478 -0.54956522 -1.16 -0.147391304 1.72695652
Sebrle -0.14956522 0.49043478 1.74 0.112608696 -1.07304348
Clay -0.55956522 0.61043478 0.61 0.052608696 -0.24304348
Karpov -0.49956522 0.46043478 1.31 0.082608696 -2.62304348
Macey -0.10956522 0.12043478 1.11 0.142608696 -0.46304348
Warners -0.37956522 0.39043478 -0.14 -0.037391304 -1.46304348
Zsivoczky -0.08956522 -0.20956522 0.69 0.112608696 -0.03304348
Hernu -0.02956522 -0.15956522 0.03 0.022608696 -0.70304348
Bernard -0.30956522 0.13043478 0.18 0.112608696 -0.30304348
Schwarzl -0.01956522 0.14043478 -0.61 -0.067391304 0.32695652
Pogorelov -0.04956522 -0.03956522 0.48 0.052608696 1.35695652
Schoenbeck -0.09956522 -0.04956522 0.15 -0.127391304 0.86695652
Barras 0.14043478 -0.35956522 0.29 -0.067391304 -0.02304348
X110m.hurdle Discus Pole.vault Javeline X1500m
SEBRLE 0.15608696 -1.4104348 0.223478261 4.0752174 13.8221739
CLAY -0.48391304 5.5595652 0.123478261 1.0352174 23.6221739
BERNARD 0.45608696 -4.2904348 0.523478261 3.6552174 2.2221739
YURKOV 0.77608696 1.0995652 -0.076521739 4.3252174 -1.4778261
ZSIVOCZKY -0.36391304 0.5095652 -0.376521739 -3.7447826 -9.8778261
McMULLEN -0.15391304 -0.7504348 -0.376521739 -2.7447826 7.2221739
MARTINEAU 0.39608696 2.4395652 0.123478261 -6.7847826 -15.7778261
HERNU 0.52608696 -0.1704348 0.023478261 -1.9247826 7.2221739
BARRAS -0.05391304 -3.0604348 -0.076521739 -3.7147826 4.1221739
NOOL 0.75608696 -7.2404348 -0.176521739 -1.6747826 -11.2778261
BOURGUIGNON 1.13608696 -4.6704348 0.223478261 -4.4347826 13.8221739
Sebrle -0.48391304 3.5595652 0.203478261 11.4052174 2.1321739
Clay -0.40391304 4.9495652 0.103478261 10.5952174 4.1221739
Karpov -0.56391304 6.4895652 -0.196521739 -3.5747826 0.2321739
Macey 0.02608696 3.1795652 -0.396521739 -0.6547826 -12.4578261
Warners -0.52391304 -1.4304348 0.103478261 -3.7247826 0.1721739
Zsivoczky 0.41608696 0.4595652 -0.096521739 4.3352174 -8.3378261
Hernu -0.28391304 -0.4404348 0.003478261 -1.3547826 -13.5278261
Bernard -0.36391304 -0.4104348 -0.396521739 -3.8447826 -1.5678261
Schwarzl -0.28391304 -2.7304348 0.303478261 -2.7947826 -4.3178261
Pogorelov -0.32391304 -0.5604348 0.203478261 -5.6647826 9.7521739
Schoenbeck -0.19391304 -0.7504348 0.203478261 1.7752174 0.9421739
Barras -0.16391304 -0.3304348 -0.196521739 5.4352174 -10.7878261
attr(,"scaled:center")
X100m Long.jump Shot.put High.jump X400m X110m.hurdle
10.999565 7.349565 14.620000 2.007391 49.433043 14.533913
Discus Pole.vault Javeline X1500m
45.160435 4.796522 59.114783 277.877826
- Note that both arguments of the scale() function operate column-wise;
- center means subtracting each column's mean from its elements; in PCA this operation guarantees that the offsets in every dimension are relative to 0. In scale(), center is TRUE by default;
- scale means dividing each column's elements by the column's standard deviation, after centering has been applied; in scale(), scale is also TRUE by default.
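As a quick sanity check on this column-wise behavior, here is a minimal sketch on a hypothetical random matrix:

```r
set.seed(1)
m <- matrix(rnorm(12), nrow = 4)  # hypothetical 4 x 3 matrix

# center = TRUE subtracts each column's mean ...
centered <- scale(m, center = TRUE, scale = FALSE)
manual   <- sweep(m, 2, colMeans(m), "-")

# ... and scale = TRUE additionally divides each column by its standard deviation
standardized <- scale(m, center = TRUE, scale = TRUE)
manual_std   <- sweep(manual, 2, apply(m, 2, sd), "/")

colMeans(centered)  # every column mean is now numerically 0
```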
3. The covariance matrix of the centered matrix:
> cov_deca<-cov(center_d)
> cov_deca
X100m Long.jump Shot.put High.jump X400m
X100m 0.090686166 -0.07183202 -0.113395455 -0.011728458 0.18034684
Long.jump -0.071832016 0.09821344 0.116295455 0.010353360 -0.16100771
Shot.put -0.113395455 0.11629545 0.713518182 0.043518182 -0.26718636
High.jump -0.011728458 0.01035336 0.043518182 0.009347431 -0.03579170
X400m 0.180346838 -0.16100771 -0.267186364 -0.035791700 1.01420395
X110m.hurdle 0.106465415 -0.09000731 -0.155872727 -0.011534783 0.28359209
Discus -0.483517984 0.47429111 1.992863636 0.109001186 -1.21791502
Pole.vault 0.006589328 0.00160751 0.003827273 -0.012027668 0.06342925
Javeline -0.436175099 0.56543854 1.992636364 0.106272134 -0.66853340
X1500m -0.668191897 0.69294901 -0.396386364 -0.251305929 2.95288419
X110m.hurdle Discus Pole.vault Javeline X1500m
X100m 0.10646542 -0.4835180 0.006589328 -0.4361751 -0.6681919
Long.jump -0.09000731 0.4742911 0.001607510 0.5654385 0.6929490
Shot.put -0.15587273 1.9928636 0.003827273 1.9926364 -0.3963864
High.jump -0.01153478 0.1090012 -0.012027668 0.1062721 -0.2513059
X400m 0.28359209 -1.2179150 0.063429249 -0.6685334 2.9528842
X110m.hurdle 0.23411581 -0.8522518 0.017505138 -0.1694377 -0.2303729
Discus -0.85225178 11.0625316 -0.157434783 4.6430569 2.5717646
Pole.vault 0.01750514 -0.1574348 0.062278261 0.2798992 0.9729648
Javeline -0.16943775 4.6430569 0.279899209 24.3063261 4.4479336
X1500m -0.23037292 2.5717646 0.972964822 4.4479336 100.2299542
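For centered data, cov() is simply the matrix cross-product divided by n - 1; a minimal check on a hypothetical matrix:

```r
set.seed(2)
# hypothetical 5 x 4 matrix, centered column-wise
Xc <- scale(matrix(rnorm(20), nrow = 5), center = TRUE, scale = FALSE)

# covariance of centered data: t(Xc) %*% Xc / (n - 1)
manual_cov <- crossprod(Xc) / (nrow(Xc) - 1)
```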
4. Eigenvectors and eigenvalues of the covariance matrix:
> deca_rotation<-eigen(cov_deca)
> deca_rotation
eigen() decomposition
$values
[1] 1.006840e+02 2.578783e+01 9.968741e+00 8.647143e-01 2.822286e-01 1.232549e-01
[7] 4.927276e-02 4.052578e-02 1.742956e-02 3.137752e-03
$vectors
[,1] [,2] [,3] [,4] [,5]
[1,] 0.006986928 -0.021005809 0.035210712 -0.19034812 -0.071024702
[2,] -0.007312503 0.025532679 -0.029814632 0.16625248 0.067170085
[3,] 0.002193901 0.101099662 -0.134997417 -0.01096451 0.970306608
[4,] 0.002402614 0.006076201 -0.008013291 0.01753102 0.059193880
[5,] -0.028740478 -0.049811926 0.111757272 -0.91565675 0.057823061
[6,] 0.002581313 -0.017216936 0.081278216 -0.27124976 -0.055223739
[7,] -0.031355058 0.306079695 -0.924526837 -0.14964630 -0.160947293
[8,] -0.009777235 0.005798730 0.025326389 -0.01932767 0.101075155
[9,] -0.059802247 0.942306214 0.324118956 -0.01034241 -0.053718141
[10,] -0.997195770 -0.064897755 0.006522677 0.02866866 0.006768862
[,6] [,7] [,8] [,9] [,10]
[1,] -0.291402114 0.320507924 0.13921549 0.861711536 -0.0861683397
[2,] 0.251412655 -0.413861269 -0.75078605 0.408212303 0.0380464490
[3,] -0.109846963 -0.021747143 0.09819153 0.047978292 0.0738364582
[4,] 0.002220758 -0.234201394 0.08344638 -0.013058493 -0.9664847617
[5,] 0.361562371 -0.078839429 -0.06140280 -0.041452501 0.0008151440
[6,] -0.840883209 -0.290782847 -0.29340766 -0.193359991 0.0367327281
[7,] -0.030340336 0.017305544 -0.02788910 -0.013974306 -0.0095425190
[8,] -0.026222073 0.761309040 -0.55608405 -0.221336323 -0.2239225812
[9,] 0.011114470 0.002280716 0.01715753 0.003689028 0.0005149581
[10,] -0.016174502 -0.001957392 0.01320850 0.006200789 -0.0005126593
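eigen() returns the eigenvalues in decreasing order together with orthonormal eigenvectors; both properties can be verified directly on a hypothetical covariance matrix:

```r
set.seed(3)
A <- cov(matrix(rnorm(40), nrow = 8))  # a symmetric 5 x 5 covariance matrix
e <- eigen(A)

# defining property: A %*% v = lambda * v for each eigenvector v
v1  <- e$vectors[, 1]
lhs <- as.vector(A %*% v1)
rhs <- e$values[1] * v1

# for a symmetric matrix the eigenvectors are orthonormal: t(V) %*% V = I
gram <- crossprod(e$vectors)
```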
5. Multiply the transposed eigenvector matrix by the transposed centered matrix and take the first two rows, giving PC1 and PC2:
> (t(deca_rotation$vectors)%*%t(center_d))[1:2,]
SEBRLE CLAY BERNARD YURKOV ZSIVOCZKY McMULLEN MARTINEAU
[1,] -13.996304 -23.795992 -2.289494 1.160683 10.082984 -7.027159 16.050504
[2,] 2.517846 1.125352 1.965726 4.489585 -2.806377 -3.391762 -4.696647
HERNU BARRAS NOOL BOURGUIGNON Sebrle Clay Karpov
[1,] -7.128017 -3.789459 11.582324 -13.417010 -2.892805 -4.901404 -0.149155
[2,] -2.451095 -4.777138 -3.270108 -6.749538 11.953568 11.340372 -1.102863
Macey Warners Zsivoczky Hernu Bernard Schwarzl Pogorelov
[1,] 12.38074 0.1296911 8.046438 13.6052392 1.815421 4.542704 -9.409190
[2,] 1.30346 -3.8728285 4.827768 -0.4938572 -3.599063 -3.256776 -6.154349
Schoenbeck Barras
[1,] -1.049881 10.449143
[2,] 1.358511 5.740215
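These by-hand scores should agree with prcomp(), except that an eigenvector is only defined up to sign, so a whole component may come out mirrored. A sketch with hypothetical data:

```r
set.seed(4)
X  <- matrix(rnorm(30), nrow = 6)  # hypothetical 6 x 5 matrix
Xc <- scale(X, center = TRUE, scale = FALSE)

manual_scores <- Xc %*% eigen(cov(Xc))$vectors  # the by-hand route above
pr <- prcomp(X)                                 # prcomp centers by default

# compare component by component, allowing a global sign flip per PC
same_up_to_sign <- sapply(seq_len(ncol(pr$x)), function(j)
  isTRUE(all.equal(abs(manual_scores[, j]), abs(unname(pr$x[, j])),
                   tolerance = 1e-6)))
```

This per-component sign ambiguity is also why PCA plots from different tools can look like mirror images of each other while the relative positions of the samples are identical.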
The PCA we have been laboring toward was long ago wrapped into R packages by people who know statistics inside out. You have to admit it: math is king!
Let's compare against the results of R's ready-made PCA functions:
- FactoMineR and factoextra together for PCA and visualization;
- prcomp (base stats) plus autoplot for PCA and visualization;
######prcomp is the PCA function shipped with stats
deca_re<-prcomp(decathlon2.active)
#####rotation: the matrix containing the eigenvectors (loadings)
deca_re$rotation[, 1]
#####x: if retx = TRUE, the rotated data is returned (the centered, and scaled if requested, matrix multiplied by the rotation matrix); cov(x) is diagonal, with the eigenvalues on the diagonal
deca_re$x
#####ggfortify enriches ggplot2 so that autoplot can handle prcomp results
library(ggfortify)
autoplot(prcomp(decathlon2.active),label=TRUE,loadings=TRUE)
res<-PCA(X = decathlon2.active, scale.unit = FALSE, ncp = 5, graph = FALSE)
res$ind$coord
#####printing res gives a fuller description of the PCA() result
res
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 23 individuals, described by 10 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
######fviz_pca_ind plots the individuals; fviz_pca_var plots the variables
fviz_pca_ind(res)
#####processing coord yields the eigenvectors, which match rotation from prcomp
loadings<-sweep(res$var$coord,2,sqrt(res$eig[1:5,1]),FUN="/")
loadings
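The same relationship can be checked with prcomp alone: the FactoMineR-style variable coord is the loading scaled by sqrt(eigenvalue), i.e. by sdev, so the sweep above simply undoes that scaling. A sketch on hypothetical data:

```r
set.seed(5)
X  <- matrix(rnorm(50), nrow = 10)  # hypothetical 10 x 5 matrix
pr <- prcomp(X)

# variable coordinates a la FactoMineR: loading * sqrt(eigenvalue) = loading * sdev
coord <- sweep(pr$rotation, 2, pr$sdev, "*")

# dividing back by sqrt(eigenvalue) recovers the loadings (the rotation matrix)
recovered <- sweep(coord, 2, pr$sdev, "/")
```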
(figure: PCA plot)
At first glance the coordinates look different from the earlier figure in every way; after plotting with $x, the difference turns out to be one of axis scaling, and on closer comparison the two figures are mirror images of each other, so the relative positions are actually the same:
(figure: prcomp plot of deca_re$x)