美文网首页R语言4_PCA and tSNE
PCA-Statistics is the new sexy!!

PCA-Statistics is the new sexy!!

作者: Juan_NF | 来源:发表于2019-04-28 16:12 被阅读111次
    Shelork Holmes里的一句台词是,Brainy is the new sexy;学了PCA的推导后,我觉得,统计学对我简直是降维打击,所以改下这句话,Statistics is the new sexy!!!

    理解PCA

    PCA是为了更好地展示多维数据,通过线性转化,展示保留最多信息的主成分;将样本尽可能地分散地展示在坐标轴中达到可视化的目的;

    • PCA的理论假设是:方差越大,信息量越大;
    • 拿生信数据来说,大概率上,我们是要看数据的分组情况,所以要在坐标轴上表示的point是样本,舍掉的是基因层面的component,即n个基因获得n个component,依据方差最大化,取前k(0<k<n)个component;
    • 本质上计算出n个特征向量,给予矩阵n个移动方向,最后保存了k个移动后的结果;

    PCA步骤:

    1)数据为m行n列的原始矩阵(sample为行,gene为列)
    2)矩阵X每一个元素减去该列的均值(中心化)
    目的是使所有维度的偏移都是以0为基础的(我们必须对数据中individual(sample)和observations(gene)有区分和了解)
    3)求出协方差矩阵
    协方差矩阵为对称矩阵,对角线为每行方差,其他元素分别为不同行的协方差,协方差表示的是各行元素之间的线性相关性;
    4)目的是协方差矩阵中除对角线外的元素为0,即实现协方差矩阵对角化;
    协方差矩阵为可对角化矩阵,对角化后的矩阵中,对角线上的元素保持不变,但对角线外的元素为0;这样便实现了使各行元素之间无相关性;对协方差矩阵进行特征值分解即可获得P;

    image.png

    5)将P按特征值进行排序,因为Y=PX,所以,中心化后的矩阵(转置)与特征向量矩阵(转置)乘积后得到新的样本矩阵,取前两行即PC1和PC2;

    我们用R的基础函数简略演示下新的样本矩阵的由来,也更好的理解一下PCA
    1.以下是原始矩阵:
    > rm(list=ls())
    > library("FactoMineR")
    > library("factoextra")
    > data("decathlon2")
    > decathlon2.active <- decathlon2[1:23, 1:10]
    > decathlon2.active
                X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
    SEBRLE      11.04      7.58    14.83      2.07 49.81        14.69  43.75
    CLAY        10.76      7.40    14.26      1.86 49.37        14.05  50.72
    BERNARD     11.02      7.23    14.25      1.92 48.93        14.99  40.87
    YURKOV      11.34      7.09    15.19      2.10 50.42        15.31  46.26
    ZSIVOCZKY   11.13      7.30    13.48      2.01 48.62        14.17  45.67
    McMULLEN    10.83      7.31    13.76      2.13 49.91        14.38  44.41
    MARTINEAU   11.64      6.81    14.57      1.95 50.14        14.93  47.60
    HERNU       11.37      7.56    14.41      1.86 51.10        15.06  44.99
    BARRAS      11.33      6.97    14.09      1.95 49.48        14.48  42.10
    NOOL        11.33      7.27    12.68      1.98 49.20        15.29  37.92
    BOURGUIGNON 11.36      6.80    13.46      1.86 51.16        15.67  40.49
    Sebrle      10.85      7.84    16.36      2.12 48.36        14.05  48.72
    Clay        10.44      7.96    15.23      2.06 49.19        14.13  50.11
    Karpov      10.50      7.81    15.93      2.09 46.81        13.97  51.65
    Macey       10.89      7.47    15.73      2.15 48.97        14.56  48.34
    Warners     10.62      7.74    14.48      1.97 47.97        14.01  43.73
    Zsivoczky   10.91      7.14    15.31      2.12 49.40        14.95  45.62
    Hernu       10.97      7.19    14.65      2.03 48.73        14.25  44.72
    Bernard     10.69      7.48    14.80      2.12 49.13        14.17  44.75
    Schwarzl    10.98      7.49    14.01      1.94 49.76        14.25  42.43
    Pogorelov   10.95      7.31    15.10      2.06 50.79        14.21  44.60
    Schoenbeck  10.90      7.30    14.77      1.88 50.30        14.34  44.41
    Barras      11.14      6.99    14.91      1.94 49.41        14.37  44.83
                Pole.vault Javeline X1500m
    SEBRLE            5.02    63.19 291.70
    CLAY              4.92    60.15 301.50
    BERNARD           5.32    62.77 280.10
    YURKOV            4.72    63.44 276.40
    ZSIVOCZKY         4.42    55.37 268.00
    McMULLEN          4.42    56.37 285.10
    MARTINEAU         4.92    52.33 262.10
    HERNU             4.82    57.19 285.10
    BARRAS            4.72    55.40 282.00
    NOOL              4.62    57.44 266.60
    BOURGUIGNON       5.02    54.68 291.70
    Sebrle            5.00    70.52 280.01
    Clay              4.90    69.71 282.00
    Karpov            4.60    55.54 278.11
    Macey             4.40    58.46 265.42
    Warners           4.90    55.39 278.05
    Zsivoczky         4.70    63.45 269.54
    Hernu             4.80    57.76 264.35
    Bernard           4.40    55.27 276.31
    Schwarzl          5.10    56.32 273.56
    Pogorelov         5.00    53.45 287.63
    Schoenbeck        5.00    60.89 278.82
    Barras            4.60    64.55 267.09
    
    2.以下是对原始矩阵的(维度)进行中心化的结果:
    > #######R_base协方差|特征值|特征向量
    > #tmp<-rowMeans(t(decathlon2.active))
    > #center_deca<-t(decathlon2.active)-tmp
    > center_d<-scale(decathlon2.active,center=TRUE,scale=FALSE)
    > center_d
                      X100m   Long.jump Shot.put    High.jump       X400m
    SEBRLE       0.04043478  0.23043478     0.21  0.062608696  0.37695652
    CLAY        -0.23956522  0.05043478    -0.36 -0.147391304 -0.06304348
    BERNARD      0.02043478 -0.11956522    -0.37 -0.087391304 -0.50304348
    YURKOV       0.34043478 -0.25956522     0.57  0.092608696  0.98695652
    ZSIVOCZKY    0.13043478 -0.04956522    -1.14  0.002608696 -0.81304348
    McMULLEN    -0.16956522 -0.03956522    -0.86  0.122608696  0.47695652
    MARTINEAU    0.64043478 -0.53956522    -0.05 -0.057391304  0.70695652
    HERNU        0.37043478  0.21043478    -0.21 -0.147391304  1.66695652
    BARRAS       0.33043478 -0.37956522    -0.53 -0.057391304  0.04695652
    NOOL         0.33043478 -0.07956522    -1.94 -0.027391304 -0.23304348
    BOURGUIGNON  0.36043478 -0.54956522    -1.16 -0.147391304  1.72695652
    Sebrle      -0.14956522  0.49043478     1.74  0.112608696 -1.07304348
    Clay        -0.55956522  0.61043478     0.61  0.052608696 -0.24304348
    Karpov      -0.49956522  0.46043478     1.31  0.082608696 -2.62304348
    Macey       -0.10956522  0.12043478     1.11  0.142608696 -0.46304348
    Warners     -0.37956522  0.39043478    -0.14 -0.037391304 -1.46304348
    Zsivoczky   -0.08956522 -0.20956522     0.69  0.112608696 -0.03304348
    Hernu       -0.02956522 -0.15956522     0.03  0.022608696 -0.70304348
    Bernard     -0.30956522  0.13043478     0.18  0.112608696 -0.30304348
    Schwarzl    -0.01956522  0.14043478    -0.61 -0.067391304  0.32695652
    Pogorelov   -0.04956522 -0.03956522     0.48  0.052608696  1.35695652
    Schoenbeck  -0.09956522 -0.04956522     0.15 -0.127391304  0.86695652
    Barras       0.14043478 -0.35956522     0.29 -0.067391304 -0.02304348
                X110m.hurdle     Discus   Pole.vault   Javeline      X1500m
    SEBRLE        0.15608696 -1.4104348  0.223478261  4.0752174  13.8221739
    CLAY         -0.48391304  5.5595652  0.123478261  1.0352174  23.6221739
    BERNARD       0.45608696 -4.2904348  0.523478261  3.6552174   2.2221739
    YURKOV        0.77608696  1.0995652 -0.076521739  4.3252174  -1.4778261
    ZSIVOCZKY    -0.36391304  0.5095652 -0.376521739 -3.7447826  -9.8778261
    McMULLEN     -0.15391304 -0.7504348 -0.376521739 -2.7447826   7.2221739
    MARTINEAU     0.39608696  2.4395652  0.123478261 -6.7847826 -15.7778261
    HERNU         0.52608696 -0.1704348  0.023478261 -1.9247826   7.2221739
    BARRAS       -0.05391304 -3.0604348 -0.076521739 -3.7147826   4.1221739
    NOOL          0.75608696 -7.2404348 -0.176521739 -1.6747826 -11.2778261
    BOURGUIGNON   1.13608696 -4.6704348  0.223478261 -4.4347826  13.8221739
    Sebrle       -0.48391304  3.5595652  0.203478261 11.4052174   2.1321739
    Clay         -0.40391304  4.9495652  0.103478261 10.5952174   4.1221739
    Karpov       -0.56391304  6.4895652 -0.196521739 -3.5747826   0.2321739
    Macey         0.02608696  3.1795652 -0.396521739 -0.6547826 -12.4578261
    Warners      -0.52391304 -1.4304348  0.103478261 -3.7247826   0.1721739
    Zsivoczky     0.41608696  0.4595652 -0.096521739  4.3352174  -8.3378261
    Hernu        -0.28391304 -0.4404348  0.003478261 -1.3547826 -13.5278261
    Bernard      -0.36391304 -0.4104348 -0.396521739 -3.8447826  -1.5678261
    Schwarzl     -0.28391304 -2.7304348  0.303478261 -2.7947826  -4.3178261
    Pogorelov    -0.32391304 -0.5604348  0.203478261 -5.6647826   9.7521739
    Schoenbeck   -0.19391304 -0.7504348  0.203478261  1.7752174   0.9421739
    Barras       -0.16391304 -0.3304348 -0.196521739  5.4352174 -10.7878261
    attr(,"scaled:center")
           X100m    Long.jump     Shot.put    High.jump        X400m X110m.hurdle 
       10.999565     7.349565    14.620000     2.007391    49.433043    14.533913 
          Discus   Pole.vault     Javeline       X1500m 
       45.160435     4.796522    59.114783   277.877826 
    
    • 注意scale函数的两个参数均以为单位进行;
    • center是指每列的元素减去该列的均值,PCA中这个操作是为了保证所有维度的偏移是以0为基准的;在scale函数中,center是默认执行的;
    • scale是指在center功能执行后,每列元素除以其标准差;在scale函数中,scale也是默认执行的;
    3.以下是对中心化后矩阵求协方差矩阵结果:
    > cov_deca<-cov(center_d)
    > cov_deca
                        X100m   Long.jump     Shot.put    High.jump       X400m
    X100m         0.090686166 -0.07183202 -0.113395455 -0.011728458  0.18034684
    Long.jump    -0.071832016  0.09821344  0.116295455  0.010353360 -0.16100771
    Shot.put     -0.113395455  0.11629545  0.713518182  0.043518182 -0.26718636
    High.jump    -0.011728458  0.01035336  0.043518182  0.009347431 -0.03579170
    X400m         0.180346838 -0.16100771 -0.267186364 -0.035791700  1.01420395
    X110m.hurdle  0.106465415 -0.09000731 -0.155872727 -0.011534783  0.28359209
    Discus       -0.483517984  0.47429111  1.992863636  0.109001186 -1.21791502
    Pole.vault    0.006589328  0.00160751  0.003827273 -0.012027668  0.06342925
    Javeline     -0.436175099  0.56543854  1.992636364  0.106272134 -0.66853340
    X1500m       -0.668191897  0.69294901 -0.396386364 -0.251305929  2.95288419
                 X110m.hurdle     Discus   Pole.vault   Javeline      X1500m
    X100m          0.10646542 -0.4835180  0.006589328 -0.4361751  -0.6681919
    Long.jump     -0.09000731  0.4742911  0.001607510  0.5654385   0.6929490
    Shot.put      -0.15587273  1.9928636  0.003827273  1.9926364  -0.3963864
    High.jump     -0.01153478  0.1090012 -0.012027668  0.1062721  -0.2513059
    X400m          0.28359209 -1.2179150  0.063429249 -0.6685334   2.9528842
    X110m.hurdle   0.23411581 -0.8522518  0.017505138 -0.1694377  -0.2303729
    Discus        -0.85225178 11.0625316 -0.157434783  4.6430569   2.5717646
    Pole.vault     0.01750514 -0.1574348  0.062278261  0.2798992   0.9729648
    Javeline      -0.16943775  4.6430569  0.279899209 24.3063261   4.4479336
    X1500m        -0.23037292  2.5717646  0.972964822  4.4479336 100.2299542
    
    4.对协方差矩阵求特征向量和特征值:
    > deca_rotation<-eigen(cov_deca)
    > deca_rotation
    eigen() decomposition
    $values
     [1] 1.006840e+02 2.578783e+01 9.968741e+00 8.647143e-01 2.822286e-01 1.232549e-01
     [7] 4.927276e-02 4.052578e-02 1.742956e-02 3.137752e-03
    
    $vectors
                  [,1]         [,2]         [,3]        [,4]         [,5]
     [1,]  0.006986928 -0.021005809  0.035210712 -0.19034812 -0.071024702
     [2,] -0.007312503  0.025532679 -0.029814632  0.16625248  0.067170085
     [3,]  0.002193901  0.101099662 -0.134997417 -0.01096451  0.970306608
     [4,]  0.002402614  0.006076201 -0.008013291  0.01753102  0.059193880
     [5,] -0.028740478 -0.049811926  0.111757272 -0.91565675  0.057823061
     [6,]  0.002581313 -0.017216936  0.081278216 -0.27124976 -0.055223739
     [7,] -0.031355058  0.306079695 -0.924526837 -0.14964630 -0.160947293
     [8,] -0.009777235  0.005798730  0.025326389 -0.01932767  0.101075155
     [9,] -0.059802247  0.942306214  0.324118956 -0.01034241 -0.053718141
    [10,] -0.997195770 -0.064897755  0.006522677  0.02866866  0.006768862
                  [,6]         [,7]        [,8]         [,9]         [,10]
     [1,] -0.291402114  0.320507924  0.13921549  0.861711536 -0.0861683397
     [2,]  0.251412655 -0.413861269 -0.75078605  0.408212303  0.0380464490
     [3,] -0.109846963 -0.021747143  0.09819153  0.047978292  0.0738364582
     [4,]  0.002220758 -0.234201394  0.08344638 -0.013058493 -0.9664847617
     [5,]  0.361562371 -0.078839429 -0.06140280 -0.041452501  0.0008151440
     [6,] -0.840883209 -0.290782847 -0.29340766 -0.193359991  0.0367327281
     [7,] -0.030340336  0.017305544 -0.02788910 -0.013974306 -0.0095425190
     [8,] -0.026222073  0.761309040 -0.55608405 -0.221336323 -0.2239225812
     [9,]  0.011114470  0.002280716  0.01715753  0.003689028  0.0005149581
    [10,] -0.016174502 -0.001957392  0.01320850  0.006200789 -0.0005126593
    
    5.中心化后矩阵(转置)与特征向量(转置)乘积后取前两行即PC1和PC2:
    > (t(deca_rotation$vectors)%*%t(center_d))[1:2,]
             SEBRLE       CLAY   BERNARD   YURKOV ZSIVOCZKY  McMULLEN MARTINEAU
    [1,] -13.996304 -23.795992 -2.289494 1.160683 10.082984 -7.027159 16.050504
    [2,]   2.517846   1.125352  1.965726 4.489585 -2.806377 -3.391762 -4.696647
             HERNU    BARRAS      NOOL BOURGUIGNON    Sebrle      Clay    Karpov
    [1,] -7.128017 -3.789459 11.582324  -13.417010 -2.892805 -4.901404 -0.149155
    [2,] -2.451095 -4.777138 -3.270108   -6.749538 11.953568 11.340372 -1.102863
            Macey    Warners Zsivoczky      Hernu   Bernard  Schwarzl Pogorelov
    [1,] 12.38074  0.1296911  8.046438 13.6052392  1.815421  4.542704 -9.409190
    [2,]  1.30346 -3.8728285  4.827768 -0.4938572 -3.599063 -3.256776 -6.154349
         Schoenbeck    Barras
    [1,]  -1.049881 10.449143
    [2,]   1.358511  5.740215
    

    我们汲汲以求的PCA其实早有对统计学烂熟于心的人做了R包,不得不说,数学才是王道啊!!!

    对比下在R的现成的PCA功能的结果
    • FactoMineR和factoextra配合做PCA和可视化;
    • prcomp(stats base级别)和autoplot配合做PCA和可视化;
    ######prcomp为stats自带的PCA函数
    deca_re<-prcomp(decathlon2.active)
    #####rotation-包含特征向量的矩阵
    deca_re$rotation[, 1]
    #####x-如果retx参数设为TRUE,则返回rotated矩阵(中心化(scaled,如果有设定)矩阵乘以rotation矩阵)的结果;cov(x)为协方差矩阵;
    deca_re$x
    #####ggfortify使ggplot2功能更加丰富,使autoplot能够处理prcomp的结果
    library(ggfortify)
    autoplot(prcomp(decathlon2.active),label=TRUE,loadings=TRUE)
    res<-PCA(X = decathlon2.active, scale.unit = FALSE, ncp = 5, graph = FALSE)
    res$ind$coord
    #####res调出对PCA函数结果的更详尽说明
    res
    **Results for the Principal Component Analysis (PCA)**
    The analysis was performed on 23 individuals, described by 10 variables
    *The results are available in the following objects:
    
       name               description                          
    1  "$eig"             "eigenvalues"                        
    2  "$var"             "results for the variables"          
    3  "$var$coord"       "coord. for the variables"           
    4  "$var$cor"         "correlations variables - dimensions"
    5  "$var$cos2"        "cos2 for the variables"             
    6  "$var$contrib"     "contributions of the variables"     
    7  "$ind"             "results for the individuals"        
    8  "$ind$coord"       "coord. for the individuals"         
    9  "$ind$cos2"        "cos2 for the individuals"           
    10 "$ind$contrib"     "contributions of the individuals"   
    11 "$call"            "summary statistics"                 
    12 "$call$centre"     "mean of the variables"              
    13 "$call$ecart.type" "standard error of the variables"    
    14 "$call$row.w"      "weights for the individuals"        
    15 "$call$col.w"      "weights for the variables"
    ######fviz_pca_ind对individual绘图,fviz_pca_var对variable绘图
    fviz_pca_ind(res)
    #####对coord处理后获得特征向量,与prcomp中的rotation一致
    loadings<-sweep(res$var$coord,2,sqrt(res$eig[1:5,1]),FUN="/")
    loadings
    
    PCA
    看着跟前面的图坐标位置哪儿哪儿不一样,后面再用$x画图后,看到是坐标比例的差异,再对比发现,与上图是某种镜像的关系,相对位置其实是一样的:
    prcomp
    Rplot_deca$x

    参考内容
    1.https://www.zhihu.com/question/36481348
    2.http://www.sthda.com/english/wiki/print.php?id=204
    3.https://wenku.baidu.com/view/ce7ee04bcc175527072208ca.html
    4.https://zhuanlan.zhihu.com/p/21580949
    5.https://groups.google.com/forum/#!topic/factominer-users/BRN8jRm-_EM
    6.http://blog.sciencenet.cn/home.php?mod=space&uid=443073&do=blog&id=321347


    课程分享
    生信技能树全球公益巡讲
    https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
    B站公益74小时生信工程师教学视频合辑
    https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
    招学徒:
    https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw

    相关文章

      网友评论

        本文标题:PCA-Statistics is the new sexy!!

        本文链接:https://www.haomeiwen.com/subject/nnzknqtx.html