R语言-相关系数计算(一)

作者: 医科研 | 来源:发表于2019-06-26 17:32 被阅读0次

    应用R语言完成相关性检验,相关性矩阵及相关性可视化
    首先安装相应的R包

    require(ggpubr)
    ## Loading required package: ggpubr
    ## Loading required package: ggplot2
    ## Loading required package: magrittr
    require(tidyverse)
    ## Loading required package: tidyverse
    ## -- Attaching packages ---------------------------------------------------- tidyverse 1.2.1 --
    ## √ tibble  2.0.1       √ purrr   0.3.0  
    ## √ tidyr   0.8.2       √ dplyr   0.8.0.1
    ## √ readr   1.3.1       √ stringr 1.4.0  
    ## √ tibble  2.0.1       √ forcats 0.4.0
    ## -- Conflicts ------------------------------------------------------- tidyverse_conflicts() --
    ## x tidyr::extract()   masks magrittr::extract()
    ## x dplyr::filter()    masks stats::filter()
    ## x dplyr::lag()       masks stats::lag()
    ## x purrr::set_names() masks magrittr::set_names()
    require(Hmisc)#相关性矩阵计算
    ## Loading required package: Hmisc
    ## Loading required package: lattice
    ## Loading required package: survival
    ## Loading required package: Formula
    ## 
    ## Attaching package: 'Hmisc'
    ## The following objects are masked from 'package:dplyr':
    ## 
    ##     src, summarize
    ## The following objects are masked from 'package:base':
    ## 
    ##     format.pval, units
    require(corrplot)#相关性图
    ## Loading required package: corrplot
    ## corrplot 0.84 loaded
    

    相关性分析的方法
    Pearson correlation,属于最常用的一种,用于计算两个变量之间的线下相关系数,应用于正态分布的数据,可以绘制线下回归曲线 Kendall 以及 Spearman法,是一只能怪基于秩的相关性检验,也称非参数

    在R中的计算

    相关系数可以使用函数计算

    cor函数计算相关系数 cor.test 计算相关系数,返回结果包括相关系数,统计量等

    举个简单示例

    当x,y是相同的数值向量时,很简单

    x=rnorm(10)
    y=1:10
    cor(x,y,method = "pearson")
    ## [1] -0.5518297
    cor(x, y, method=c("pearson", "kendall", "spearman"))
    ## [1] -0.5518297
    ##如果存在缺失值
    cor(x, y,  method = "pearson", use = "complete.obs")
    ## [1] -0.5518297
    cor.test(x, y,  method = "pearson", use = "complete.obs")##处理缺失值
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  x and y
    ## t = -1.8716, df = 8, p-value = 0.09817
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.8768111  0.1192188
    ## sample estimates:
    ##        cor 
    ## -0.5518297
    

    示例分析

    使用大家熟悉的mtcars作为示例数据集

    class(mtcars)
    ## [1] "data.frame"
    head(mtcars)
    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    dim(mtcars)#是一个32行,11列的数据框
    ## [1] 32 11
    先调整下数据格式
    my_data <- mtcars
    my_data$cyl <- factor(my_data$cyl)##转换位因子作分类
    str(my_data)
    ## 'data.frame':    32 obs. of  11 variables:
    ##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    ##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
    ##  $ disp: num  160 160 108 258 360 ...
    ##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
    ##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    ##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
    ##  $ qsec: num  16.5 17 18.6 19.4 17 ...
    ##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
    ##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
    ##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
    ##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
    

    下面我们来计算下mpg与drat的相关系数,并且可视化

    library(ggpubr)
    ggscatter(my_data,
              x = "drat", #x变量
              y = "mpg",#y变量
              add = "reg.line",##拟合曲线
              conf.int = TRUE,##置信区间阴影带
              cor.coef = TRUE, ##系数
              cor.method = "pearson",#方法
              xlab = "drat", ## x轴
              ylab = "mg")## y轴
    
    Fig1

    预先测试数据是否符合正态分布

    Shapiro-Wilk,零检验:正态分布, p>0.05,认为与正态分布没有差异
    如果检验不符合正态分布或接近,建议使用非参数检验,spearman等

    Shapiro-Wilk 变量正态分布检验

    shapiro.test(my_data$mpg) # => p = 0.1229
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  my_data$mpg
    ## W = 0.94756, p-value = 0.1229
    shapiro.test(my_data$drat)# => p = 0.1101
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  my_data$drat
    ## W = 0.94588, p-value = 0.1101
    # Q-Q图绘制给定样本与理论正态分布之间的相关性,非常相符
    library("ggpubr")
    # Check for the normality of "mpg""
    ggqqplot(my_data$mpg, ylab = "MPG")
    
    Fig2
    res <- cor.test(my_data$drat, my_data$mpg, method = "pearson")
    res
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  my_data$drat and my_data$mpg
    ## t = 5.096, df = 30, p-value = 1.776e-05
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  0.4360484 0.8322010
    ## sample estimates:
    ##       cor 
    ## 0.6811719
    

    t值是t检验的统计量
    df指自由度
    p-value指t检验的检验水平
    conf.int指confidence interval相关系数的95%可信区间
    sample estimates 是相关系数的值

    该结果可以解释为mpg与draft有相关性,相关系数为0.68,p<0.05

    考虑到有些新手同学不会获取cor.test的结果

    实际上就是一个列表数据的获取

    str(res)##发现其实检验结果是返回的列表
    ## List of 9
    ##  $ statistic  : Named num 5.1
    ##   ..- attr(*, "names")= chr "t"
    ##  $ parameter  : Named int 30
    ##   ..- attr(*, "names")= chr "df"
    ##  $ p.value    : num 1.78e-05
    ##  $ estimate   : Named num 0.681
    ##   ..- attr(*, "names")= chr "cor"
    ##  $ null.value : Named num 0
    ##   ..- attr(*, "names")= chr "correlation"
    ##  $ alternative: chr "two.sided"
    ##  $ method     : chr "Pearson's product-moment correlation"
    ##  $ data.name  : chr "my_data$drat and my_data$mpg"
    ##  $ conf.int   : num [1:2] 0.436 0.832
    ##   ..- attr(*, "conf.level")= num 0.95
    ##  - attr(*, "class")= chr "htest"
    res$p.value##获取pvalue
    ## [1] 1.77624e-05
    res$conf.int## 95% CI
    ## [1] 0.4360484 0.8322010
    ## attr(,"conf.level")
    ## [1] 0.95
    res$estimate## cor
    ##       cor 
    ## 0.6811719
    

    Kendall法相关性检验

    注意其中tau是相关系数的值

    res2 <- cor.test(my_data$mpg, my_data$wt, method = "kendall")
    ## Warning in cor.test.default(my_data$mpg, my_data$wt, method = "kendall"):
    ## Cannot compute exact p-value with ties
    res2
    ## 
    ##  Kendall's rank correlation tau
    ## 
    ## data:  my_data$mpg and my_data$wt
    ## z = -5.7981, p-value = 6.706e-09
    ## alternative hypothesis: true tau is not equal to 0
    ## sample estimates:
    ##        tau 
    ## -0.7278321
    

    Spearman法相关性系数-rho是其相关性系数

    res3 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")
    ## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "spearman"):
    ## Cannot compute exact p-value with ties
    res3
    ## 
    ##  Spearman's rank correlation rho
    ## 
    ## data:  my_data$wt and my_data$mpg
    ## S = 10292, p-value = 1.488e-11
    ## alternative hypothesis: true rho is not equal to 0
    ## sample estimates:
    ##       rho 
    ## -0.886422
    

    相关文章

      网友评论

        本文标题:R语言-相关系数计算(一)

        本文链接:https://www.haomeiwen.com/subject/tjchcctx.html