美文网首页
R语言统计5:方差分析(ANOVA,F- test)

R语言统计5:方差分析(ANOVA,F- test)

作者: cc的生信随记 | 来源:发表于2023-04-11 08:50 被阅读0次

方差分析(Analysis of Variance,简称ANOVA),又称“变异数分析”,是R.A.Fisher发明的,用于两个及两个以上样本均数差别的显著性检验。根据影响实验指标条件的个数可以区分为:单因素方差分析,双因素方差分析和多因素方差分析

aov()函数是为平衡设计准备的,对于非平衡设计它的结果很难解释

1. 单因素方差分析

前提条件: 1) 样本独立性
\ \ \ \ \ \ \ 2) 样本总体必须服从正态分布
\ \ \ \ \ \ \ 3) 方差齐性

原假设:H0: μ_1 = μ_2 = ... = μ_n
\ \ \ \ \ \ H1: μ_1,μ_2,...,μ_n不完全相等

1.1 先进行三大检验

# 使用R自带的数据集PlantGrowth,研究两个处理和一个对照组对植物产量的影响
head(PlantGrowth)
##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl

# 独立性检验
mytable1 <- table(PlantGrowth$group, PlantGrowth$weight)
chisq.test(mytable1)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable1
## X-squared = 57, df = 56, p-value = 0.4377
## 
## Warning message:
## In chisq.test(mytable1) : Chi-squared approximation may be incorrect

# 正态性检验
mydata <- PlantGrowth[,1]
shapiro.test(mydata[1:10]) # 对应group 每组水平下的检验
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata[1:10]
## W = 0.95668, p-value = 0.7475

shapiro.test(mydata[11:20])
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata[11:20]
## W = 0.93041, p-value = 0.4519

shapiro.test(mydata[21:30])
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata[21:30]
## W = 0.94101, p-value = 0.5643

# 方差齐性
bartlett.test(weight~group, data = PlantGrowth)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  weight by group
## Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371

★ 三大检验的p值均>0.05,也就是都接受了原假设,满足条件

1.2 单因素方差分析

ANOVA <- aov(weight~group,data = PlantGrowth)# weight 因变量在~的左边,group 自变量/处理在~右边
summary(ANOVA)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        2  3.766  1.8832   4.846 0.0159 *
## Residuals   27 10.492  0.3886                 
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

★ p = 0.0159 < 0.05,拒绝原假设(H0:3组样本数据均值相等,无差异),说明3组样本均值有差异,不同处理方式对植物产量有显著影响

2. 双因素方差分析

原假设
A因素提出的假设为
H0:μ_1 = μ_2 = … = μ_n(假设A因素各水平间没有显著差异,也即A因素对试验指标无显著影响)
H1:μ_i不全相等(i = 1,2,…,n)(假设A因素各水平间有显著差异,也即A因素对试验指标有显著影响)
B因素提出的假设为
H0:μ_1 = μ_2 = … = μ_n(假设B因素各水平间没有显著差异,也即B因素对试验指标无显著影响)
H1:μ_i不全相等(i = 1,2,…,n)(假设B因素各水平间有显著差异,也即B因素对试验指标有显著影响)

2.1 初步检验

# 使用ToothGrowth数据集-随机分配60只豚鼠,分别采取两种喂食方式(supp:橙汁或维生素C)各种喂食方式中抗坏血酸的剂量有三种水平(dose:0.5mg,1mg,2mg),牙齿长度为因变量
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

shapiro.test(ToothGrowth$len)      #正态性检验
## 
##  Shapiro-Wilk normality test
## 
## data:  ToothGrowth$len
## W = 0.96743, p-value = 0.1091

bartlett.test(ToothGrowth$len ~ ToothGrowth$supp, data = ToothGrowth)   #方差齐次检验
## 
##  Bartlett test of homogeneity of variances
## 
## data:  ToothGrowth$len by ToothGrowth$supp
## Bartlett's K-squared = 1.4217, df = 1, p-value = 0.2331

bartlett.test(ToothGrowth$len ~ ToothGrowth$dose, data = ToothGrowth)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  ToothGrowth$len by ToothGrowth$dose
## Bartlett's K-squared = 0.66547, df = 2, p-value = 0.717

★ p-value = 0.1091,数据满足正态性假设,因素A、B均满足方差齐性要求。

2.2 双因素方差分析

2.2.1 \color{green}{不含有交互项}的双因素方差分析

aov_data1 <- aov(ToothGrowth$len~ToothGrowth$supp + ToothGrowth$dose, data = ToothGrowth) 
#  查看结果
aov_data1
## Call:
##    aov(formula = ToothGrowth$len ~ ToothGrowth$supp + ToothGrowth$dose, 
##     data = ToothGrowth)
## 
## Terms:
##                 ToothGrowth$supp ToothGrowth$dose Residuals
## Sum of Squares           205.350         2224.304  1022.555
## Deg. of Freedom                1                1        57
## 
## Residual standard error: 4.235512
## Estimated effects may be unbalanced

summary(aov_data1)
##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## ToothGrowth$supp  1  205.4   205.4   11.45   0.0013 ** 
## ToothGrowth$dose  1 2224.3  2224.3  123.99 6.31e-16 ***
## Residuals        57 1022.6    17.9                     
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

★ 种喂食方式(supp)和剂量(dose)均具有显著性

2.2.2 \color{green}{含有交互项}的双因素方差分析

fit <- aov(len ~ supp * dose, data = ToothGrowth) 
fit
## Call:
##    aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## Terms:
##                      supp      dose supp:dose Residuals
## Sum of Squares   205.3500 2224.3043   88.9201  933.6349
## Deg. of Freedom         1         1         1        56
## 
## Residual standard error: 4.083142
## Estimated effects may be unbalanced

summary(fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4  12.317 0.000894 ***
## dose         1 2224.3  2224.3 133.415  < 2e-16 ***
## supp:dose    1   88.9    88.9   5.333 0.024631 *  
## Residuals   56  933.6    16.7                     
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

★ 主效应和交互效应都很显著

3. 多因素方差分析

多因素方差分析是研究两个或两个以上因素对因变量的作用和影响以及这些因素共同作用的影响,而其核心内容则是检验在不同控制变量的不同交叉水平下,各交叉分组下样品数据所带来的总体均值,有无显著性差异,进而判断多个因素是否对观测变量产生了显著影响

对于平衡设计来说,多因素方差分析的结果与变量顺序无关;而对于非平衡设计来说,存在三种类型的多因素方差分析:I型、II型和III型,而aov()函数使用的是I型方差分析,它的结果与变量顺序有关。

# 使用mtcars数据集的四个变量
library(dplyr) 
mpg <- mtcars %>%
  select(mpg, cyl, gear, carb) %>%
  mutate(cyl = factor(cyl),
         gear = factor(gear),
         carb = factor(carb))
head(mpg)
##                    mpg cyl gear carb
## Mazda RX4         21.0   6    4    4
## Mazda RX4 Wag     21.0   6    4    4
## Datsun 710        22.8   4    4    1
## Hornet 4 Drive    21.4   6    3    1
## Hornet Sportabout 18.7   8    3    2
## Valiant           18.1   6    3    1

# 统计mtcars数据集的交叉组样本数量
with(mpg, {tapply(mpg, list(cyl, gear), length)})
##    3  4 5
## 4  1  8 2
## 6  2  4 1
## 8 12 NA 2

★ mpg数据的9个交叉组的样本并不完全一致,并且有一个交叉组没有样本(NA),属于非平衡设计

3.1 最简单的三因素方差分析

aov(mpg ~ cyl + gear + carb, mpg)
## Call:
##    aov(formula = mpg ~ cyl + gear + carb, data = mpg)
## 
## Terms:
##                      cyl     gear     carb Residuals
## Sum of Squares  824.7846   8.2519  88.5199  204.4908
## Deg. of Freedom        2        2        5        22
## 
## Residual standard error: 3.048776
## Estimated effects may be unbalanced
cyl的离差平方和:
fit1 <- lm(mpg ~ cyl, mpg) 
sum((fitted(fit1) - mean(mpg$mpg))^2)
## [1] 824.7846
gear的离差平方和:
fit2 <- lm(mpg ~ cyl + gear, mpg) 
sum((fitted(fit2) - mean(mpg$mpg))^2) - sum((fitted(fit1) - mean(mpg$mpg))^2)
## [1] 8.251855
carb的离差平方和:
fit3 <- lm(mpg ~ cyl + gear + carb, mpg) 
sum((fitted(fit3) - mean(mpg$mpg))^2) - sum((fitted(fit2) - mean(mpg$mpg))^2)
## [1] 88.5199
Residuals的离差平方和:
sum(residuals(fit3)^2)
## [1] 204.4908

I型方差分析可以使用逐步回归的方法计算离差平方和。I型平方和又称为顺序平方和

3.2 含有\color{green}{交互项}的三因素方差分析

aov(mpg ~ cyl*gear + carb, mpg)
## Call:
##    aov(formula = mpg ~ cyl * gear + carb, data = mpg)
## 
## Terms:
##                      cyl     gear     carb cyl:gear Residuals
## Sum of Squares  824.7846   8.2519  88.5199  25.3878  179.1030
## Deg. of Freedom        2        2        5        2        20
## 
## Residual standard error: 2.992516
## 2 out of 14 effects not estimable
## Estimated effects may be unbalanced

aov(terms(mpg ~ cyl*gear + carb, keep.order = T), mpg)
## Call:
##    aov(formula = terms(mpg ~ cyl * gear + carb, keep.order = T), 
##     data = mpg)
## 
## Terms:
##                      cyl     gear cyl:gear     carb Residuals
## Sum of Squares  824.7846   8.2519  23.8907  90.0170  179.1030
## Deg. of Freedom        2        2        3        4        20
## 
## Residual standard error: 2.992516
## 2 out of 14 effects not estimable
## Estimated effects may be unbalanced

★ 默认情况下,交互项会排在所有主效应变量之后;而使用terms()函数可以更改设置。位置不同会导致离差平方和发生变化。

两种情况下cyl:gear的离差平方和:
## 默认情况
fit11 <- lm(mpg ~ cyl + gear + carb, mpg) 
fit22 <- lm(mpg ~ cyl * gear + carb, mpg) 
sum((fitted(fit22) - mean(mpg$mpg))^2) - sum((fitted(fit11) - mean(mpg$mpg))^2) 
## [1] 25.38784
 
## 更改后的情况
fit33 <- lm(mpg ~ cyl + gear, mpg) 
fit44 <- lm(mpg ~ cyl * gear, mpg) 
sum((fitted(fit44) - mean(mpg$mpg))^2) - sum((fitted(fit33) - mean(mpg$mpg))^2) 
## [1] 23.89074

4. 多样本均数间的多重比较

4.1 \color{green}{TukeyHSD}函数

方差分析使用F统计量检验各组样本均值是否存在显著差异:当分组数在三组以上时,只要任意两组样本均值存在显著差异,F统计量就具有显著性。方差分析不会直接反映两两比较的结果

# TukeyHSD函数需要一个aov对象作为输入,并返回一个hsd对象
fit <- aov(mpg ~ cyl + gear + carb, mpg)
summary(fit)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## cyl          2  824.8   412.4  44.367 1.9e-08 ***
## gear         2    8.3     4.1   0.444   0.647    
## carb         5   88.5    17.7   1.905   0.134    
## Residuals   22  204.5     9.3                    
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

levels(mpg$cyl)
## [1] "4" "6" "8"

TukeyHSD(fit, "cyl", ordered = TRUE)
## Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = mpg ~ cyl + gear + carb, data = mpg)
## 
## $cyl
##          diff      lwr       upr     p adj
## 6-8  4.642857 1.097559  8.188155 0.0089999
## 4-8 11.563636 8.477851 14.649422 0.0000000
## 4-6  6.920779 3.217836 10.623722 0.0003137

★ 根据cyl变量分组的样本均值在95%置信水平下存在显著性差异;
★ 使用TukeyHSD()函数输出cyl变量三个水平( "4" "6" "8")对应样本均值的两两比较结果:在95%置信水平下,"6"和"8"组、 "4"和"8"组及"4"和"6"组间存在显著性差异。
★ diff列表示两个样本均值之差,lwr列表示置信区间的下界,upr列表示置信区间的上界,p adj列就是p值

4.2 \color{green}{LSD-t}检验

# install.packages("PMCMRplus")
library(PMCMRplus)

fit <- aov(mpg ~ cyl + gear + carb, mpg)
res <- lsdTest(fit)

summary(res)
## 
##  Pairwise comparisons using Least Significant Difference Test
## 
## data: mpg by cyl by gear by carb
## alternative hypothesis: two.sided
## P value adjustment method: none
## H0
##            t value   Pr(>|t|)    
## 6 - 4 == 0  -4.441 0.00011947 ***
## 8 - 4 == 0  -8.905 8.5682e-10 ***
## 8 - 6 == 0  -3.112 0.00415221  **
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# 结果可视化
plot(res)

★ 与TukeyHSD结果一致:"6"和"8"组、 "4"和"8"组及"4"和"6"组间存在显著性差异。


LSD-t-1

4.3 \color{green}{Dunnett-t}检验

library(PMCMRplus)

res <- dunnettTest(fit)

summary(res)
## 
##  Pairwise comparisons using Dunnett's-test for multiple  
##                          comparisons with one control
## 
## data: mpg by cyl by gear by carb
## alternative hypothesis: two.sided
## P value adjustment method: single-step
## H0
##            t value   Pr(>|t|)    
## 6 - 4 == 0  -4.441 0.00023456 ***
## 8 - 4 == 0  -8.905 1.7071e-09 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## Connected to your session in progress, last started 2023-Apr-11 07:35:16 UTC (2 hours ago)

4.4 \color{green}{SNK-q}检验

library(PMCMRplus)

res <- snkTest(fit)

summary(res)
## 
##  Pairwise comparisons using SNK test
## 
## data: mpg by cyl by gear by carb
## alternative hypothesis: two.sided
## P value adjustment method: step down
## H0
##            q value   Pr(>|q|)    
## 6 - 4 == 0  -6.281 0.00011947 ***
## 8 - 4 == 0 -12.593  2.544e-09 ***
## 8 - 6 == 0  -4.401 0.00415221  **
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

参考:

  1. https://blog.csdn.net/qq_42830713/article/details/127544975
  2. https://blog.csdn.net/weixin_54000907/article/details/128029735
  3. https://blog.csdn.net/weixin_54000907/article/details/128030334

相关文章

网友评论

      本文标题:R语言统计5:方差分析(ANOVA,F- test)

      本文链接:https://www.haomeiwen.com/subject/nrxsddtx.html