设计矩阵与R

第一个例子：

我们想知道mutant小鼠和control小鼠在大小上是否有统计学意义上的差别。我们分别做了这两种小鼠的回归曲线

Type <- factor(c(
  rep("Control", times=4),
  rep("Mutant", times=4)))

Weight <- c(2.4, 3.5, 4.4, 4.9, 1.7, 2.8, 3.2, 3.9)
Size <- c(1.9, 3, 2.9, 3.7, 2.8, 3.3, 3.9, 4.8)
#矩阵的第1列乘以control的截距
#第2列代表的数据乘以mutant与control的差值
#第3列乘以slope
> model.matrix(~Type+Weight)
  (Intercept) TypeMutant Weight
1           1          0    2.4
2           1          0    3.5
3           1          0    4.4
4           1          0    4.9
5           1          1    1.7
6           1          1    2.8
7           1          1    3.2
8           1          1    3.9
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$Type
[1] "contr.treatment"

image-20210109122051890.png

model <- lm(Size~Type + Weight)
> summary(model)

Call:
lm(formula = Size ~ Type + Weight)

#与拟合曲线的残差
Residuals:
       1        2        3        4        5        6        7        8 
 0.05455  0.34562 -0.41623  0.01607 -0.01753 -0.32646 -0.02062  0.36461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.08052    0.52744   0.153  0.88463   
TypeMutant   1.48685    0.26023   5.714  0.00230 **
Weight       0.73539    0.13194   5.574  0.00256 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3275 on 5 degrees of freedom
Multiple R-squared:  0.8975,    Adjusted R-squared:  0.8564 
#F值与p值，这个p值是fancy model 与 simple model对比后的p值，反应了
F-statistic: 21.88 on 2 and 5 DF,  p-value: 0.003367

simple model ，只考虑大小

image-20210109122436799.png

simple model ，现在我们即考虑大小又考虑种类，但是忽略老鼠的体重

image-20210109122742676.png

simple model ，现在我们即考虑大小有考虑体重，但是忽略老鼠的种类

image-20210109122836940.png

结果中Weight这个变量的p值为0.00256，它表示的是，如果在这个复杂模型（fancy model）中剔除Weight这个变量，构建成的简单模型（simpler model）与原始数据的拟合效果。它反映的是一个复杂模型的统计效果和只用常规t检验的统计效果的比较（p值小于0.05，说明前者的统计效果更好）。换句话讲，它反映了复杂模型中剔除了Weight这个变量后，使用最小二乘法进行的检验与原复杂模型检验的比较，由于p值很小，这就说明了Weight这个变量的重要性。

同理，TypeMutant这个变量对应的p值为0.00230，它反映的是原复杂方程（fancy model）与该方程剔除小鼠种类这个变量后的比较（剔除了小鼠类型的这个变量，就变成了普通线性回归方程）。由于p值很小，这就说明了种类这个变量的重要性。

第二个例子：批次效应

Lab A为一个实验，Lab B重复它，但是测量结果全部变小了。我们希望结合这两个数据集，看看mutant是否与control不同，但我们需要进行消除“批次效应”

image-20210109124332045.png

Lab <- factor(c(rep("A",times=6),rep("B",times=6)))
Type <- factor(c(rep("Control", times = 3),rep("Mutant", times = 3),rep("Control", times = 3),rep("Mutant", times = 3)))
Expression <- c(1.7, 2, 2.2,3.1, 3.6, 3.9, 0.9, 1.2, 1.9, 1.8, 2.2, 2.9)
#第1列乘以labA control mean
#第2列乘以labB offset 
#第3列乘以difference
> model.matrix(~Lab+Type)
   (Intercept) LabB TypeMutant
1            1    0          0
2            1    0          0
3            1    0          0
4            1    0          1
5            1    0          1
6            1    0          1
7            1    1          0
8            1    1          0
9            1    1          0
10           1    1          1
11           1    1          1
12           1    1          1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$Lab
[1] "contr.treatment"

attr(,"contrasts")$Type
[1] "contr.treatment"

batch.lm <- lm(Expression ~ Lab + Type)
> summary(batch.lm)

Call:
lm(formula = Expression ~ Lab + Type)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6500 -0.2833 -0.0500  0.2750  0.7167 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.1167     0.2279   9.287  6.6e-06 ***
LabB         -0.9333     0.2632  -3.546 0.006250 ** 
TypeMutant    1.2667     0.2632   4.813 0.000956 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4558 on 9 degrees of freedom
Multiple R-squared:  0.7989,    Adjusted R-squared:  0.7542 
F-statistic: 17.87 on 2 and 9 DF,  p-value: 0.0007342

这个结果跟先前多元回归的结果类似，最后一部分就是F以及调整后的R平方，它的p值为0.0007342，但是这个p值不是我们所需要的，这个p值代表的是复杂方程与一个简单方程（也就是下图中y = overall mean）的统计学差异，这个p值是不能用于判断mutant和control差异的。

image-20210109124828965.png

我们再看一下TypeMutant这个变量对应的p值，这个p值是0.000956，这个才是我们所需要的p值，它表示的是，我们去掉difference(mutant - control)这个变量后，我们是否还能拟合出一条好的曲线（我个人理解就是：原假设就是去掉`TypeMutant后，剩下的数据能够拟合出一条好的曲线方程，但是结果却是p值小于0.05，是一个非常小的值，因此原假设不成立，也就是说这个变量不能去掉）

image-20210109124911693.png