PH525x series - Linear models in

作者: 3between7 | 来源:发表于2019-11-28 10:15 被阅读0次

PH525x series - Exercises - Line
线性回归模型
PH525x series - Linear models in
PH525x series - Linear Algebra E
BA: Predictive Data
机器学习-正则化线性模型实例
DLNM时间序列数据分析（2017年1月版）“Example 2
Kaggle time series tutorial solu
15 Mar 2016 - Tasks Memo
机器学习基石第十一节

在小鼠体重的例子中，使用线性模型与使用t检验得到的结果是相同的，为了说明，我们先来回顾下不同饲料的小鼠体重数据的分布情况：

dat <- read.csv("femaleMiceWeights.csv") ##previously downloaded
stripchart(dat$Bodyweight ~ dat$Diet, vertical=TRUE, method="jitter",
           main="Bodyweight over Diet")

mice.png

虽然两组数据有重叠，但是饲喂高脂饲料的小鼠体重均值是高于对照组的。

lm函数中的数学计算过程

lm()函数为了建模，首先会生成一个设计矩阵 $\mathbf{X}$ ，然后去计算可以最小化RSS的 $\hat β$ ， $\hat β$ 的计算公式如下：
$\hat\beta = (X^TX)^{-1}X^TY$

在R中使用矩阵相乘符号%*%、逆函数solve()和转置函数t()就可以计算出 $\hat β$ ：

Y <- dat$Bodyweight
X <- model.matrix(~ Diet, data=dat)
solve(t(X) %*% X) %*% t(X) %*% Y

##                  [,1]
## (Intercept) 23.813333
## Diethf       3.020833

其中，系数Intercept、Diethf的实际意义分别是对照组的平均值、两组体重均值之差：

s <- split(dat$Bodyweight, dat$Diet)
mean(s[["chow"]])
## [1] 23.81333
mean(s[["hf"]]) - mean(s[["chow"]])
## [1] 3.020833

intercept.png

然后我们使用lm()再试一下：

fit <- lm(Bodyweight ~ Diet ,data=dat)
summary(fit)

## 
## Call:
## lm(formula = Bodyweight ~ Diet, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1042 -2.4358 -0.4138  2.8335  7.1858 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23.813      1.039  22.912   <2e-16 ***
## Diethf         3.021      1.470   2.055   0.0519 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.6 on 22 degrees of freedom
## Multiple R-squared:  0.1611, Adjusted R-squared:  0.1229 
## F-statistic: 4.224 on 1 and 22 DF,  p-value: 0.05192

(coefs <- coef(fit))

## (Intercept)      Diethf 
##   23.813333    3.020833

线性模型 vs t检验

如果假设群体方差一致，那么使用t检验得到的结果将和前文fit中的结果一致，因为在这个线性模型中假设了误差ε的分布相同：

summary(fit)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 23.813333   1.039353 22.911684 7.642256e-17
## Diethf       3.020833   1.469867  2.055174 5.192480e-02

ttest <- t.test(s[["hf"]], s[["chow"]], var.equal=TRUE)
summary(fit)$coefficients[2,3]
## [1] 2.055174
ttest$statistic
##        t 
## 2.055174

从上述的结果中可以看到，t检验得到的t值和p值与使用线性模型得到的的确都是相等的。