> library(pacman)
> p_load(dplyr, readr, caret)
Building on the previous section's results as baselines, we now try to improve the model: without removing outliers, the MSE was 3619.029 with an adjusted R-squared of 0.8603; after removing outliers, the MSE was 2690.545 with an adjusted R-squared of 0.8706. The two models' test-set MSEs were 2914.014 and 1672.859, respectively.
> results <- tribble(~ model, ~ mse, ~ r_square, ~ test_mse,
+ "original", 3619.029, 0.8603, 2914.014,
+ "remove_out", 2690.545, 0.8706, 1672.859)
> results
## # A tibble: 2 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
1. Data preprocessing
> machine <- read_csv("data_set/machine.data", col_names = F)
> names(machine) <- c("vendor", "model", "myct", "mmin", "mmax",
+ "cach", "chmin", "chmax", "prp", "erp")
> machine <- machine[, 3:9]
>
> set.seed(123)
> ind <- createDataPartition(machine$prp, p = 0.85, list = F)
>
> dtrain <- machine[ind, ]
> dtest <- machine[-ind, ]
2. Shrinking the feature set
> ct <- trainControl(preProcOptions = list(cutoff = 0.75))
> set.seed(123)
> fit.step <- train(prp ~ ., data = dtrain, method = "lmStepAIC",
+ trControl = ct, preProcess = c("corr"), trace = F)
>
> summary(fit.step$finalModel)
##
## Call:
## lm(formula = .outcome ~ myct + mmin + mmax + cach + chmax, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -163.94 -29.68 3.25 28.52 355.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.024e+01 8.909e+00 -6.762 2.01e-10 ***
## myct 5.550e-02 1.998e-02 2.777 0.006084 **
## mmin 1.476e-02 2.006e-03 7.358 7.20e-12 ***
## mmax 5.725e-03 6.919e-04 8.275 3.33e-14 ***
## cach 5.693e-01 1.443e-01 3.944 0.000116 ***
## chmax 1.683e+00 2.301e-01 7.313 9.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.33 on 173 degrees of freedom
## Multiple R-squared: 0.8644, Adjusted R-squared: 0.8605
## F-statistic: 220.6 on 5 and 173 DF, p-value: < 2.2e-16
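Only five of the six predictors appear in the final formula; chmin is absent, dropped either by the correlation pre-filter or by stepAIC itself. The preProcess = c("corr") option is backed by caret::findCorrelation, with the cutoff taken from preProcOptions. A minimal sketch of what that filter flags (my own illustration, not part of the original session):

# flag predictors whose pairwise correlation exceeds the 0.75 cutoff
preds <- dtrain[, setdiff(names(dtrain), "prp")]   # predictors only
high_cor <- findCorrelation(cor(preds), cutoff = 0.75)
names(preds)[high_cor]                             # candidates for removal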
> compute_mse <- function(prediction, actual) {
+ mean((prediction - actual) ^ 2)
+ }
>
> compute_mse(fit.step$finalModel$fitted.values, dtrain$prp)
## [1] 3634.847
> compute_mse(predict(fit.step, newdata = dtest), dtest$prp)
## [1] 2785.94
The stepwise regression results, added to the comparison table:
> results <- bind_rows(results,
+ tibble(model = "step",
+ mse = 3634.847,
+ r_square = 0.8605,
+ test_mse = 2785.94))
> results
## # A tibble: 3 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
## 3 step 3635. 0.860 2786.
Remove the outlier (row 173 of the training set) and rerun the stepwise regression:
> dtrain.new <- dtrain[!(rownames(dtrain) %in% c(173)), ]
> set.seed(123)
> fit.step.out <- train(prp ~ ., data = dtrain.new, method = "lmStepAIC",
+ trControl = ct, preProcess = c("corr"), trace = F)
>
> summary(fit.step.out$finalModel)
##
## Call:
## lm(formula = .outcome ~ myct + mmin + mmax + cach + chmax, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.560 -23.668 2.268 21.691 271.120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.474e+01 7.930e+00 -5.643 6.78e-08 ***
## myct 4.193e-02 1.731e-02 2.422 0.0165 *
## mmin 1.697e-02 1.752e-03 9.690 < 2e-16 ***
## mmax 4.629e-03 6.125e-04 7.557 2.35e-12 ***
## cach 5.968e-01 1.244e-01 4.797 3.48e-06 ***
## chmax 1.168e+00 2.090e-01 5.588 8.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.85 on 172 degrees of freedom
## Multiple R-squared: 0.8702, Adjusted R-squared: 0.8664
## F-statistic: 230.6 on 5 and 172 DF, p-value: < 2.2e-16
> compute_mse(fit.step.out$finalModel$fitted.values, dtrain.new$prp)
## [1] 2698.78
> compute_mse(predict(fit.step.out, newdata = dtest), dtest$prp)
## [1] 1812.763
> results <- bind_rows(results,
+ tibble(model = "step_out",
+ mse = 2698.78,
+ r_square = 0.8664,
+ test_mse = 1812.763))
> results
## # A tibble: 4 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
## 3 step 3635. 0.860 2786.
## 4 step_out 2699. 0.866 1813.
After shrinking the feature set, training-set MSE rose slightly in both cases (step: 3635 vs 3619; step_out: 2699 vs 2691). On the test set, the stepwise model improved on the original full model (2786 vs 2914), but after outlier removal the stepwise model did somewhat worse than its full-feature counterpart (1813 vs 1673).
3. Regularization
Both ridge regression and the lasso are worth trying. The lasso tends to perform better for models that depend on only a subset of the input features, while ridge regression tends to perform better when the coefficient weight is spread across many different variables.
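A small simulation makes this rule of thumb concrete. The sketch below is my own illustration using glmnet directly (not the caret methods used in this section), on simulated data rather than the machine dataset:

# compare ridge (alpha = 0) and lasso (alpha = 1) under two kinds of truth
library(glmnet)
set.seed(123)
n <- 200; p <- 20
x <- matrix(rnorm(2 * n * p), 2 * n, p)        # first n rows train, rest test
beta_sparse <- c(rep(3, 3), rep(0, p - 3))     # a few large coefficients
beta_dense  <- rep(0.6, p)                     # many small coefficients
tr <- 1:n; te <- (n + 1):(2 * n)
for (truth in c("sparse", "dense")) {
  beta <- if (truth == "sparse") beta_sparse else beta_dense
  y <- drop(x %*% beta) + rnorm(2 * n)
  for (a in c(0, 1)) {
    cvfit <- cv.glmnet(x[tr, ], y[tr], alpha = a)
    mse <- mean((predict(cvfit, newx = x[te, ], s = "lambda.min") - y[te])^2)
    cat(sprintf("%s truth, alpha = %g: test MSE %.2f\n", truth, a, mse))
  }
}

On a typical run the lasso wins under the sparse truth and ridge under the dense one, matching the heuristic.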
3.1 Ridge regression
When the dataset's dimensionality is high, especially relative to the number of available observations, linear regression tends to exhibit very high variance.
Ridge regression introduces some bias through its constraint, but in exchange it can effectively reduce the model's variance.
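Concretely, ridge regression minimizes RSS + lambda * sum(beta^2); larger lambda means stronger shrinkage, more bias, and less variance. The sketch below visualizes the shrinkage paths with MASS::lm.ridge. This is an illustration of the objective only; caret's "ridge" method used next is actually backed by the elasticnet package:

library(MASS)
# fit ridge regression over a grid of penalty values
fit <- lm.ridge(prp ~ ., data = dtrain, lambda = seq(0, 10, by = 0.5))
# each row of coef(fit) holds the coefficients at one lambda; plotting
# them shows every coefficient shrinking smoothly toward zero
matplot(fit$lambda, coef(fit)[, -1], type = "l",
        xlab = "lambda", ylab = "coefficient")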
> set.seed(123)
> fit.ridge <- train(prp ~ ., data = dtrain, method = "ridge",
+ trControl = ct, preProcess = c("corr"))
>
> fit.ridge$bestTune
## lambda
## 3 0.1
> fit.ridge$results$Rsquared[3]
## [1] 0.8058767
> compute_mse(predict(fit.ridge, newdata = dtrain), dtrain$prp)
## [1] 3730.474
> compute_mse(predict(fit.ridge, newdata = dtest), dtest$prp)
## [1] 2958.191
> results <- bind_rows(results,
+ tibble(model = "ridge",
+ mse = 3730.474,
+ r_square = 0.8059,
+ test_mse = 2958.191))
> results
## # A tibble: 5 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
## 3 step 3635. 0.860 2786.
## 4 step_out 2699. 0.866 1813.
## 5 ridge 3730. 0.806 2958.
3.2 Lasso regression
The lasso is an alternative regularization method to ridge regression. The difference lies in the penalty term: ridge regression effectively shrinks the coefficients toward smaller values, whereas the lasso minimizes the sum of the absolute values of the coefficients. Because the lasso can shrink some coefficients exactly to zero, it performs both selection and shrinkage, which ridge regression cannot do.
In the elastic-net formulation (as in glmnet), setting alpha = 0 gives ridge regression and alpha = 1 gives the lasso. (Note that caret's "lasso" method below comes from the elasticnet package and tunes fraction, the ratio of the coefficients' L1 norm to that of the full least-squares fit, rather than alpha.)
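The mechanism behind the lasso's exact zeros is easiest to see in the special case of an orthonormal design, where the lasso solution is simply soft-thresholding of the least-squares coefficients. A toy sketch (not from the original session):

# shrink each coefficient toward zero by lambda, clipping at zero;
# this is the lasso solution under an orthonormal design
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(2.0, -0.5, 0.1, 1.2), lambda = 0.3)
# 1.7 -0.2 0.0 0.9: the smallest coefficient is set exactly to zero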
> set.seed(123)
> fit.lasso <- train(prp ~ ., data = dtrain, method = "lasso",
+ trControl = ct, preProcess = c("corr"))
>
> fit.lasso$bestTune
## fraction
## 3 0.9
> fit.lasso$results$Rsquared[3]
## [1] 0.7996164
> compute_mse(predict(fit.lasso, newdata = dtrain), dtrain$prp)
## [1] 3664.031
> compute_mse(predict(fit.lasso, newdata = dtest), dtest$prp)
## [1] 2628.372
The selected tuning value is fraction = 0.9, i.e. the coefficients' L1 norm is held at 90% of the unpenalized least-squares solution, so the chosen lasso applies only mild shrinkage.
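To see what that fraction means for the individual coefficients, the final model can be queried directly; caret's "lasso" method wraps elasticnet::enet, whose predict method accepts a fraction value (a sketch reusing the fit.lasso object above):

# coefficients of the tuned lasso at fraction = 0.9
predict(fit.lasso$finalModel, s = fit.lasso$bestTune$fraction,
        type = "coefficients", mode = "fraction")$coefficients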
> results <- bind_rows(results,
+ tibble(model = "lasso",
+ mse = 3664.031,
+ r_square = 0.7996,
+ test_mse = 2628.372))
> results
## # A tibble: 6 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
## 3 step 3635. 0.860 2786.
## 4 step_out 2699. 0.866 1813.
## 5 ridge 3730. 0.806 2958.
## 6 lasso 3664. 0.800 2628.
Comparing all six models, the linear regression fit on the data with outliers removed (remove_out) performs best, with the lowest test-set MSE (1673).
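A quick sanity check on that conclusion is to sort the comparison table by test-set MSE:

# remove_out should come out on top with the lowest test MSE
arrange(results, test_mse)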