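The caret package (short for Classification And REgression Training) provides functions that streamline model training for complex regression and classification problems.
1. Load and split the data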
> library(pacman)
> p_load(caret,mlbench,dplyr)
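> # Use the Sonar dataset
> data("Sonar")
>
> # Split into training and test sets
> # By default, createDataPartition performs a stratified random split
> set.seed(123)
> ind <- createDataPartition(y = Sonar$Class,
+ # proportion of the data assigned to the training set
+ p = 0.75,
+ list = FALSE)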
> train <- Sonar[ind,]
> test <- Sonar[-ind,]
> dim(train)
## [1] 157 61
> dim(test)
## [1] 51 61
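The 157/51 split follows from sampling within each class rather than from the data as a whole. As a rough base-R sketch of that idea (not caret's actual implementation), assume the split takes `ceiling(p * n)` rows from each class. The labels below are synthetic and only mimic Sonar's class counts (111 M, 97 R), and `strat_idx` is a hypothetical name:

```r
# Synthetic labels with the same class counts as Sonar (111 M, 97 R)
y <- factor(rep(c("M", "R"), times = c(111, 97)))
p <- 0.75

set.seed(123)
# Within each class, sample ceiling(p * n_class) row indices
# (the rounding rule is an assumption of this sketch)
strat_idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
  sample(rows, size = ceiling(p * length(rows)))
}))

length(strat_idx)     # 84 + 73 = 157 training rows
table(y[strat_idx])   # M: 84, R: 73
table(y[-strat_idx])  # M: 27, R: 24
```

Under that assumption the counts agree with `dim(train)` and `dim(test)` above, and both sets keep roughly the same M/R ratio as the full data.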
2. Build and train the model
Tune a PLSDA (partial least squares discriminant analysis) model over the number of PLS components:
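> # Change the resampling method
> # method sets the type of resampling; the default is "boot"
> # Another option, "repeatedcv", specifies repeated k-fold cross-validation (the repeats argument sets the number of repetitions)
> # K is set by the number argument and defaults to 10
>
> # To choose different performance measures, the summaryFunction argument takes a function that receives the observed and predicted values and estimates some measure of performance
> # The package already includes two such functions: defaultSummary and twoClassSummary.
> # The latter computes measures specific to two-class problems, such as the area under the ROC curve, sensitivity, and specificity
> # Because the ROC curve is based on predicted class probabilities (which are not computed automatically),
> # the classProbs = TRUE option is needed to include them.
> ctrl <- trainControl(method = "repeatedcv",repeats = 3,
+ classProbs = TRUE,
+ summaryFunction = twoClassSummary)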
> set.seed(123)
> fit.pls <- train(Class ~ .,
+ data = train,
+ method = "pls",
+ preProc = c("center","scale"),
+ # train generates a set of candidate parameter values; tuneLength controls how many
+ tuneLength = 15,
+ trControl = ctrl,
+ # use ROC as the performance metric
+ metric = "ROC")
> fit.pls
## Partial Least Squares
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 141, 141, 142, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## ncomp ROC Sens Spec
## 1 0.8060764 0.7185185 0.6785714
## 2 0.8641617 0.7569444 0.7958333
## 3 0.8516121 0.7685185 0.7702381
## 4 0.8457672 0.7444444 0.7755952
## 5 0.8350777 0.7481481 0.7517857
## 6 0.8200149 0.7435185 0.7482143
## 7 0.8125992 0.7430556 0.7339286
## 8 0.8144180 0.7712963 0.7386905
## 9 0.8095238 0.7500000 0.7380952
## 10 0.8084160 0.7578704 0.7238095
## 11 0.8086310 0.7578704 0.7279762
## 12 0.8019924 0.7509259 0.7232143
## 13 0.7945271 0.7430556 0.7333333
## 14 0.7933780 0.7629630 0.7339286
## 15 0.7904679 0.7629630 0.7339286
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 2.
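The output shows the performance estimates averaged over the resamples. In the end, two PLS components (ncomp = 2) were selected as optimal, as the plot shows: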
> ggplot(fit.pls)
![](https://img.haomeiwen.com/i20267488/9c494e907dac019b.png)
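The selection rule reported in the output ("largest value") can be reproduced by hand. A minimal sketch, using the mean ROC values copied from the `fit.pls` printout (the variable names `roc`, `ncomp`, and `best` are just for illustration):

```r
# Mean resampled ROC for ncomp = 1..15, copied from the fit.pls output
roc <- c(0.8060764, 0.8641617, 0.8516121, 0.8457672, 0.8350777,
         0.8200149, 0.8125992, 0.8144180, 0.8095238, 0.8084160,
         0.8086310, 0.8019924, 0.7945271, 0.7933780, 0.7904679)
ncomp <- seq_along(roc)

# Pick the candidate with the highest mean ROC
best <- ncomp[which.max(roc)]
best  # 2
```

ROC peaks at two components and then declines steadily, which suggests that additional components mostly add noise for this data.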
3. Prediction
Use the predict function to predict new samples.
> # By default (type = "raw"), predict returns the predicted classes
> pls.class <- predict(fit.pls, newdata = test)
> str(pls.class)
## Factor w/ 2 levels "M","R": 1 2 2 2 2 1 2 2 2 2 ...
> # Return class probabilities
> pls.prob <- predict(fit.pls, newdata = test, type = "prob")
> head(pls.prob)
## M R
## 2 0.5444070 0.4555930
## 6 0.4030138 0.5969862
## 12 0.4371093 0.5628907
## 15 0.4101486 0.5898514
## 18 0.3932332 0.6067668
## 28 0.5213515 0.4786485
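The two kinds of prediction are consistent with each other: for the six rows shown, taking the class with the larger probability gives the same labels as the first six entries of `pls.class`. A small sketch using the numbers copied from `head(pls.prob)` (the names `probs`, `winner`, and `pred` are illustrative):

```r
# Class probabilities for the first six test rows, copied from head(pls.prob)
probs <- data.frame(
  M = c(0.5444070, 0.4030138, 0.4371093, 0.4101486, 0.3932332, 0.5213515),
  R = c(0.4555930, 0.5969862, 0.5628907, 0.5898514, 0.6067668, 0.4786485),
  row.names = c("2", "6", "12", "15", "18", "28")
)

# For each row, find the column holding the larger probability
winner <- apply(as.matrix(probs), 1, which.max)
pred <- factor(colnames(probs)[winner], levels = c("M", "R"))

as.integer(pred)  # 1 2 2 2 2 1, the same codes str(pls.class) starts with
```

Working from probabilities also makes it possible to apply a cutoff other than 0.5 on `probs$M` when one type of error is more costly than the other.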
4. Performance evaluation
The confusionMatrix function computes the model's confusion matrix and related statistics.
> confusionMatrix(data = pls.class, test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 19 7
## R 8 17
##
## Accuracy : 0.7059
## 95% CI : (0.5617, 0.8251)
## No Information Rate : 0.5294
## P-Value [Acc > NIR] : 0.007812
##
## Kappa : 0.4111
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.7037
## Specificity : 0.7083
## Pos Pred Value : 0.7308
## Neg Pred Value : 0.6800
## Prevalence : 0.5294
## Detection Rate : 0.3725
## Detection Prevalence : 0.5098
## Balanced Accuracy : 0.7060
##
## 'Positive' Class : M
##
The model's accuracy on the test set is 70.59%.
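Most of the statistics in that printout are simple ratios of the four cell counts. As a sketch in base R, with the counts copied from the confusion matrix above and M treated as the positive class (`kappa_hat` and the other variable names are illustrative):

```r
# Confusion matrix from the output: rows = prediction, columns = reference
cm <- matrix(c(19, 8, 7, 17), nrow = 2,
             dimnames = list(Prediction = c("M", "R"),
                             Reference  = c("M", "R")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n              # 36 / 51
sensitivity <- cm["M", "M"] / sum(cm[, "M"])  # 19 / 27
specificity <- cm["R", "R"] / sum(cm[, "R"])  # 17 / 24
ppv         <- cm["M", "M"] / sum(cm["M", ])  # 19 / 26
npv         <- cm["R", "R"] / sum(cm["R", ])  # 17 / 25

# Cohen's kappa: observed agreement corrected for chance agreement
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        kappa = kappa_hat,
        balanced = (sensitivity + specificity) / 2), 4)
```

The rounded values come out as 0.7059, 0.7037, 0.7083, 0.7308, 0.6800, 0.4111, and 0.7060, matching the confusionMatrix printout.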
5. Fit another model
Fit a regularized discriminant analysis (RDA) model to the same data.
> grid.rda <- data.frame(gamma = (0:4)/4, lambda = 3/4)
> set.seed(123)
> fit.rda <- train(Class ~ ., data = train,
+ method = "rda",
+ tuneGrid = grid.rda,
+ trControl = ctrl,
+ metric = "ROC")
>
> fit.rda
## Regularized Discriminant Analysis
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 141, 141, 142, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## gamma ROC Sens Spec
## 0.00 0.8156250 0.7606481 0.7392857
## 0.25 0.8851852 0.8796296 0.7166667
## 0.50 0.8832011 0.8972222 0.6833333
## 0.75 0.8663608 0.8842593 0.6654762
## 1.00 0.7288360 0.6972222 0.6309524
##
## Tuning parameter 'lambda' was held
## constant at a value of 0.75
## ROC was used to select the optimal
## model using the largest value.
## The final values used for the model
## were gamma = 0.25 and lambda = 0.75.
The final selected parameters are gamma = 0.25 and lambda = 0.75.
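The tuning grid used here fixes lambda, which is why the output reports it as held constant. As a sketch of the alternative, `expand.grid` builds every combination of two parameter vectors. The lambda values below are arbitrary and this larger grid was not run in the original analysis:

```r
# The grid used above: gamma varies, lambda is recycled to a constant 0.75
grid_fixed <- data.frame(gamma = (0:4) / 4, lambda = 3 / 4)
nrow(grid_fixed)  # 5

# Hypothetical alternative: vary both parameters (lambda values chosen arbitrarily)
grid_full <- expand.grid(gamma = (0:4) / 4, lambda = c(0.25, 0.5, 0.75))
nrow(grid_full)   # 15
```

Passing a grid like `grid_full` as `tuneGrid` would triple the number of candidate models, so training time grows accordingly.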
> rda.class <- predict(fit.rda, newdata = test)
> confusionMatrix(rda.class, test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 23 10
## R 4 14
##
## Accuracy : 0.7255
## 95% CI : (0.5826, 0.8411)
## No Information Rate : 0.5294
## P-Value [Acc > NIR] : 0.003347
##
## Kappa : 0.4413
##
## Mcnemar's Test P-Value : 0.181449
##
## Sensitivity : 0.8519
## Specificity : 0.5833
## Pos Pred Value : 0.6970
## Neg Pred Value : 0.7778
## Prevalence : 0.5294
## Detection Rate : 0.4510
## Detection Prevalence : 0.6471
## Balanced Accuracy : 0.7176
##
## 'Positive' Class : M
This model's accuracy is 72.55%, slightly higher than the first model's.
6. Compare the two models
The resamples function collects, summarizes, and compares the resampling results.
> resamps <- resamples(list(pls = fit.pls, rda = fit.rda))
> summary(resamps)
## Call:
## summary.resamples(object = resamps)
##
## Models: pls, rda
## Number of resamples: 30
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.6666667 0.7916667 0.8611111 0.8641617 0.9350198 1 0
## rda 0.6964286 0.8571429 0.8888889 0.8851852 0.9305556 1 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.375 0.6666667 0.7777778 0.7569444 0.8750000 1 0
## rda 0.625 0.8750000 0.8888889 0.8796296 0.8888889 1 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.4285714 0.7142857 0.8571429 0.7958333 0.8750000 1 0
## rda 0.4285714 0.6250000 0.7142857 0.7166667 0.8571429 1 0
Visualize the comparison:
> xyplot(resamps, what = "BlandAltman")
![](https://img.haomeiwen.com/i20267488/1808f28a3a8aef34.png)
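The results look similar. Because each resample yields a pair of results, one per model, a paired t-test can assess whether the average resampled area under the ROC curve differs between the models. The diff.resamples function (called through diff) computes this: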
> diffs <- diff(resamps)
> summary(diffs)
## Call:
## summary.diff.resamples(object = diffs)
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## ROC
## pls rda
## pls -0.02102
## rda 0.1584
##
## Sens
## pls rda
## pls -0.1227
## rda 1.922e-05
##
## Spec
## pls rda
## pls 0.07917
## rda 0.02049
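The rda model has higher ROC and Sens than the pls model but lower Spec. For ROC, the p-value of the difference is 0.1584 > 0.05, so the null hypothesis cannot be rejected: the resampling results give no statistically significant evidence that the two models differ in ROC. This does not prove that the models perform identically. The differences in Sens (p = 1.922e-05) and Spec (p = 0.02049), by contrast, are significant at the 0.05 level.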
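To make the paired t-test described above concrete, here is a minimal base-R sketch on made-up numbers. The vectors `roc_a` and `roc_b` are hypothetical per-resample ROC values, not the values behind the output above, and the assumption that the Bonferroni adjustment counts one comparison per metric for two models is mine:

```r
# Hypothetical per-resample ROC values for two models on the SAME 30 resamples
set.seed(1)
roc_a <- runif(30, min = 0.70, max = 0.95)
roc_b <- roc_a + rnorm(30, mean = 0.02, sd = 0.05)

# Paired t-test: tests whether the mean of the differences (a - b) is zero
tt <- t.test(roc_a, roc_b, paired = TRUE)

# Bonferroni adjustment; with a single comparison the p-value is unchanged
p_adj <- p.adjust(tt$p.value, method = "bonferroni", n = 1)

unname(tt$estimate)  # mean difference a - b
p_adj
```

A paired test is equivalent to a one-sample t-test on `roc_a - roc_b`, which is why the pairing matters: it removes the resample-to-resample variation that both models share.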