1. 交叉验证基本介绍

通常在建立模型后需要使用外部进行验证，以评估模型的外部可用性。然而，获取外部数据并不容易，这时交叉验证（Cross Validation）则是一种较好的可替代方案。交叉验证的方法很多，这里我们介绍最常用的k折交叉验证。

简单解释一下：
假如原始数据为100例患者，建模后我们使用K折交叉验证（K一般取3-10折）。
若取K=10，则：将原始数据分为10份（K值）：
9份做训练，1份做验证，这样循环10次（因为这样的组合有10种）。
取10次评价指标的平均值（如AUC值）作为模型的验证结果。即【数据分为K个子集，1个子集做验证，K-1个子集做训练，这样的取法有K种，所以会产生K个新数据集（训练集+训练集）。我们进行K次训练和验证得到的平均评价指标可能较为准确
为了保证数据分割的影响，目前的k折交叉验证一般会进行多次重复（200-1000次）。即进行200次的10折交叉验证。这样做出的结果可能会更加准确。
如文献所示

2.交叉验证的原理图

3.R Scripts

3.1数据示例

library("caret")
library(pROC)
bc<-read.csv("E:/r/test/buyunzheng.csv",sep=',',header=TRUE)

数据有8个指标，最后两个是PSM匹配结果，我们不用理他，其余六个为：
Education：教育程度，age:年龄，parity产次，induced：人流次数，case：是否不孕，这是结局指标，spontaneous：自然流产次数。有一些变量是分类变量，我们需要把它转换一下

bc$education<-ifelse(bc$education=="0-5yrs",0,ifelse(bc$education=="6-11yrs",1,2))
bc$spontaneous<-as.factor(bc$spontaneous)
bc$case<-as.factor(bc$case)
bc$induced<-as.factor(bc$induced)
bc$education<-as.factor(bc$education)

使用caret包的createFolds函数根据结果变量把数据分成10份，因为是随机分配，为了可重复性，我们先设立一个种子

set.seed(666)
folds <- createFolds(y=bc$case,k=10)###分成10份

接下来我们可以根据每一份的数据建立方程，并求出AUC，95%CI和截点，并且画出ROC曲线。
我们先来做第一个数据的，要提取列表的数据，需要做成[[1]]这种形式

fold_test <- bc[folds[[1]],]#取fold 1数据，建立测试集和验证集
fold_train <- bc[-folds[[1]],]#

建立预测模型

fold_pre <- glm(case ~ age  + parity +spontaneous,
                family = binomial(link = logit), data =fold_train )###建立模型
fold_predict <- predict(fold_pre,type='response',newdata=fold_test)##生成预测值
roc1<-roc((fold_test[,5]),fold_predict)
round(auc(roc1),3)##AUC
round(ci(roc1),3)##95%CI

绘制ROC曲线

plot(roc1, print.auc=T, auc.polygon=T, grid=c(0.1, 0.2),
     grid.col=c("green", "red"), max.auc.polygon=T,
     auc.polygon.col="skyblue", 
     print.thres=T)

嫌一个一个做比较麻烦的话我们也可以做成循环，一次跑完结果
先建立一个auc的空值，不然跑不了

auc_value<-as.numeric()###建立空值
for(i in 1:10){
  fold_test <- bc[folds[[i]],] #取folds[[i]]作为测试集
  fold_train <- bc[-folds[[i]],] # 剩下的数据作为训练集
  fold_pre <- glm(case ~ age  + parity +spontaneous,
                  family = binomial(link = logit), data =fold_train )
  fold_predict <- predict(fold_pre,type='response',newdata=fold_test)
  auc_value<- append(auc_value,as.numeric(auc(as.numeric(fold_test[,5]),fold_predict)))
}
#找到最大AUC的索引，用该索引对应的数据作为测试集，其它数据作为训练集即可训练出AUC最高的模型
max.index = which(auc_value==max(auc_value),arr.ind=TRUE)
fold_test <- bc[folds[[max.index]],] 
fold_train <- bc[-folds[[max.index]],]