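Cross-validation does not return a model. A call to cross_val_score builds several models internally, but the purpose of cross-validation is only to evaluate how well a given algorithm generalizes when trained on a particular dataset.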
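As a minimal sketch of the point above, the snippet below checks that the estimator passed to cross_val_score is still unfitted afterwards. The variable name `still_unfitted` and the choice of `max_iter=1000` are illustrative assumptions, not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
logreg = LogisticRegression(max_iter=1000)  # assumed setting, chosen to avoid convergence warnings

# cross_val_score fits internal copies of the estimator, one per fold.
scores = cross_val_score(logreg, X, y, cv=5)

# The estimator we passed in was never fitted, so predict() raises.
try:
    logreg.predict(X[:1])
    still_unfitted = False
except NotFittedError:
    still_unfitted = True

print("Scores per fold:", scores)
print("Original estimator still unfitted:", still_unfitted)

# To obtain a usable model, fit it explicitly.
logreg.fit(X, y)
```

In practice this means cross-validation is used to compare algorithms or settings, and the chosen model is then trained separately.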
1. Cross-validation in scikit-learn
Parameters of the cross_val_score function:
- the model to evaluate
- the training data
- the ground-truth labels
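2. Stratified k-fold cross-validation and other strategies
The cv parameter sets the number of folds that cross_val_score uses. You can also pass a cross-validation splitter as cv to control the data splitting in finer detail.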
- k-fold: KFold (stratified k-fold: StratifiedKFold)
- leave-one-out: LeaveOneOut
- shuffle-split: ShuffleSplit / StratifiedShuffleSplit
- grouped cross-validation: GroupKFold
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GroupKFold
from sklearn.datasets import make_blobs
iris=load_iris()
print('Iris labels:\n{}'.format(iris.target))
# a higher max_iter helps the solver converge on the unscaled iris features
logreg=LogisticRegression(max_iter=1000)
scores=cross_val_score(logreg,iris.data,iris.target,cv=5)
print('Cross-validation scores:{}'.format(scores))
print('Average cross-validation score:{:.2f}'.format(scores.mean()))
kfold=KFold(n_splits=3,shuffle=True,random_state=0)
print('Cross-validation scores:\n{}'.format(cross_val_score(logreg,iris.data,iris.target,cv=kfold)))
# Leave-one-out cross-validation
loo=LeaveOneOut()
print(len(iris.data))
scores=cross_val_score(logreg,iris.data,iris.target,cv=loo)
print('Number of cv iterations:',len(scores))
print('Mean accuracy:{:.2f}'.format(scores.mean()))
# Shuffle-split cross-validation
shuffle_split=ShuffleSplit(test_size=.5,train_size=.5,n_splits=10)
scores=cross_val_score(logreg,iris.data,iris.target,cv=shuffle_split)
print('Cross-validation scores:\n{}'.format(scores))
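# Grouped cross-validation
X,y=make_blobs(n_samples=12,random_state=0)
# the first three samples belong to group 0, the next four to group 1, and so on
groups=[0,0,0,1,1,1,1,2,2,3,3,3]
# groups must be passed by keyword in recent scikit-learn versions
scores=cross_val_score(logreg,X,y,groups=groups,cv=GroupKFold(n_splits=3))
print('Cross-validation scores:\n{}'.format(scores))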
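To illustrate what the grouped split guarantees, the sketch below iterates over the GroupKFold splits directly and records whether any group appears on both sides. The `overlaps` list is an illustrative name and not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GroupKFold

X, y = make_blobs(n_samples=12, random_state=0)
groups = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3])

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=groups):
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    # A group shared by both sides would leak information into the test fold.
    overlaps.append(train_groups & test_groups)
    print("train groups:", sorted(train_groups),
          "| test groups:", sorted(test_groups))
```

Each group lands entirely in either the training or the test part of a split, which is the property needed when samples from the same group (for example the same patient or speaker) are highly correlated.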