GridSearchCV: A Tool for Model Hyperparameter Tuning

Author: taon | Published: 2019-07-24 22:27

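Today we look at GridSearchCV, a very important tool in machine learning that is commonly used to find the best combination of hyperparameters. For example, a random forest has many parameters, such as n_estimators, max_depth and min_samples_split, and we want to compare model performance under different combinations of them in order to pick the best one. GridSearchCV automatically builds every combination of the candidate values we supply, evaluates each combination with cross-validation, and selects the combination with the best score.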
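To make concrete what is being automated, here is a minimal sketch of the same search written by hand with itertools.product and cross_val_score. The candidate values, the variable names and random_state=0 are illustrative assumptions, not part of the original example.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Candidate values for each hyperparameter (illustrative, not recommendations).
candidates = {"n_estimators": [10, 50], "max_depth": [2, 3]}

manual_results = []
for n_estimators, max_depth in product(
    candidates["n_estimators"], candidates["max_depth"]
):
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    # Mean accuracy over 5 cross-validation folds for this combination.
    mean_score = cross_val_score(model, X, y, cv=5).mean()
    manual_results.append(
        ({"n_estimators": n_estimators, "max_depth": max_depth}, mean_score)
    )

# Keep the combination with the highest mean cross-validated score.
manual_best_params, manual_best_score = max(manual_results, key=lambda item: item[1])
print(manual_best_params, manual_best_score)
```

GridSearchCV wraps this loop, the bookkeeping of scores, and the final refit on the full data into a single fit call.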

GridSearchCV API Documentation

sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score=False)
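estimator: the model to tune, such as KNN or SVM. Any parameters that are not being searched can be set on the estimator when it is created.
param_grid: the candidate values for the parameters to optimize, given as a dict mapping parameter names to lists of values, or as a list of such dicts.
scoring: the evaluation metric. By default the estimator's own score method is used. We can also specify a metric by name, for example scoring='roc_auc', 'recall' or 'f1'; which one to use depends on the task and on the estimator.
n_jobs: the number of jobs to run in parallel. The default None normally means 1; -1 uses all CPU cores.
cv: the cross-validation setting, for example an integer number of folds. With cv='warn' as in the signature above, 3 folds are used; from scikit-learn 0.22 on the default is 5 folds. It can also be set explicitly.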
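The following sketch shows these parameters together, under some assumptions of mine: an SVC estimator, a param_grid given as a list of dicts, and scoring='f1_macro' (chosen because iris has three classes, so plain 'f1' would not apply). The specific candidate values are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A list of dicts: each dict is its own grid, so gamma is only
# combined with the RBF kernel and never with the linear one.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
]

search = GridSearchCV(
    SVC(),
    param_grid=param_grid,
    scoring="f1_macro",  # macro-averaged F1 for the 3-class iris labels
    cv=5,                # 5-fold cross-validation, set explicitly
    n_jobs=1,            # set to -1 to use all CPU cores
)
search.fit(X, y)

# 3 linear combinations + 3 * 2 RBF combinations were evaluated.
print(len(search.cv_results_["params"]))
print(search.best_params_)
```

Setting cv explicitly, as here, avoids depending on the default, which differs between scikit-learn versions.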

GridSearchCV Example

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

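# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Instantiate the model to tune
RF = RandomForestClassifier()
# Define the candidate values for each parameter
params = {'n_estimators': [1, 10, 50, 100], 'max_depth': [1, 2, 3, 4]}
# Pass the model and the parameter grid to GridSearchCV
grid = GridSearchCV(RF, param_grid=params, cv=5)
# Fit on the dataset; every parameter combination is cross-validated
grid.fit(X, y)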

# The best model, refit on the whole dataset
grid.best_estimator_
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
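# The best parameter combination
grid.best_params_
{'max_depth': 3, 'n_estimators': 10}
# The best mean cross-validated score (it can vary between runs because random_state is not set)
grid.best_score_
0.9733333333333334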
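Beyond the three attributes above, the fitted object also keeps the score of every combination in cv_results_. The sketch below, which uses a smaller grid and random_state=0 as assumptions of mine to keep it quick and repeatable, ranks all combinations and then predicts with the refit best model.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Smaller grid than in the article, purely to keep the sketch fast.
params = {"n_estimators": [10, 50], "max_depth": [2, 3]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=params, cv=5)
grid.fit(X, y)

# cv_results_ is a dict of parallel arrays, one entry per combination.
results = grid.cv_results_
ranked = sorted(
    zip(results["rank_test_score"], results["mean_test_score"], results["params"]),
    key=lambda row: row[0],
)
for rank, mean_score, combo in ranked:
    print(rank, round(mean_score, 4), combo)

# With the default refit=True, predict() delegates to best_estimator_.
predictions = grid.predict(X[:5])
print(predictions)
```

Looking at the full ranking is useful when several combinations score almost the same, since the simpler or cheaper model may then be preferable to the nominal winner.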

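Summary: Tuning a machine learning model usually means adjusting several parameters. Done by hand, each parameter has to be tuned one at a time with cross-validation, and values that are best individually may still fall short of what we need once they are combined, so the whole process is slow and laborious.
The example above passes only two parameters, n_estimators and max_depth, with four candidate values each: params = {'n_estimators':[1,10,50,100],'max_depth':[1,2,3,4]}. GridSearchCV automatically forms all 4 × 4 = 16 combinations, cross-validates each one, and picks the best parameters by comparing the scores, which saves a great deal of manual effort. Because the search is exhaustive, its running time grows with the product of the number of candidate values per parameter.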
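To estimate the cost of a grid before running it, the combinations can be enumerated with ParameterGrid from sklearn.model_selection. This sketch reuses the grid and cv=5 from the example; the variable names and the fit-count arithmetic are mine, and the extra fit assumes the default refit=True.

```python
from sklearn.model_selection import ParameterGrid

params = {"n_estimators": [1, 10, 50, 100], "max_depth": [1, 2, 3, 4]}

# Every combination GridSearchCV would try for this grid.
combinations = list(ParameterGrid(params))

n_folds = 5
# One fit per combination per fold, plus one final refit on the full data.
n_fits = len(combinations) * n_folds + 1

print(len(combinations))  # 16
print(n_fits)             # 81
```

Adding a third parameter with four candidate values would multiply the number of combinations by four, which is why grids are usually kept small.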