1. Building a random forest regression model with scikit-learn's RandomForestRegressor
sklearn.ensemble.RandomForestRegressor — scikit-learn 1.0.2 documentation
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)
model.fit(X, y)
# predict() expects a 2D array-like: one row of 10 feature values
print(model.predict([list(range(10))]))
- Inspect the model's parameters
>>> model.get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 0,
'verbose': 0,
'warm_start': False}
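Before tuning, it helps to record a baseline score on held-out data, since the model above is fit and used on the same samples. A minimal sketch (the train/test split and variable names are illustrative, not from the original example):
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=1000, n_features=10,
                       random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
baseline = RandomForestRegressor(random_state=0)
baseline.fit(X_train, y_train)
# score() returns R^2 for regressors; this held-out value is the number to beat
print(baseline.score(X_test, y_test))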
2. Grid search with scikit-learn's GridSearchCV()
- 2.1 GridSearchCV() parameters
- Exhaustive search over specified parameter values for an estimator
class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
Parameters:
estimator: the model, usually a scikit-learn estimator; a custom model must implement the required methods (e.g. fit)
param_grid: the parameter space to search
scoring: the metric used to score the held-out fold during cross-validation
n_jobs: number of jobs to run in parallel; -1 means all processors
refit: whether to refit the estimator on the whole dataset with the best parameter combination once the search finishes
cv: the number of cross-validation splits (folds)
verbose: controls message printing; >1: the computation time and candidate parameters for each fold are displayed; >2: the score is also displayed; >3: fold and candidate parameter indexes are also displayed, together with the start time of the computation
pre_dispatch: number of jobs dispatched during parallel execution; default='2*n_jobs'
error_score: the value assigned to the score if an error occurs while fitting the estimator
return_train_score: whether to include training-set scores in the results
Attributes:
cv_results_: details and scores for every parameter combination tried
best_estimator_: the best-performing model; requires refit=True
best_score_: the mean cross-validated score of best_estimator_
best_params_: the parameter combination of best_estimator_
......
- 2.2 GridSearchCV() example
from sklearn.model_selection import GridSearchCV
X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)
param_grid = {'criterion': ['squared_error', 'absolute_error', 'poisson'],
              'n_estimators': range(10, 12),
              'max_depth': [5, 10, 15],
              'min_samples_leaf': [2, 3, 5]}
model_grid = GridSearchCV(model, param_grid=param_grid, cv=6)
model_grid.fit(X, y)
print(model_grid.best_estimator_, model_grid.best_params_, model_grid.best_score_, sep="\n")
RandomForestRegressor(max_depth=15, min_samples_leaf=2, random_state=0)
{'min_samples_leaf': 2, 'max_depth': 15, 'criterion': 'squared_error'}
0.715165941558444
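best_params_ only reports the winning combination; cv_results_ holds the scores of every combination tried. A minimal sketch of inspecting it with pandas (assuming the fitted model_grid from the example above):
import pandas as pd
# cv_results_ is a dict of parallel arrays; each row describes one combination
results = pd.DataFrame(model_grid.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())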
3. Randomized search with scikit-learn's RandomizedSearchCV()
RandomizedSearchCV() is used much like GridSearchCV(); two parameters deserve attention: param_distributions and n_iter.
- The parameter-space argument has a different name (param_distributions instead of param_grid);
- Because the search is random, you must specify how many parameter combinations to sample from the space (n_iter).
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)
param_grid = {'criterion': ['squared_error', 'absolute_error', 'poisson'],
              'n_estimators': range(10, 12),
              'max_depth': [5, 10, 15],
              'min_samples_leaf': [2, 3, 5]}
model_grid = RandomizedSearchCV(model, param_distributions=param_grid, cv=6, n_iter=10)
model_grid.fit(X, y)
print(model_grid.best_estimator_, model_grid.best_params_, model_grid.best_score_, sep="\n")
RandomForestRegressor(max_depth=5, min_samples_leaf=2, n_estimators=11,
random_state=0)
{'n_estimators': 11, 'min_samples_leaf': 2, 'max_depth': 5, 'criterion': 'squared_error'}
0.6628360955299456
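The example above samples from the same discrete lists a grid search would use; the real strength of randomized search is that param_distributions also accepts continuous distributions. A minimal sketch with scipy.stats (the distributions and bounds here are illustrative choices):
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
param_distributions = {
    'n_estimators': randint(10, 100),   # integers drawn from [10, 100)
    'max_depth': randint(3, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9),  # fraction of features in [0.1, 1.0)
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions=param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)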
4. Bayesian hyperparameter tuning
- BayesianOptimization: Pure Python implementation of bayesian global optimization with gaussian processes.
BayesianOptimization is built on Bayesian inference and Gaussian processes, and tries to find the maximum of a function in as few iterations as possible.
# Install
pip install bayesian-optimization
# Import
from bayes_opt import BayesianOptimization
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)
# Define the objective; bayes_opt passes parameters as floats, so cast to int
def model(n_estimators, max_depth, min_samples_leaf):
    reg = RandomForestRegressor(n_estimators=int(n_estimators),
                                max_depth=int(max_depth),
                                min_samples_leaf=int(min_samples_leaf),
                                random_state=10)
    reg.fit(X, y)
    # NB: this is the training-set R^2, which is an optimistic estimate
    return reg.score(X, y)
optimizer = BayesianOptimization(
    model,
    {'n_estimators': (10, 31),
     'max_depth': (5, 13),
     'min_samples_leaf': (5, 31)}
)
optimizer.maximize(
    init_points=5,  # number of random initial points
    n_iter=10,      # number of Bayesian optimization iterations
)
# Best parameter combination
print(optimizer.max)
{'target': 0.885985055893347, 'params': {'max_depth': 13.0, 'min_samples_leaf': 5.0, 'n_estimators': 31.0}}
# optimizer.res lists every evaluated point; optimizer.max is just the best one
for i, res in enumerate(optimizer.res):
    print("Iteration {}: \n\t{}".format(i, res))
5. hyperopt
- Hyperopt: Distributed Hyperparameter Optimization
hyperopt/hyperopt: Distributed Asynchronous Hyperparameter Optimization in Python (github.com)
Hyperopt relies on the fmin() function, which searches for the minimum of an objective rather than the maximum. If a larger model output is better, return 1 - score or the negated score (-score) instead.
pip install hyperopt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from hyperopt import fmin, tpe, hp, Trials, space_eval
from sklearn import metrics
np.random.seed(1)
X, y = make_regression(n_samples=2000, n_features=10,
                       random_state=0, shuffle=False)
def hyperparameter_tuning(params):
    # hp.quniform returns floats; cast max_depth to int for sklearn
    params['max_depth'] = int(params['max_depth'])
    clf = RandomForestRegressor(**params, n_jobs=-1, random_state=0)
    # first 1500 samples for training, last 500 for validation
    clf.fit(X[:1500], y[:1500])
    mse = metrics.mean_squared_error(y[1500:2000], clf.predict(X[1500:2000]))
    return mse
# Initialize the Trials object
trials = Trials()
# Parameter space
space = {
    "n_estimators": hp.choice("n_estimators", range(5, 15, 5)),
    "criterion": hp.choice("criterion", ["squared_error", "absolute_error"]),
    "max_depth": hp.quniform("max_depth", 10, 12, 1)
}
best = fmin(
    fn=hyperparameter_tuning,
    space=space,
    algo=tpe.suggest,  # Tree of Parzen Estimators (TPE); Adaptive TPE also exists
    max_evals=10,
    trials=trials
)
print("Best: {}".format(best))
print(space_eval(space, best))
trials.trials      # full record of every trial
trials.results     # the dicts returned by the objective (loss and status)
trials.losses()    # list of losses (a float for each successful trial)
trials.statuses()  # list of status strings
- With the trials object, you can retrieve detailed data on the fmin() search process.
- Available search algorithms include hyperopt.tpe.suggest and hyperopt.rand.suggest.
- hyperopt's parameter space must be built from a fixed set of expressions (a sampling sketch follows this list):
- hp.choice(label, options): a list or tuple of options
- hp.randint(label, upper): integer in the range [0, upper)
- hp.uniform(label, low, high): uniform distribution over [low, high]
- hp.quniform(label, low, high, q): round(uniform(low, high) / q) * q
- hp.loguniform(label, low, high): exp(uniform(low, high))
- hp.qloguniform(label, low, high, q): round(exp(uniform(low, high)) / q) * q
- hp.normal(label, mu, sigma): normal distribution
- hp.qnormal(label, mu, sigma, q): round(normal(mu, sigma) / q) * q
- hp.lognormal(label, mu, sigma): exp(normal(mu, sigma))
- hp.qlognormal(label, mu, sigma, q): round(exp(normal(mu, sigma)) / q) * q
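To see what these expressions produce, you can draw random samples from a space with hyperopt.pyll.stochastic; a minimal sketch (demo_space is an illustrative space, not the one from the example):
from hyperopt import hp
from hyperopt.pyll import stochastic
demo_space = {
    "n_estimators": hp.choice("n_estimators", [5, 10]),
    "max_depth": hp.quniform("max_depth", 10, 12, 1),  # float multiple of q=1
    "lr": hp.loguniform("lr", -5, 0),                  # exp(uniform(-5, 0))
}
# Each call draws one random configuration from the space
for _ in range(3):
    print(stochastic.sample(demo_space))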
References
sklearn.ensemble — scikit-learn documentation
Hyperopt: Distributed Hyperparameter Optimization
Bayesian Optimization