1. Building a random forest regression model with scikit-learn's RandomForestRegressor
sklearn.ensemble.RandomForestRegressor — scikit-learn 1.0.2 documentation
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)
model.fit(X, y)
# predict() expects a 2D array-like: one row of 10 feature values
print(model.predict([list(range(10))]))
- Inspect the model's parameters
>>> model.get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 0,
'verbose': 0,
'warm_start': False}
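Before tuning, it helps to record a baseline score on held-out data, since the model above is fit and used on the same samples. A minimal sketch (the train/test split and variable names are illustrative, not from the original example):
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=1000, n_features=10,
                       random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
baseline = RandomForestRegressor(random_state=0)
baseline.fit(X_train, y_train)
# score() returns R^2 for regressors; this held-out value is the number to beat
print(baseline.score(X_test, y_test))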
2. Grid search with scikit-learn's GridSearchCV()
- 2.1 GridSearchCV() parameters
- Exhaustive search over specified parameter values for an estimator
class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
Parameters:
estimator: the model, usually a scikit-learn estimator; a custom model must implement the required methods (e.g. fit)
param_grid: the parameter space to search
scoring: the metric used to score the held-out fold during cross-validation
n_jobs: number of jobs to run in parallel; -1 means all processors
refit: whether to refit the estimator on the whole dataset with the best parameter combination once the search finishes
cv: the number of cross-validation splits (folds)
verbose: controls message printing; >1: the computation time and candidate parameters for each fold are displayed; >2: the score is also displayed; >3: fold and candidate parameter indexes are also displayed, together with the start time of the computation
pre_dispatch: number of jobs dispatched during parallel execution; default='2*n_jobs'
error_score: the value assigned to the score if an error occurs while fitting the estimator
return_train_score: whether to include training-set scores in the results
Attributes:
cv_results_: details and scores for every parameter combination tried
best_estimator_: the best-performing model; requires refit=True
best_score_: the mean cross-validated score of best_estimator_
best_params_: the parameter combination of best_estimator_
......
- 2.2 GridSearchCV() example
from sklearn.model_selection import GridSearchCV
X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)
param_grid = {'criterion': ['squared_error', 'absolute_error', 'poisson'],
              'n_estimators': range(10, 12),
              'max_depth': [5, 10, 15],
              'min_samples_leaf': [2, 3, 5]}
model_grid = GridSearchCV(model, param_grid=param_grid, cv=6)
model_grid.fit(X, y)
print(model_grid.best_estimator_, model_grid.best_params_, model_grid.best_score_, sep="\n")
RandomForestRegressor(max_depth=15, min_samples_leaf=2, random_state=0)
{'min_samples_leaf': 2, 'max_depth': 15, 'criterion': 'squared_error'}
0.715165941558444
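best_params_ only reports the winning combination; cv_results_ holds the scores of every combination tried. A minimal sketch of inspecting it with pandas (assuming the fitted model_grid from the example above):
import pandas as pd
# cv_results_ is a dict of parallel arrays; each row describes one combination
results = pd.DataFrame(model_grid.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())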
3. Randomized search with scikit-learn's RandomizedSearchCV()
RandomizedSearchCV() is used much like GridSearchCV(); two parameters deserve attention: param_distributions and n_iter.
- The parameter-space argument has a different name (param_distributions instead of param_grid);
- Because the search is random, you must specify how many parameter combinations to sample from the space (n_iter).
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)
param_grid = {'criterion': ['squared_error', 'absolute_error', 'poisson'],
              'n_estimators': range(10, 12),
              'max_depth': [5, 10, 15],
              'min_samples_leaf': [2, 3, 5]}
model_grid = RandomizedSearchCV(model, param_distributions=param_grid, cv=6, n_iter=10)
model_grid.fit(X, y)
print(model_grid.best_estimator_, model_grid.best_params_, model_grid.best_score_, sep="\n")
RandomForestRegressor(max_depth=5, min_samples_leaf=2, n_estimators=11,
random_state=0)
{'n_estimators': 11, 'min_samples_leaf': 2, 'max_depth': 5, 'criterion': 'squared_error'}
0.6628360955299456
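The example above samples from the same discrete lists a grid search would use; the real strength of randomized search is that param_distributions also accepts continuous distributions. A minimal sketch with scipy.stats (the distributions and bounds here are illustrative choices):
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
param_distributions = {
    'n_estimators': randint(10, 100),   # integers drawn from [10, 100)
    'max_depth': randint(3, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9),  # fraction of features in [0.1, 1.0)
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions=param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)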
4. Bayesian hyperparameter tuning
- BayesianOptimization: Pure Python implementation of bayesian global optimization with gaussian processes.
BayesianOptimization is built on Bayesian inference and Gaussian processes, and tries to find the maximum of a function in as few iterations as possible.
# Install
pip install bayesian-optimization
# Import
from bayes_opt import BayesianOptimization
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)
# Define the objective; bayes_opt passes parameters as floats, so cast to int
def model(n_estimators, max_depth, min_samples_leaf):
    reg = RandomForestRegressor(n_estimators=int(n_estimators),
                                max_depth=int(max_depth),
                                min_samples_leaf=int(min_samples_leaf),
                                random_state=10)
    reg.fit(X, y)
    # NB: this is the training-set R^2, which is an optimistic estimate
    return reg.score(X, y)
optimizer = BayesianOptimization(
    model,
    {'n_estimators': (10, 31),
     'max_depth': (5, 13),
     'min_samples_leaf': (5, 31)}
)
optimizer.maximize(
    init_points=5,  # number of random initial points
    n_iter=10,      # number of Bayesian optimization iterations
)
# Best parameter combination
print(optimizer.max)
{'target': 0.885985055893347, 'params': {'max_depth': 13.0, 'min_samples_leaf': 5.0, 'n_estimators': 31.0}}
# optimizer.res lists every evaluated point; optimizer.max is just the best one
for i, res in enumerate(optimizer.res):
    print("Iteration {}: \n\t{}".format(i, res))
5. hyperopt
- Hyperopt: Distributed Hyperparameter Optimization
hyperopt/hyperopt: Distributed Asynchronous Hyperparameter Optimization in Python (github.com)
Hyperopt relies on the fmin() function, which searches for the minimum of an objective rather than the maximum. If a larger model output is better, return 1 - score or the negated score (-score) instead.
pip install hyperopt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from hyperopt import fmin, tpe, hp, Trials, space_eval
from sklearn import metrics
np.random.seed(1)
X, y = make_regression(n_samples=2000, n_features=10,
                       random_state=0, shuffle=False)
def hyperparameter_tuning(params):
    # hp.quniform returns floats; cast max_depth to int for sklearn
    params['max_depth'] = int(params['max_depth'])
    clf = RandomForestRegressor(**params, n_jobs=-1, random_state=0)
    # first 1500 samples for training, last 500 for validation
    clf.fit(X[:1500], y[:1500])
    mse = metrics.mean_squared_error(y[1500:2000], clf.predict(X[1500:2000]))
    return mse
# Initialize the Trials object
trials = Trials()
# Parameter space
space = {
    "n_estimators": hp.choice("n_estimators", range(5, 15, 5)),
    "criterion": hp.choice("criterion", ["squared_error", "absolute_error"]),
    "max_depth": hp.quniform("max_depth", 10, 12, 1)
}
best = fmin(
    fn=hyperparameter_tuning,
    space=space,
    algo=tpe.suggest,  # Tree of Parzen Estimators (TPE); Adaptive TPE also exists
    max_evals=10,
    trials=trials
)
print("Best: {}".format(best))
print(space_eval(space, best))
trials.trials      # full record of every trial
trials.results     # the dicts returned by the objective (loss and status)
trials.losses()    # list of losses (a float for each successful trial)
trials.statuses()  # list of status strings
- With the trials object, you can retrieve detailed data on the fmin() search process.
- Available search algorithms include hyperopt.tpe.suggest and hyperopt.rand.suggest.
- hyperopt's parameter space must be built from a fixed set of expressions (a sampling sketch follows this list):
- hp.choice(label, options): a list or tuple of options
- hp.randint(label, upper): integer in the range [0, upper)
- hp.uniform(label, low, high): uniform distribution over [low, high]
- hp.quniform(label, low, high, q): round(uniform(low, high) / q) * q
- hp.loguniform(label, low, high): exp(uniform(low, high))
- hp.qloguniform(label, low, high, q): round(exp(uniform(low, high)) / q) * q
- hp.normal(label, mu, sigma): normal distribution
- hp.qnormal(label, mu, sigma, q): round(normal(mu, sigma) / q) * q
- hp.lognormal(label, mu, sigma): exp(normal(mu, sigma))
- hp.qlognormal(label, mu, sigma, q): round(exp(normal(mu, sigma)) / q) * q
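To see what these expressions produce, you can draw random samples from a space with hyperopt.pyll.stochastic; a minimal sketch (demo_space is an illustrative space, not the one from the example):
from hyperopt import hp
from hyperopt.pyll import stochastic
demo_space = {
    "n_estimators": hp.choice("n_estimators", [5, 10]),
    "max_depth": hp.quniform("max_depth", 10, 12, 1),  # float multiple of q=1
    "lr": hp.loguniform("lr", -5, 0),                  # exp(uniform(-5, 0))
}
# Each call draws one random configuration from the space
for _ in range(3):
    print(stochastic.sample(demo_space))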
References
sklearn.ensemble — scikit-learn documentation
Hyperopt: Distributed Hyperparameter Optimization
Bayesian Optimization