Xgboost

Author: ForgetThatNight | Published 2018-07-02 22:57

    part1_data_discovery

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score as AUC
    from sklearn.metrics import mean_absolute_error
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import LabelEncoder, LabelBinarizer
    from sklearn.model_selection import cross_val_score   # sklearn.cross_validation was removed in scikit-learn 0.20
    
    from scipy import stats
    import seaborn as sns
    from copy import deepcopy
    
    %matplotlib inline
    
    # This may raise an exception in earlier versions of Jupyter
    %config InlineBackend.figure_format = 'retina'
    

    In this part we do a short data exploration to see what kind of dataset we have and whether we can find any patterns in it.

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    

    Let's first take a look at what the data looks like.

    train.shape
    

    Output

    (188318, 132)
    

    188k training instances and 132 columns — a decent amount of data.

    print ('First 20 columns:', list(train.columns[:20]))
    
    print ('Last 20 columns:', list(train.columns[-20:]))
    

    Output

    First 20 columns: ['id', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8', 'cat9', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19']
    Last 20 columns: ['cat112', 'cat113', 'cat114', 'cat115', 'cat116', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13', 'cont14', 'loss']
    

    We can see 116 categorical features (as their names suggest) and 14 continuous (numeric) features. On top of that there are the id and the loss (claim amount) columns, for 132 columns in total.

    train.describe()
    

    As we can see, all the continuous features have been scaled to the [0, 1] interval, with means of roughly 0.5. The data has in fact already been preprocessed; what we are given are ready-made features.
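
    A quick check of that claim (a sketch; the cont* columns are selected by name here, since cont_features is only defined further below):

    train.filter(regex='^cont').mean().round(3)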

    Checking for missing values

    In the vast majority of cases we need to handle missing values.

    pd.isnull(train).values.any()
    

    Output

    False
    

    Surprisingly, there are no missing values at all, so we can move on happily.
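
    Had there been missing values, a minimal imputation sketch might look like this (purely hypothetical here, since this dataset has none):

    # Hypothetical imputation sketch -- a no-op for this dataset.
    num_cols = train.select_dtypes(exclude=['object']).columns
    cat_cols = train.select_dtypes(include=['object']).columns
    train[num_cols] = train[num_cols].fillna(train[num_cols].median())   # median-fill numeric columns
    train[cat_cols] = train[cat_cols].fillna('missing')                  # sentinel category for categoricals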

    Continuous vs categorical features

    Another way to see the split between categorical and continuous features is to run the pd.DataFrame.info method:

    train.info()
    

    Output

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 188318 entries, 0 to 188317
    Columns: 132 entries, id to loss
    dtypes: float64(15), int64(1), object(116)
    memory usage: 189.7+ MB
    

    Here, float64(15) and int64(1) are our continuous features (the int64 one is probably id), while object(116) are the categorical features. We can confirm this:

    cat_features = list(train.select_dtypes(include=['object']).columns)
    print("Categorical: {} features".format(len(cat_features)))
    

    Output

    Categorical: 116 features

    cont_features = [cont for cont in list(train.select_dtypes(
                     include=['float64', 'int64']).columns) if cont not in ['loss', 'id']]
    print("Continuous: {} features".format(len(cont_features)))
    

    Output

    Continuous: 14 features
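
    The single int64 column is the id. The post also shows the printout A column of int64: ['id']; the cell that produced it is not included, but it was presumably along these lines (a reconstruction):

    id_col = list(train.select_dtypes(include=['int64']).columns)
    print("A column of int64:", id_col)
    # A column of int64: ['id']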
    

    Number of distinct values in each categorical feature

    cat_uniques = []
    for cat in cat_features:
        cat_uniques.append(len(train[cat].unique()))
        
    # pd.DataFrame.from_items was removed in newer pandas; a plain dict constructor gives the same frame
    uniq_values_in_categories = pd.DataFrame({'cat_name': cat_features, 'unique_values': cat_uniques})
    uniq_values_in_categories.head()
    
    fig, (ax1, ax2) = plt.subplots(1,2)
    fig.set_size_inches(16,5)
    ax1.hist(uniq_values_in_categories.unique_values, bins=50)
    ax1.set_title('Amount of categorical features with X distinct values')
    ax1.set_xlabel('Distinct values in a feature')
    ax1.set_ylabel('Features')
    ax1.annotate('A feature with 326 vals', xy=(322, 2), xytext=(200, 38), arrowprops=dict(facecolor='black'))
    
    ax2.set_xlim(2,30)
    ax2.set_title('Zooming in the [0,30] part of left histogram')
    ax2.set_xlabel('Distinct values in a feature')
    ax2.set_ylabel('Features')
    ax2.grid(True)
    ax2.hist(uniq_values_in_categories[uniq_values_in_categories.unique_values <= 30].unique_values, bins=30)
    ax2.annotate('Binary features', xy=(3, 71), xytext=(7, 71), arrowprops=dict(facecolor='black'))
    

    Loss (claim) values

    plt.figure(figsize=(16,8))
    plt.plot(train['id'], train['loss'])
    plt.title('Loss values per id')
    plt.xlabel('id')
    plt.ylabel('loss')
    plt.legend()
    plt.show()
    

    The loss values contain a few pronounced spikes that correspond to severe accidents. A distribution like this makes the target heavily skewed, which leads to poor regression performance.
    Skewness measures the asymmetry of a real-valued random variable's distribution about its mean. Let's compute the skewness of the loss:

    stats.mstats.skew(train['loss']).data
    # Output: array(3.7949281496777445)
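
    As a quick sanity check (a sketch, not in the original post), the same number can be computed by hand as the third central moment divided by the cube of the standard deviation:

    loss = train['loss'].values
    manual_skew = np.mean((loss - loss.mean())**3) / loss.std()**3
    print(manual_skew)   # should closely match the stats.mstats.skew value above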
    

    The data is indeed skewed.
    A log transform usually improves skewness; we can apply it with np.log:

    stats.mstats.skew(np.log(train['loss'])).data
    

    Output

    array(0.0929738049841997)
    
    fig, (ax1, ax2) = plt.subplots(1,2)
    fig.set_size_inches(16,5)
    ax1.hist(train['loss'], bins=50)
    ax1.set_title('Train Loss target histogram')
    ax1.grid(True)
    ax2.hist(np.log(train['loss']), bins=50, color='g')
    ax2.set_title('Train Log Loss target histogram')
    ax2.grid(True)
    plt.show()
    

    Continuous features

    One thing we can do is plot histograms of the numerical features and analyze their distributions:

    train[cont_features].hist(bins=50, figsize=(16,12))
    

    Correlation between features

    plt.subplots(figsize=(16,9))
    correlation_mat = train[cont_features].corr()
    sns.heatmap(correlation_mat, annot=True)
    
    We can see that several features are highly correlated with each other.
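
    To make "highly correlated" concrete, a small sketch (reusing the correlation_mat computed above) that ranks the pairwise correlations:

    # Keep only the upper triangle so each pair appears once, then sort.
    mask = np.triu(np.ones(correlation_mat.shape, dtype=bool), k=1)
    corr_pairs = correlation_mat.where(mask).stack().sort_values(ascending=False)
    print(corr_pairs.head(10))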

    part2_xgboost

    import xgboost as xgb
    import pandas as pd
    import numpy as np
    import pickle
    import sys
    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_absolute_error, make_scorer
    from sklearn.preprocessing import StandardScaler
    from sklearn.grid_search import GridSearchCV   # removed in scikit-learn 0.20; newer code uses sklearn.model_selection (grid_scores_ becomes cv_results_)
    from scipy.sparse import csr_matrix, hstack
    from sklearn.cross_validation import KFold, train_test_split   # likewise moved to sklearn.model_selection in newer scikit-learn
    from xgboost import XGBRegressor
    
    import warnings
    warnings.filterwarnings('ignore')
    
    %matplotlib inline
    
    # This may raise an exception in earlier versions of Jupyter
    %config InlineBackend.figure_format = 'retina'
    

    This part is all about XGBoost.

    Data preprocessing

    train = pd.read_csv('train.csv')
    

    Apply the log transform to the target:

    train['log_loss'] = np.log(train['loss'])
    

    Split the columns into continuous and categorical features:

    features = [x for x in train.columns if x not in ['id','loss', 'log_loss']]
    
    cat_features = [x for x in train.select_dtypes(
            include=['object']).columns if x not in ['id','loss', 'log_loss']]
    num_features = [x for x in train.select_dtypes(
            exclude=['object']).columns if x not in ['id','loss', 'log_loss']]
    
    print ("Categorical features:", len(cat_features))
    print ("Numerical features:", len(num_features))
    

    Output

    Categorical features: 116
    Numerical features: 14
    

    And label-encode the categorical features (done here with pandas category codes):

    ntrain = train.shape[0]
    
    train_x = train[features].copy()   # explicit copy avoids pandas' SettingWithCopyWarning when encoding in place below
    train_y = train['log_loss']
    
    for c in range(len(cat_features)):
        train_x[cat_features[c]] = train_x[cat_features[c]].astype('category').cat.codes
        
    print ("Xtrain:", train_x.shape)
    print ("ytrain:", train_y.shape)
    

    Output

    Xtrain: (188318, 130)
    ytrain: (188318,)
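
    A tiny illustration of what .cat.codes does (a toy column, not from this dataset):

    s = pd.Series(['A', 'B', 'A', 'D']).astype('category')
    print(list(s.cat.codes))   # [0, 1, 0, 2] -- integer codes follow the sorted categories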
    

    Simple XGBoost Model

    First we train a basic xgboost model, then tune its parameters and use cross-validation to watch how the results change. Performance is measured by the mean absolute error on the original scale:
    mean_absolute_error(np.exp(y), np.exp(yhat)).
    xgboost defines its own data-matrix class, DMatrix, which preprocesses the data once at the start of training and thereby speeds up every subsequent iteration.

    def xg_eval_mae(yhat, dtrain):
        y = dtrain.get_label()
        return 'mae', mean_absolute_error(np.exp(y), np.exp(yhat))
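
    A tiny worked example (a sketch with made-up numbers) of why the metric exponentiates before computing the error: a fixed offset in log space corresponds to very different dollar amounts depending on the size of the loss.

    y_true = np.array([1000.0, 5000.0])
    y_pred = np.exp(np.log(y_true) + 0.1)                          # both predictions off by 0.1 in log space
    print(mean_absolute_error(np.log(y_true), np.log(y_pred)))     # 0.1 on the log scale
    print(mean_absolute_error(y_true, y_pred))                     # ~315.5 in the original units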
    

    Model

    dtrain = xgb.DMatrix(train_x, train['log_loss'])
    

    XGBoost parameters

    • 'booster': 'gbtree',
    • 'objective': 'multi:softmax', for multi-class problems
    • 'num_class': 10, the number of classes, used together with multi:softmax
    • 'gamma': the minimum loss reduction required to make a split
    • 'max_depth': 12, maximum depth of a tree; larger values make overfitting more likely
    • 'lambda': 2, the L2 regularization term on the weights; the larger it is, the less the model tends to overfit
    • 'subsample': 0.7, fraction of the training samples randomly drawn for each tree
    • 'colsample_bytree': 0.7, fraction of columns sampled when building each tree
    • 'min_child_weight': 3, the minimum sum of instance weights in a child node; if a leaf's weight sum falls below min_child_weight, splitting stops there
    • 'silent': 0, set to 1 to suppress runtime messages; 0 is usually preferable
    • 'eta': 0.007, acts like a learning rate
    • 'seed': 1000,
    • 'nthread': 7, number of CPU threads

    xgb_params = {
        'seed': 0,
        'eta': 0.1,
        'colsample_bytree': 0.5,
        'silent': 1,
        'subsample': 0.5,
        'objective': 'reg:linear',
        'max_depth': 5,
        'min_child_weight': 3
    }
    

    Cross-validation with xgb.cv

    %%time
    
    bst_cv1 = xgb.cv(xgb_params, dtrain, num_boost_round=50, nfold=3, seed=0, 
                    feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)
    
    print ('CV score:', bst_cv1.iloc[-1,:]['test-mae-mean'])
    

    Output

    CV score: 1218.92834467
    Wall time: 1min 6s
    

    We have our first baseline result: MAE = 1218.9.

    plt.figure()
    bst_cv1[['train-mae-mean', 'test-mae-mean']].plot()
    

    Our first baseline model:

    • no overfitting
    • only 50 trees were built

    %%time
    # build 100 trees
    bst_cv2 = xgb.cv(xgb_params, dtrain, num_boost_round=100, 
                    nfold=3, seed=0, feval=xg_eval_mae, maximize=False, 
                    early_stopping_rounds=10)
    
    print ('CV score:', bst_cv2.iloc[-1,:]['test-mae-mean'])
    

    Output

    CV score: 1171.13663733
    Wall time: 1min 57s
    
    fig, (ax1, ax2) = plt.subplots(1,2)
    fig.set_size_inches(16,4)
    
    ax1.set_title('100 rounds of training')
    ax1.set_xlabel('Rounds')
    ax1.set_ylabel('Loss')
    ax1.grid(True)
    ax1.plot(bst_cv2[['train-mae-mean', 'test-mae-mean']])
    ax1.legend(['Training Loss', 'Test Loss'])
    
    ax2.set_title('60 last rounds of training')
    ax2.set_xlabel('Rounds')
    ax2.set_ylabel('Loss')
    ax2.grid(True)
    ax2.plot(bst_cv2.iloc[40:][['train-mae-mean', 'test-mae-mean']])
    ax2.legend(['Training Loss', 'Test Loss'])
    

    There is a tiny bit of overfitting, but nothing to worry about yet.
    We have a new best, MAE = 1171.1, better than the first run (1218.9). Next we move on to the other parameters.

    XGBoost parameter tuning

    • Step 1: Choose an initial set of parameters.

    • Step 2: Tune max_depth and min_child_weight.

    • Step 3: Tune gamma to reduce the risk of overfitting.

    • Step 4: Tune subsample and colsample_bytree to change the data-sampling strategy.

    • Step 5: Tune the learning rate eta.

    class XGBoostRegressor(object):
        def __init__(self, **kwargs):
            self.params = kwargs
            if 'num_boost_round' in self.params:
                self.num_boost_round = self.params['num_boost_round']
            else:
                self.num_boost_round = 50   # fall back to 50 boosting rounds if not supplied
            self.params.update({'silent': 1, 'objective': 'reg:linear', 'seed': 0})
            
        def fit(self, x_train, y_train):
            dtrain = xgb.DMatrix(x_train, y_train)
            self.bst = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                                 feval=xg_eval_mae, maximize=False)
            
        def predict(self, x_pred):
            dpred = xgb.DMatrix(x_pred)
            return self.bst.predict(dpred)
        
        def kfold(self, x_train, y_train, nfold=5):
            dtrain = xgb.DMatrix(x_train, y_train)
            cv_rounds = xgb.cv(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                               nfold=nfold, feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)
            return cv_rounds.iloc[-1,:]
        
        def plot_feature_importances(self):
            feat_imp = pd.Series(self.bst.get_fscore()).sort_values(ascending=False)
            feat_imp.plot(title='Feature Importances')
            plt.ylabel('Feature Importance Score')
            
        def get_params(self, deep=True):
            return self.params
     
        def set_params(self, **params):
            self.params.update(params)
            return self
    
    def mae_score(y_true, y_pred):
        return mean_absolute_error(np.exp(y_true), np.exp(y_pred))
    
    mae_scorer = make_scorer(mae_score, greater_is_better=False)
    
    bst = XGBoostRegressor(eta=0.1, colsample_bytree=0.5, subsample=0.5, 
                           max_depth=5, min_child_weight=3, num_boost_round=50)
    
    bst.kfold(train_x, train_y, nfold=5)
    

    Output

    test-mae-mean     1219.014551
    test-mae-std         8.931061
    train-mae-mean    1210.682813
    train-mae-std        2.798608
    Name: 49, dtype: float64
    

    Step 1: Learning rate and number of trees

    Step 2: Tree depth and node weight

    These parameters have the largest impact on xgboost performance, so they should be tuned first. Briefly:

    • max_depth: the maximum depth of a tree. Increasing it makes the model more complex and more prone to overfitting; depths of 3-10 are reasonable.
    • min_child_weight: a regularization parameter. If the sum of instance weights in a tree partition falls below this threshold, the tree-building process stops splitting there.

    xgb_param_grid = {'max_depth': list(range(4,9)), 'min_child_weight': list((1,3,6))}
    xgb_param_grid['max_depth']
    

    Output

    [4, 5, 6, 7, 8]
    
    %%time
     
    grid = GridSearchCV(XGBoostRegressor(eta=0.1, num_boost_round=50, colsample_bytree=0.5, subsample=0.5),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)
    
    grid.fit(train_x, train_y.values)
    

    Wall time: 29min 48s

    grid.grid_scores_, grid.best_params_, grid.best_score_
    

    Output

    ([mean: -1243.19015, std: 6.70264, params: {'max_depth': 4, 'min_child_weight': 1},
      mean: -1243.30647, std: 6.82365, params: {'max_depth': 4, 'min_child_weight': 3},
      mean: -1243.50752, std: 6.60994, params: {'max_depth': 4, 'min_child_weight': 6},
      mean: -1219.60926, std: 7.09979, params: {'max_depth': 5, 'min_child_weight': 1},
      mean: -1218.72940, std: 6.82721, params: {'max_depth': 5, 'min_child_weight': 3},
      mean: -1219.25033, std: 6.89855, params: {'max_depth': 5, 'min_child_weight': 6},
      mean: -1204.68929, std: 6.28730, params: {'max_depth': 6, 'min_child_weight': 1},
      mean: -1203.44649, std: 7.19550, params: {'max_depth': 6, 'min_child_weight': 3},
      mean: -1203.76522, std: 7.13140, params: {'max_depth': 6, 'min_child_weight': 6},
      mean: -1195.35465, std: 6.38664, params: {'max_depth': 7, 'min_child_weight': 1},
      mean: -1194.02729, std: 6.69778, params: {'max_depth': 7, 'min_child_weight': 3},
      mean: -1193.51933, std: 6.73645, params: {'max_depth': 7, 'min_child_weight': 6},
      mean: -1189.10977, std: 6.18540, params: {'max_depth': 8, 'min_child_weight': 1},
      mean: -1188.21520, std: 6.15132, params: {'max_depth': 8, 'min_child_weight': 3},
      mean: -1187.95975, std: 6.71340, params: {'max_depth': 8, 'min_child_weight': 6}],
     {'max_depth': 8, 'min_child_weight': 6},
     -1187.9597499123447)
    

    The best result found by the grid search:
    {'max_depth': 8, 'min_child_weight': 6},
    -1187.9597499123447
    The score is negative because make_scorer was created with greater_is_better=False, so GridSearchCV maximizes the negated MAE; larger (less negative) values are better.
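
    The cross-validated MAE for the best parameters can be read off the fitted grid object directly (a small sketch):

    print(-grid.best_score_)   # ~1187.96, the MAE of the best parameter combination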

    def convert_grid_scores(scores):
        _params = []
        _params_mae = []
        for i in scores:
            _params.append(list(i[0].values()))   # parameter values of this grid point
            _params_mae.append(i[1])              # mean validation score (negated MAE)
        grid_res = np.column_stack((_params, _params_mae))
        return [grid_res[:, i] for i in range(grid_res.shape[1])]
    
    # The last returned column holds the scores; the earlier ones hold the parameter values.
    scores = convert_grid_scores(grid.grid_scores_)[-1]
    scores = scores.reshape(5,3)
    
    plt.figure(figsize=(10,5))
    cp = plt.contourf(xgb_param_grid['min_child_weight'], xgb_param_grid['max_depth'], scores, cmap='BrBG')
    plt.colorbar(cp)
    plt.title('Depth / min_child_weight optimization')
    plt.annotate('We use this', xy=(5.95, 7.95), xytext=(4, 7.5), arrowprops=dict(facecolor='white'), color='white')
    plt.annotate('Good for depth=7', xy=(5.98, 7.05), 
                 xytext=(4, 6.5), arrowprops=dict(facecolor='white'), color='white')
    plt.xlabel('min_child_weight')
    plt.ylabel('max_depth')
    plt.grid(True)
    plt.show()
    

    From the grid-search results we see that the improvement in score comes mainly from increasing max_depth. min_child_weight has only a slight effect on the score, but min_child_weight = 6 looks a bit better.

    Step 3: Tune gamma to reduce the risk of overfitting

    %%time
    
    xgb_param_grid = {'gamma':[ 0.1 * i for i in range(0,5)]}
    
    grid = GridSearchCV(XGBoostRegressor(eta=0.1, num_boost_round=50, max_depth=8, min_child_weight=6,
                                            colsample_bytree=0.5, subsample=0.5),
                        param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)
    
    grid.fit(train_x, train_y.values)
    

    Wall time: 13min 45s

    grid.grid_scores_, grid.best_params_, grid.best_score_
    

    Output

    ([mean: -1187.95975, std: 6.71340, params: {'gamma': 0.0},
      mean: -1187.67788, std: 6.44332, params: {'gamma': 0.1},
      mean: -1187.66616, std: 6.75004, params: {'gamma': 0.2},
      mean: -1187.21835, std: 7.06771, params: {'gamma': 0.30000000000000004},
      mean: -1188.35004, std: 6.50057, params: {'gamma': 0.4}],
     {'gamma': 0.30000000000000004},
     -1187.2183540791846)
    

    We choose to use a somewhat smaller gamma.

    Step 4: Tune the sampling parameters subsample and colsample_bytree

    %%time
    
    xgb_param_grid = {'subsample':[ 0.1 * i for i in range(6,9)],
                          'colsample_bytree':[ 0.1 * i for i in range(6,9)]}
    
    
    grid = GridSearchCV(XGBoostRegressor(eta=0.1, gamma=0.2, num_boost_round=50, max_depth=8, min_child_weight=6),
                        param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)
    grid.fit(train_x, train_y.values)
    

    Wall time: 28min 26s

    grid.grid_scores_, grid.best_params_, grid.best_score_
    

    Output

    ([mean: -1185.67108, std: 5.40097, params: {'colsample_bytree': 0.6000000000000001, 'subsample': 0.6000000000000001},
      mean: -1184.90641, std: 5.61239, params: {'colsample_bytree': 0.6000000000000001, 'subsample': 0.7000000000000001},
      mean: -1183.73767, std: 6.15639, params: {'colsample_bytree': 0.6000000000000001, 'subsample': 0.8},
      mean: -1185.09329, std: 7.04215, params: {'colsample_bytree': 0.7000000000000001, 'subsample': 0.6000000000000001},
      mean: -1184.36149, std: 5.71298, params: {'colsample_bytree': 0.7000000000000001, 'subsample': 0.7000000000000001},
      mean: -1183.83446, std: 6.24654, params: {'colsample_bytree': 0.7000000000000001, 'subsample': 0.8},
      mean: -1184.43055, std: 6.68009, params: {'colsample_bytree': 0.8, 'subsample': 0.6000000000000001},
      mean: -1183.33878, std: 5.74989, params: {'colsample_bytree': 0.8, 'subsample': 0.7000000000000001},
      mean: -1182.93099, std: 5.75849, params: {'colsample_bytree': 0.8, 'subsample': 0.8}],
     {'colsample_bytree': 0.8, 'subsample': 0.8},
     -1182.9309918891634)
    
    scores = convert_grid_scores(grid.grid_scores_)[-1]
    scores = scores.reshape(3,3)
    
    plt.figure(figsize=(10,5))
    cp = plt.contourf(xgb_param_grid['subsample'], xgb_param_grid['colsample_bytree'], scores, cmap='BrBG')
    plt.colorbar(cp)
    plt.title('Subsampling params tuning')
    plt.annotate('Optimum', xy=(0.895, 0.6), xytext=(0.8, 0.695), arrowprops=dict(facecolor='black'))
    plt.xlabel('subsample')
    plt.ylabel('colsample_bytree')
    plt.grid(True)
    plt.show()
    

    For this particular pretrained configuration, I got the following result:
    {'colsample_bytree': 0.8, 'subsample': 0.8}, -1182.9309918891634

    Step 5: Reduce the learning rate and increase the number of trees

    The final step of parameter tuning is to lower the learning rate while adding more estimators.
    First, we plot different learning rates for a simpler model (50 trees):

    %%time
        
    xgb_param_grid = {'eta':[0.5,0.4,0.3,0.2,0.1,0.075,0.05,0.04,0.03]}
    grid = GridSearchCV(XGBoostRegressor(num_boost_round=50, gamma=0.2, max_depth=8, min_child_weight=6,
                                            colsample_bytree=0.6, subsample=0.9),
                        param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)
    
    grid.fit(train_x, train_y.values)
    

    CPU times: user 6.69 ms, sys: 0 ns, total: 6.69 ms
    Wall time: 6.55 ms

    grid.grid_scores_, grid.best_params_, grid.best_score_
    

    Output

    ([mean: -1205.85372, std: 3.46146, params: {'eta': 0.5},
      mean: -1185.32847, std: 4.87321, params: {'eta': 0.4},
      mean: -1170.00284, std: 4.76399, params: {'eta': 0.3},
      mean: -1160.97363, std: 6.05830, params: {'eta': 0.2},
      mean: -1183.66720, std: 6.69439, params: {'eta': 0.1},
      mean: -1266.12628, std: 7.26130, params: {'eta': 0.075},
      mean: -1709.15130, std: 8.19994, params: {'eta': 0.05},
      mean: -2104.42708, std: 8.02827, params: {'eta': 0.04},
      mean: -2545.97334, std: 7.76440, params: {'eta': 0.03}],
     {'eta': 0.2},
     -1160.9736284869114)
    
    eta, y = convert_grid_scores(grid.grid_scores_)
    plt.figure(figsize=(10,4))
    plt.title('MAE and ETA, 50 trees')
    plt.xlabel('eta')
    plt.ylabel('score')
    plt.plot(eta, -y)
    plt.grid(True)
    plt.show()
    

    {'eta': 0.2}, -1160.9736284869114 is the best result so far.
    Now we increase the number of trees to 100.

    xgb_param_grid = {'eta':[0.5,0.4,0.3,0.2,0.1,0.075,0.05,0.04,0.03]}
    grid = GridSearchCV(XGBoostRegressor(num_boost_round=100, gamma=0.2, max_depth=8, min_child_weight=6,
                                            colsample_bytree=0.6, subsample=0.9),
                        param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)
    
    grid.fit(train_x, train_y.values)
    

    CPU times: user 11.5 ms, sys: 0 ns, total: 11.5 ms
    Wall time: 11.4 ms

    grid.grid_scores_, grid.best_params_, grid.best_score_
    

    Output

    ([mean: -1231.04517, std: 5.41136, params: {'eta': 0.5},
      mean: -1201.31398, std: 4.75456, params: {'eta': 0.4},
      mean: -1177.86344, std: 3.67324, params: {'eta': 0.3},
      mean: -1160.48853, std: 5.65336, params: {'eta': 0.2},
      mean: -1152.24715, std: 5.85286, params: {'eta': 0.1},
      mean: -1156.75829, std: 5.30250, params: {'eta': 0.075},
      mean: -1184.88913, std: 6.08852, params: {'eta': 0.05},
      mean: -1243.60808, std: 7.40326, params: {'eta': 0.04},
      mean: -1467.04736, std: 8.70704, params: {'eta': 0.03}],
     {'eta': 0.1},
     -1152.2471498726127)
    
    eta, y = convert_grid_scores(grid.grid_scores_)
    plt.figure(figsize=(10,4))
    plt.title('MAE and ETA, 100 trees')
    plt.xlabel('eta')
    plt.ylabel('score')
    plt.plot(eta, -y)
    plt.grid(True)
    plt.show()
    

    A lower learning rate works better.

    %%time
    
    xgb_param_grid = {'eta':[0.09,0.08,0.07,0.06,0.05,0.04]}
    grid = GridSearchCV(XGBoostRegressor(num_boost_round=200, gamma=0.2, max_depth=8, min_child_weight=6,
                                            colsample_bytree=0.6, subsample=0.9),
                        param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)
    
    grid.fit(train_x, train_y.values)
    

    Output

    CPU times: user 21.9 ms, sys: 34 µs, total: 22 ms
    Wall time: 22 ms
    

    What about increasing the number of trees even further?

    grid.grid_scores_, grid.best_params_, grid.best_score_
    

    Output

    ([mean: -1148.37246, std: 6.51203, params: {'eta': 0.09},
      mean: -1146.67343, std: 6.13261, params: {'eta': 0.08},
      mean: -1145.92359, std: 5.68531, params: {'eta': 0.07},
      mean: -1147.44050, std: 6.33336, params: {'eta': 0.06},
      mean: -1147.98062, std: 6.39481, params: {'eta': 0.05},
      mean: -1153.17886, std: 5.74059, params: {'eta': 0.04}],
     {'eta': 0.07},
     -1145.9235944370419)
    
    eta, y = convert_grid_scores(grid.grid_scores_)
    plt.figure(figsize=(10,4))
    plt.title('MAE and ETA, 200 trees')
    plt.xlabel('eta')
    plt.ylabel('score')
    plt.plot(eta, -y)
    plt.grid(True)
    plt.show()
    
    %%time
    
    # Final XGBoost model
    
    
    bst = XGBoostRegressor(num_boost_round=200, eta=0.07, gamma=0.2, max_depth=8, min_child_weight=6,
                                            colsample_bytree=0.6, subsample=0.9)
    cv = bst.kfold(train_x, train_y, nfold=5)
    

    Output

    CPU times: user 1.26 ms, sys: 22 µs, total: 1.28 ms
    Wall time: 1.07 ms
    
    cv
    

    Output

    test-mae-mean     1146.997852
    test-mae-std         9.541592
    train-mae-mean    1036.557251
    train-mae-std        0.974437
    Name: 199, dtype: float64
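
    With the tuned parameters fixed, the model would be used roughly like this (a sketch, not part of the original post; remember that predictions come back in log space):

    # Hypothetical final fit, just to demonstrate the API; the training-set MAE printed here is optimistic.
    bst.fit(train_x, train_y)
    train_pred = np.exp(bst.predict(train_x))   # back-transform from log-loss to loss
    print(mean_absolute_error(train['loss'], train_pred))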
    

    Source: https://www.haomeiwen.com/subject/korruftx.html