XGBoost: A Super Algorithm

Author: taon | Published 2019-08-02 22:21


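XGBoost (Extreme Gradient Boosting): the idea of boosting is to combine many weak learners into one strong learner. XGBoost is one of the best-known ensemble algorithms. Its base learners are usually decision trees, but linear models can be used as well. In Kaggle data mining competitions XGBoost has become a near-standard choice because it trains efficiently and performs well.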


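The idea behind XGBoost
XGBoost builds its model by adding trees one after another, growing each tree through repeated feature splits. Each new tree learns a new function that fits the residual left by the trees before it. Once m trees have been trained, every sample falls into exactly one leaf in each tree, and each leaf carries a score. The final prediction is the sum of those leaf scores across all trees.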

xgboost.png
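Take the example in the figure above. After splitting on different features, the boy ends up in one leaf of each of the two trees, so his prediction is f(boy) = 2 + 0.9 = 2.9.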
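
To make the "each new tree fits the residual" idea concrete, here is a minimal sketch in plain Python. It is not XGBoost itself: it assumes squared-error loss and depth-1 regression trees (stumps), and the names `fit_stump`, `boost`, `predict` and the toy data are hypothetical, chosen only for illustration. The ensemble output is the sum of the leaf value picked in every tree, just as in the f(boy) example.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one threshold and two leaf values."""
    best = None
    # Try every distinct value except the largest, so both sides stay non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, num_trees, eta=1.0):
    """Add stumps one at a time; each one is fit to the current residuals."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + eta * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return stumps


def predict(stumps, x, eta=1.0):
    # Sum of the leaf value selected in each tree (scaled by eta).
    return sum(eta * stump_predict(s, x) for s in stumps)


# Hypothetical one-feature toy data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
for m in (1, 2, 5):
    stumps = boost(xs, ys, num_trees=m)
    mse = sum((y - predict(stumps, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
    print(m, round(mse, 4))
```

With squared-error loss the training error cannot increase as stumps are added, which is the behavior the loop prints. XGBoost generalizes this by fitting each tree to first- and second-order gradient statistics of an arbitrary loss rather than to raw residuals.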
The mathematics of XGBoost
The derivation is shown in the figure:

xgboost.jpg
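
The figure itself is not reproduced in this text. As a reference point, the standard derivation from the XGBoost paper and documentation can be summarized as below. The notation is the conventional one and is an assumption here, not something read off the figure: $g_i$ and $h_i$ are the first and second derivatives of the loss at the previous round's prediction, $I_j$ is the set of samples in leaf $j$, $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$, $T$ is the number of leaves, and $w_j$ is the weight of leaf $j$.

```latex
\begin{aligned}
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t),
\qquad \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \\
&\approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t) + \mathrm{const} \\
&= \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \Big] + \gamma T + \mathrm{const} \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda},
\qquad \mathrm{Obj}^{*} = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T + \mathrm{const} \\
\mathrm{Gain} &= \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
\end{aligned}
```

The last line is the score used to decide whether splitting a node into a left part (L) and a right part (R) is worthwhile.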

XGBoost example
MNIST dataset: https://pan.baidu.com/s/1Tz573QiMLuaD-fEXcr4qYA
Extraction code: xozg

import gzip
import pickle as pkl
import time
import numpy as np
from sklearn.model_selection import train_test_split

def load_data(path):
    # The pickle holds a (train, valid, test) tuple; latin1 is needed to read a Python 2 pickle
    with gzip.open(path,'rb') as f:
        train_set,valid_set,test_set = pkl.load(f,encoding = 'latin1')
    return train_set,valid_set,test_set

path = 'D:\\Py_dataset\\mnist.pkl.gz'
train_set,valid_set,test_set = load_data(path)
# test_size = 0.9 discards 90% of each set, keeping a 10% subsample so the demo runs quickly
Xtrain,_,ytrain,_ = train_test_split(train_set[0],train_set[1],test_size = 0.9)
Xtest,_,ytest,_ = train_test_split(test_set[0],test_set[1],test_size = 0.9)

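# Import xgboost under the alias xgb, which the code below uses
import xgboost as xgb
# Convert the data to DMatrix, xgboost's internal data format
dtrain = xgb.DMatrix(Xtrain,ytrain)
dtest = xgb.DMatrix(Xtest,ytest)
# Set the xgboost parameters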
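params = {
    'booster':'gbtree',             # tree-based models
    'objective': 'multi:softmax',   # multi-class classification, predicts class labels
    'num_class':10,                 # ten digit classes
    'eta': 0.1,                     # learning rate (shrinkage)
    'gamma':0,                      # minimum loss reduction required to split
    'alpha': 0,                     # L1 regularization on leaf weights
    'lambda': 2,                    # L2 regularization on leaf weights
    'max_depth': 3,                 # maximum tree depth
    'subsample': 1,                 # fraction of rows sampled per tree
    'colsample_bytree': 1,          # fraction of features sampled per tree
    'min_child_weight': 1,          # minimum sum of instance weights in a child
    'nthread':1,                    # number of CPU threads
}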
num_round = 10

start_time = time.time()
bst = xgb.train(params,dtrain,num_round)
end_time = time.time()
print('The time usage of xgboost {}'.format(end_time - start_time))
The time usage of xgboost 15.09824442863464

y_pred = bst.predict(dtest)
# Divide by the number of test samples, not training samples
accuracy = np.sum(y_pred == ytest)/len(ytest)
print('The accuracy of xgboost {}'.format(accuracy))
The accuracy of xgboost 0.83

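XGBoost parameter reference

'booster':'gbtree', use gradient boosted trees as the base learner
'objective': 'multi:softmax', multi-class classification; for regression use 'objective':'reg:squarederror' (named 'reg:linear' in older releases)
'num_class':10, number of classes, used together with multi:softmax
'gamma': minimum loss reduction required before a node is split
'max_depth': default 6, maximum depth of each tree; the larger it is, the easier it is to overfit
'lambda': default 1, L2 regularization weight on the leaf weights, which controls model complexity; the larger it is, the less likely the model is to overfit
'subsample': default 1, fraction of training rows sampled for each tree
'colsample_bytree':0.7, fraction of features sampled when building each tree
'min_child_weight':3, minimum sum of instance weights in a child node. If a split would produce a leaf whose weight sum is below min_child_weight, that split is not made
'silent':0, setting it to 1 suppresses run-time messages, so 0 is the better choice; newer releases replace this parameter with 'verbosity'
'eta': 0.007, acts like a learning rate: it scales each weak learner's contribution to the final result
'seed':1000, random seed
'nthread':7, number of CPU threads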
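
How `lambda` and `gamma` enter tree building can be sketched with the standard XGBoost formulas for the optimal leaf weight, w* = -G / (H + lambda), and for the split gain. This is a small illustration rather than library code: the function names and the gradient sums below are hypothetical values, not statistics taken from the MNIST model.

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf value given the gradient sum G and hessian sum H in the leaf."""
    return -G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma):
    """Loss reduction from splitting a node; a split is worthwhile only if this is positive."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma


# Hypothetical gradient statistics for one candidate split.
G_left, H_left, G_right, H_right = -4.0, 5.0, 3.0, 4.0

# A larger lambda shrinks the leaf weight toward zero.
for reg_lambda in (0.0, 2.0, 10.0):
    print('lambda', reg_lambda, 'leaf weight', leaf_weight(G_left, H_left, reg_lambda))

# A larger gamma makes the same split harder to accept.
for gamma in (0.0, 1.0, 5.0):
    gain = split_gain(G_left, H_left, G_right, H_right, reg_lambda=2.0, gamma=gamma)
    print('gamma', gamma, 'split accepted', gain > 0)
```

In the same picture, `min_child_weight` is a lower bound on the hessian sum H of each child, and `eta` multiplies the leaf weight before it is added to the running prediction.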
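XGBoost summary
1. XGBoost can use several kinds of model as its base learner.
2. XGBoost uses the second derivative of the loss from the previous round (a second-order Taylor expansion), so its approximation of the objective is more accurate.
3. XGBoost can process features in parallel, which improves efficiency. Like GBDT, however, it builds its trees sequentially, so the boosting rounds themselves cannot be run in parallel.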
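
The distinction in point 3 can be illustrated with a toy sketch: within one boosting round, the best split on each feature can be searched independently, so that work can be handed to separate workers, while the rounds themselves must run one after another. This is only a schematic in plain Python using `concurrent.futures`. It does not reflect how XGBoost is implemented internally (the library does this in C++ over presorted column blocks), Python threads will not give a real speedup here, and `best_split_for_feature` and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def best_split_for_feature(column, grads, hess, reg_lambda=1.0):
    """Scan one feature in sorted order and return (gain, threshold) of its best split."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    G, H = sum(grads), sum(hess)
    G_left = H_left = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i = order[pos]
        G_left += grads[i]
        H_left += hess[i]
        next_value = column[order[pos + 1]]
        if column[i] == next_value:
            continue  # cannot place a threshold between equal values
        G_right, H_right = G - G_left, H - H_left
        gain = 0.5 * (G_left ** 2 / (H_left + reg_lambda)
                      + G_right ** 2 / (H_right + reg_lambda)
                      - G ** 2 / (H + reg_lambda))
        if gain > best_gain:
            best_gain, best_threshold = gain, (column[i] + next_value) / 2
    return best_gain, best_threshold


# Toy data stored column-wise, with squared-error gradients at prediction 0.
columns = [
    [1.0, 2.0, 3.0, 4.0],   # feature 0 separates the targets cleanly
    [5.0, 1.0, 4.0, 2.0],   # feature 1 does not
]
targets = [0.0, 0.0, 1.0, 1.0]
grads = [0.0 - y for y in targets]   # gradient of 0.5 * (pred - y)^2 at pred = 0
hess = [1.0] * len(targets)

# Each feature is searched by its own worker; the results are compared afterwards.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(
        lambda col: best_split_for_feature(col, grads, hess), columns))

best_feature = max(range(len(columns)), key=lambda j: results[j][0])
print(best_feature, results[best_feature])
```

The next tree could only be started after this split search (and the rest of the current tree) has finished, because its gradients depend on the updated predictions.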