This post is part of the 【科研私家菜】 series on machine learning and clinical prediction models in R.
All the R learning material you could want is here, so bookmark and follow 【科研私家菜】.
Today we take on one of the heavyweights among machine learning algorithms: XGBoost!
XGBoost is an open-source gradient boosting framework created by Tianqi Chen, then a PhD student at the University of Washington. It has since become a staple of data science competitions and is very widely used.
01 The XGBoost algorithm
XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible, and portable.
It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on the major distributed environments (Hadoop, SGE, MPI) and can scale to problems with billions of examples and beyond.
From decision trees to XGBoost
XGBoost is a decision-tree-based ensemble machine learning algorithm built on the gradient boosting framework. For prediction problems on unstructured data (images, text, and so on), artificial neural networks tend to outperform other algorithms and frameworks. But for small-to-medium structured or tabular data, tree-based algorithms are currently widely regarded as the best choice.
Both XGBoost and the gradient boosting machine (GBM) are ensemble tree methods that boost weak learners (typically CART trees) using a gradient descent architecture. XGBoost, however, improves on the basic GBM framework through systems optimization and algorithmic enhancements. It was formally introduced by Tianqi Chen in the 2016 paper "XGBoost: A Scalable Tree Boosting System".
XGBoost shares the basic idea of GBDT but adds several optimizations: a second-order Taylor expansion makes the loss approximation more accurate, an explicit regularization term keeps the trees from overfitting, and block-based storage enables parallel computation. The objective function is optimized in three steps (sketched in the equations after this list):
Step 1: apply a second-order Taylor expansion and drop the constant terms to simplify the loss term;
Step 2: expand the regularization term and drop its constant part;
Step 3: collect the first-order and second-order coefficients to obtain the final objective.
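A rough sketch of those three steps, in the notation of the XGBoost paper (Chen & Guestrin, 2016): let $g_i$ and $h_i$ be the first and second derivatives of the loss at the current prediction, $f_t$ the tree added at step $t$ with leaf weights $w_j$, $I_j$ the set of samples falling into leaf $j$, $T$ the number of leaves, and $\gamma$, $\lambda$ the regularization parameters. After the Taylor expansion and dropping constants, the objective becomes

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\Big[g_i f_t(x_i) + \tfrac{1}{2}h_i f_t^{2}(x_i)\Big] + \gamma T + \tfrac{1}{2}\lambda\sum_{j=1}^{T} w_j^{2} = \sum_{j=1}^{T}\Big[\Big(\sum_{i\in I_j} g_i\Big)w_j + \tfrac{1}{2}\Big(\sum_{i\in I_j} h_i + \lambda\Big)w_j^{2}\Big] + \gamma T,$$

and minimizing over each leaf weight gives

$$w_j^{*} = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}, \qquad \mathcal{L}^{(t)}_{\min} = -\frac{1}{2}\sum_{j=1}^{T}\frac{\big(\sum_{i\in I_j} g_i\big)^{2}}{\sum_{i\in I_j} h_i + \lambda} + \gamma T,$$

which is the score XGBoost uses to evaluate tree structures and candidate splits.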
XGBoost is efficient, flexible, and lightweight, and it is widely used in data mining, recommender systems, and many other fields.
How XGBoost improves on the standard GBM algorithm
Compared with other algorithms, XGBoost differs in the following respects:
- Wide range of applications: the algorithm handles regression, classification, ranking, and user-defined prediction problems;
- Portability: it runs smoothly on Windows, Linux, and OS X;
- Languages: it supports almost all mainstream programming languages, including C++, Python, R, Java, Scala, and Julia;
- Cloud integration: it supports AWS, Azure, and YARN clusters, and works well with other ecosystems such as Flink and Spark.
(Figure: summary of the XGBoost algorithm)
(Figure: strengths and weaknesses of the XGBoost algorithm)
02 Implementing XGBoost in R
The first example follows the custom-objective demo that ships with the xgboost R package: it trains on the built-in agaricus (mushroom) dataset using a user-defined logistic objective and a custom error metric.
require(xgboost)
# load in the agaricus dataset
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
# note: for a customized objective function, we leave 'objective' at its default
# note: what we then get from prediction is the margin value
# you must know what you are doing
watchlist <- list(eval = dtest, train = dtrain)
num_round <- 2
# user-defined objective function: given the predictions, return the gradient and the second-order gradient
# this is the logistic (negative log-likelihood) loss
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}
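# for reference: with p = 1 / (1 + exp(-margin)), the logistic loss is
# l = -(y * log(p) + (1 - y) * log(1 - p)); its first derivative with respect to the
# margin is p - y and its second derivative is p * (1 - p), which are exactly the
# grad and hess returned above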
# user-defined evaluation function; it returns a pair (metric_name, result)
# NOTE: with a customized loss function, the default prediction value is the margin
# this may keep the built-in evaluation metrics from working properly
# for example, we are doing logistic loss here, so the prediction is the score before the logistic transformation
# but the built-in evaluation error assumes the input comes after the logistic transformation
# keep this in mind when you use the customization; you may need to write a customized evaluation function
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- as.numeric(sum(labels != (preds > 0))) / length(labels)
  return(list(metric = "error", value = err))
}
param <- list(max_depth = 2, eta = 1, nthread = 2, verbosity = 0,
              objective = logregobj, eval_metric = evalerror)
print('start training with user customized objective')
# training with a customized objective; we can also do step-by-step training
# (see how xgb.train is implemented in the xgboost R package)
bst <- xgb.train(param, dtrain, num_round, watchlist)
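# a minimal usage sketch: with a customized objective, predict() returns the raw margin,
# so apply the sigmoid yourself before thresholding at 0.5
pred_margin <- predict(bst, dtest)
pred_prob <- 1 / (1 + exp(-pred_margin))
pred_err <- mean(as.numeric(pred_prob > 0.5) != getinfo(dtest, "label"))
print(pred_err)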
#
# there can be cases where you want additional information
# to be taken into account besides what getinfo() can retrieve from the DMatrix
# you can attach such additional information as attributes of the DMatrix
# here we set the label of dtrain as an attribute just as an example; it can be anything
attr(dtrain, 'label') <- getinfo(dtrain, 'label')
# this is the new customized objective, where you can access the things you set
# the same applies to a customized evaluation function
logregobjattr <- function(preds, dtrain) {
  # now you can access the attribute in customized function
  labels <- attr(dtrain, 'label')
  preds <- 1 / (1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}
param <- list(max_depth = 2, eta = 1, nthread = 2, verbosity = 0,
              objective = logregobjattr, eval_metric = evalerror)
print('start training with user customized objective, with additional attributes in DMatrix')
# training with a customized objective; we can also do step-by-step training
# (see how xgb.train is implemented in the xgboost R package)
bst <- xgb.train(param, dtrain, num_round, watchlist)
The second example fits an XGBoost regression model to the wine dataset from the breakDown package and then uses DALEX to build an explainer for the fitted model.
# install.packages("breakDown")
library("breakDown")
head(wine)
View(wine)
str(wine)
library("xgboost")
model_matrix_train <- model.matrix(quality ~ . - 1, wine)
data_train <- xgb.DMatrix(model_matrix_train, label = wine$quality)
# "reg:squarederror" is the current name of the old "reg:linear" objective
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2,
              objective = "reg:squarederror")
wine_xgb_model <- xgb.train(param, data_train, nrounds = 50)
wine_xgb_model
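# a quick sketch of inspecting which features the fitted model relies on
importance <- xgb.importance(feature_names = colnames(model_matrix_train),
                             model = wine_xgb_model)
head(importance)
xgb.plot.importance(importance)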
library("DALEX")
explainer_xgb <- explain(wine_xgb_model,
                         data = model_matrix_train,
                         y = wine$quality,
                         label = "xgboost")
explainer_xgb
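# a minimal sketch of how the explainer can be used, assuming a recent DALEX version:
# model_performance() summarises residuals, model_parts() computes permutation-based
# variable importance, and both results can be passed to plot()
mp_xgb <- model_performance(explainer_xgb)
plot(mp_xgb)
vi_xgb <- model_parts(explainer_xgb)
plot(vi_xgb)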
References:
线性模型已退场,XGBoost时代早已来 (Linear models have stepped aside; the XGBoost era arrived long ago)
https://xgboost.apachecn.org/#/
XGBoost的原理、公式推导、Python实现和应用 (XGBoost: principles, derivation, Python implementation, and applications)
Follow R小盐 and 科研私家菜 (WeChat official account: SciPrivate); if you have any questions, contact R小盐. Let's learn machine learning and clinical prediction modeling in R together.