R Machine Learning and Clinical Prediction Models 50: Prediction with the XGBoost Algorithm

Author: 科研私家菜 | Published 2022-05-15 15:03

This post is part of the 科研私家菜 course series on R machine learning and clinical prediction models.

All the R learning materials you are looking for are here; bookmark and follow 科研私家菜.


Today we take on one of the heavy hitters among machine learning algorithms: XGBoost!
xgboost is an open-source gradient boosting framework created by Tianqi Chen during his PhD at the University of Washington. It has since become a staple of data science competitions and is very widely used.

01 The XGBoost Algorithm

XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT or GBM), which solves many data science problems quickly and accurately. The same code runs in the major distributed environments (Hadoop, SGE, MPI) and scales to problems with billions of examples.


From decision trees to XGBoost

XGBoost is a decision-tree-based ensemble machine learning algorithm built on the gradient boosting framework. For prediction problems on unstructured data (images, text, etc.), artificial neural networks tend to outperform other algorithms and frameworks, but for small-to-medium structured or tabular data, decision-tree-based algorithms are currently regarded as best in class.
Both XGBoost and the gradient boosting machine (GBM) are ensemble tree methods that boost weak learners (typically CART trees) within a gradient descent architecture, but XGBoost improves on the basic GBM framework through systems optimization and algorithmic enhancements. It was formally introduced by Tianqi Chen in the 2016 paper "XGBoost: A Scalable Tree Boosting System".

The basic idea of XGBoost is the same as GBDT, with several optimizations: second-order derivatives make the loss approximation more precise, a regularization term keeps the trees from overfitting, and block-based storage enables parallel computation. The XGBoost objective function is optimized in three steps (a compact sketch of the resulting formulas follows the list):
Step 1: take a second-order Taylor expansion, drop the constant terms, and optimize the loss term;
Step 2: expand the regularization term, drop the constant terms, and optimize the regularization term;
Step 3: collect the first-order and second-order coefficients to obtain the final objective function.
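
As a compact sketch of what these three steps produce (following the standard derivation in the XGBoost paper, where $g_i$ and $h_i$ are the first and second derivatives of the loss for sample $i$, $T$ is the number of leaves, $w_j$ the weight of leaf $j$, and $\gamma$, $\lambda$ the regularization parameters):

$$\mathcal{L}^{(t)} \approx \sum_{i}\Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i)\Big] + \gamma T + \tfrac{1}{2}\lambda\sum_{j=1}^{T} w_j^{2}$$

Grouping the samples that fall into leaf $j$ ($G_j = \sum g_i$, $H_j = \sum h_i$) gives the optimal leaf weight and the final objective used to score tree structures:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \tilde{\mathcal{L}}^{(t)} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T$$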

XGBoost is efficient, flexible, and lightweight, and is widely used in data mining, recommender systems, and other fields.


How XGBoost optimizes the standard GBM algorithm

Compared with other algorithms, XGBoost stands out in the following ways:

  1. Broad applicability: it can handle regression, classification, ranking, and user-defined prediction problems;
  2. Portability: it runs smoothly on Windows, Linux, and OS X;
  3. Languages: it supports almost all mainstream programming languages, including C++, Python, R, Java, Scala, and Julia;
  4. Cloud integration: it supports AWS, Azure, and YARN clusters and works well with ecosystems such as Flink and Spark.


    [Figure: summary of the XGBoost algorithm]
    [Figure: strengths and weaknesses of the XGBoost algorithm]

02 Implementing XGBoost in R

require(xgboost)
# load in the agaricus dataset
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

# note: with a customized objective, what the booster returns at prediction time is the
# raw margin value (the score before the logistic transformation), so any
# transformation back to probabilities has to be done by the user
watchlist <- list(eval = dtest, train = dtrain)
num_round <- 2

# user-defined objective function: given the predictions (margins) and the DMatrix,
# return the gradient and the second-order gradient (Hessian); here it is the
# logistic (negative log-likelihood) loss
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))  # logistic transformation: margin -> probability
  grad <- preds - labels          # first derivative of the log loss w.r.t. the margin
  hess <- preds * (1 - preds)     # second derivative of the log loss w.r.t. the margin
  return(list(grad = grad, hess = hess))
}

# user-defined evaluation function: return a list with the metric name and its value
# NOTE: with a customized loss function the predictions passed in are raw margins,
# which can make built-in evaluation metrics misbehave; for logistic loss the
# prediction is the score before the logistic transformation, while the built-in
# error metric expects probabilities, hence the customized evaluation function below
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- as.numeric(sum(labels != (preds > 0))) / length(labels)
  return(list(metric = "error", value = err))
}

param <- list(max_depth = 2, eta = 1, nthread = 2, verbosity = 0,
              objective = logregobj, eval_metric = evalerror)
print('start training with user customized objective')
# training with the customized objective; step-by-step training is also possible
bst <- xgb.train(param, dtrain, num_round, watchlist)

# sometimes you want additional information available inside the customized functions,
# beyond what getinfo() can retrieve from the DMatrix; such information can be
# attached to the DMatrix as attributes

# set a 'label' attribute on dtrain; label is used here only as an example, it can be anything
attr(dtrain, 'label') <- getinfo(dtrain, 'label')
# this is the new customized objective, which can access the attributes set above;
# the same applies to a customized evaluation function
logregobjattr <- function(preds, dtrain) {
  # now you can access the attribute in customized function
  labels <- attr(dtrain, 'label')
  preds <- 1 / (1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}
param <- list(max_depth = 2, eta = 1, nthread = 2, verbosity = 0,
              objective = logregobjattr, eval_metric = evalerror)
print('start training with user customized objective, with additional attributes in DMatrix')
bst <- xgb.train(param, dtrain, num_round, watchlist)
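
Because the objective above is customized, predict() on the fitted booster returns raw margin scores rather than probabilities. A minimal sketch (using only the objects defined above) of turning those margins into class predictions and a test error rate:

# predictions from a model trained with a customized objective are raw margins
margin <- predict(bst, dtest)
prob <- 1 / (1 + exp(-margin))         # apply the logistic transformation manually
pred <- as.numeric(prob > 0.5)         # classify at the 0.5 threshold
mean(pred != getinfo(dtest, "label"))  # test-set error rate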
The second example fits an XGBoost regression to the wine data from the breakDown package and then explains the model with DALEX.

# install.packages("breakDown")  # run once if breakDown is not installed
library("breakDown")              # provides the wine data set
library("xgboost")

head(wine)
str(wine)
# View(wine)  # opens the data viewer in RStudio

# build a numeric model matrix (without an intercept column) for xgboost
model_matrix_train <- model.matrix(quality ~ . - 1, wine)
data_train <- xgb.DMatrix(model_matrix_train, label = wine$quality)

# "reg:linear" is the old name of this objective; recent releases call it "reg:squarederror"
param <- list(max_depth = 2, eta = 1, nthread = 2, verbosity = 0,
              objective = "reg:squarederror")

wine_xgb_model <- xgb.train(param, data_train, nrounds = 50)
wine_xgb_model
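
To sanity-check the fit, a brief sketch (using only objects created above) that compares the model's predictions on the training matrix with the observed quality scores:

# predicted quality on the training data
pred_quality <- predict(wine_xgb_model, model_matrix_train)
# in-sample root-mean-square error (an optimistic estimate of performance)
sqrt(mean((pred_quality - wine$quality)^2))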

library("DALEX")



explainer_xgb <- explain(wine_xgb_model,
                         
                         data = model_martix_train,
                         
                         y = wine$quality,
                         
                         label = "xgboost")

explainer_xgb
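
With the explainer constructed, DALEX's generic functions can be used to inspect the model. A short sketch (function names from the current DALEX API; older releases exposed similar functionality under names such as variable_importance()):

# overall goodness of fit of the explained model
mp <- model_performance(explainer_xgb)
plot(mp)
# permutation-based variable importance
vi <- model_parts(explainer_xgb)
plot(vi)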



Follow R小盐 and 科研私家菜 (WeChat official account: SciPrivate); if you have questions, contact R小盐. Let's learn R machine learning and clinical prediction models together.
