Microsoft's Heavy Hitter LightGBM: Installation and Usage Notes for R

Author: 飘舞的鼻涕 | Published 2017-11-02 14:20

    Installation

    Installing the R version of lightgbm is noticeably more involved than the usual one-line install.packages('xx'). I fell into quite a few pits when I first installed it, and it took several exchanges with the package authors at Microsoft before it worked, so I am writing the process up here for whoever comes next.
    Note: for any questions, leave a comment on my GitHub blog, or join QQ group [174225475] to discuss and improve together.

    1. Non-GPU version
    • 1.0 Official installation guide: see the References at the end
    • 1.1 Prerequisites
      Install git and cmake
      Note: lightgbm does not support 32-bit R/Rtools
    • 1.2.1 Windows
      Install 64-bit Rtools and add its installation path to the PATH environment variable
      Or simply run the following code:
      library(devtools)
      options(devtools.install.args = "--no-multiarch") # if you have 64-bit R only, you can skip this
      install_github("Microsoft/LightGBM", subdir = "R-package")
    • 1.2.2 Linux
      First install Open MPI,
      then run the following to build the native library (installing the R package on top of the build is sketched after this list):
      git clone --recursive https://github.com/Microsoft/LightGBM ; cd LightGBM
      mkdir build ; cd build
      cmake -DUSE_MPI=ON ..
      make -j4
      Note: glibc >= 2.14 is required.
    • 1.2.3 macOS
      First install gcc and Open MPI:
      brew install openmpi
      brew install cmake
      brew install gcc --without-multilib
      Then:
      git clone --recursive https://github.com/Microsoft/LightGBM ; cd LightGBM
      export CXX=g++-7 CC=gcc-7
      mkdir build ; cd build
      cmake -DUSE_MPI=ON ..
      make -j4
    2. GPU version
      See the References at the end
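
      The Linux and macOS steps above only build the native library; the R package still has to be installed on top of it. A minimal sketch, assuming 64-bit R is available; on these platforms the same devtools call as in 1.2.1 works:

      # run inside R after the native build; same call as in 1.2.1
      library(devtools)
      options(devtools.install.args = "--no-multiarch") # 64-bit build only
      install_github("Microsoft/LightGBM", subdir = "R-package")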

    Usage

    1. Regression

    # rows with label1 != 0 form the training pool; rows with label1 == 0 are to be scored
    train_dt <- df1 %>% filter(label1!=0)
    test_dt <- df1 %>% filter(label1==0)
    # 80/20 random split of the training pool into train and validation
    indx1 <- sample(c(0,1),size = nrow(train_dt),replace = TRUE,prob = c(0.8,0.2))
    train1 <- train_dt %>% filter(indx1==0)
    valid1 <- train_dt %>% filter(indx1==1)
    
    library(lightgbm)
    #lgb.unloader(wipe = TRUE)
    bia1 <- train1 %>% 
      select(-ids1,-label1) %>% 
      data.matrix() # coerce all columns to numeric
    bia2 <- train1$label1 # the regression target
    dtrain <- lgb.Dataset(data=bia1,
                          label=bia2,
                          is_sparse=FALSE,
                          # colnames/categorical_feature declare which features are categorical
                          colnames = colnames(bia1),
                          categorical_feature = c('cateVar1','cateVar2'))
     
    bia3 <- valid1 %>% 
      select(-ids1,-label1) %>%
      data.matrix() # coerce all columns to numeric
    bia4 <- valid1$label1
    dtest <- lgb.Dataset.create.valid(dataset=dtrain,
                                      data=bia3,
                                      label=bia4)
    valids <- list(test=dtest)
    
    params <- list(objective = "regression", metric = "l2") # lowercase-L "l2", not twelve
    lgb1 <- lgb.train(params=params,
                      data=dtrain,
                      valids = valids,
                      min_data = 1, # minimum number of observations in a leaf
                      learning_rate = 0.1, # smaller is slower, but may be more accurate
                      nrounds = 300,
                      early_stopping_rounds = 20) # stop if no improvement for 20 rounds
    
    bia7 <- test_dt %>% 
      select(-ids1,-label1) %>%
      data.matrix() # coerce all columns to numeric
    
    pre.lgb <- predict(lgb1, bia7)
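
    After training it is worth checking which variables the booster actually leaned on, and how it fares on the held-out split. A minimal sketch using lgb.importance() from the lightgbm package, reusing bia3/bia4 from above:

    # gain-based feature importance of the trained booster
    imp1 <- lgb.importance(lgb1, percentage = TRUE)
    head(imp1)
    # RMSE on the held-out validation rows
    pred_valid <- predict(lgb1, bia3)
    sqrt(mean((pred_valid - bia4)^2))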
    

    2. Classification

    As the official demo files point out:

    • In a multiclass model, the label must be numeric and must start from 0
    # We must convert factors to numeric
    # They must be starting from number 0 to use multiclass
    # For instance: 0, 1, 2, 3, 4, 5...
    iris$Species <- as.numeric(as.factor(iris$Species)) - 1
    
    • In a binary model, the label must be numeric, taking the values 0/1

    So, say the important thing three times! Say it three times! Say it three times ... and you still might not remember it:

    For classification with lightgbm, the label must be numeric and must start from 0
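
    A quick sketch of that rule for an arbitrary factor label; keeping the factor levels around lets you map numeric predictions back to the original names (the variable names here are illustrative):

    lab_fac <- as.factor(c("bad", "good", "good", "bad"))
    lab_num <- as.numeric(lab_fac) - 1   # 0-based numeric label: 0 1 1 0
    levels(lab_fac)[lab_num + 1]         # decodes back: "bad" "good" "good" "bad"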

    2.1 Binary classification

    library(bit64)
    library(data.table)
    library(dplyr)
    library(lightgbm)
    
    info1 <- fread('./data2/info1.csv',header = TRUE,encoding = 'UTF-8')
    
    train_dt <- info1 %>% filter(!is.na(overdue))
    test_dt <- info1 %>% filter(is.na(overdue))
    
    set.seed(123)
    train_dt$valid_inx <- sample(c(1,0),nrow(train_dt),replace = TRUE,prob = c(0.2,0.8))
    
    bia1 <- train_dt %>% filter(valid_inx==0) %>% 
      select(-userid,-overdue,-valid_inx) %>% data.matrix()
    # the label is inverted here (1 - overdue); predicted probabilities are inverted back below
    bia2 <- 1-(train_dt %>% filter(valid_inx==0))$overdue
    bia3 <- train_dt %>% filter(valid_inx==1) %>% 
      select(-userid,-overdue,-valid_inx) %>% data.matrix()
    bia4 <- 1-(train_dt %>% filter(valid_inx==1))$overdue
    
    dtrain <- lgb.Dataset(data = bia1, 
                          label = bia2,
                          is_sparse = FALSE,
    # colnames/categorical_feature declare which features are categorical
                          colnames = colnames(bia1),
                          categorical_feature = c('sex','occupation','education','marriage'))
    dtest <- lgb.Dataset.create.valid(dtrain, 
                                      data = bia3, 
                                      label = bia4)
    valids <- list(test = dtest)
    
    
    ## set parameters and train
    param <- list(num_leaves = 70, # at most 2^max_depth is sensible; the LightGBM default is 31
                  min_data_in_leaf = 1, 
                  learning_rate = 0.1, # smaller is slower, but may be more accurate
                  is_unbalance = TRUE, # for an unbalanced training set
                  nthread = 3,
                  verbose = 1,
                  metric = c("auc", "binary_logloss"), # evaluation metrics
                  objective = "binary")
    lgb2 <- lgb.train(params = param, 
                      data = dtrain,
                      nrounds = 200,
                      early_stopping_rounds = 10,
                      valids = valids,
                      bagging_fraction = 0.7, # fraction of rows sampled for each bagging round
                      bagging_freq = 10, # perform bagging every 10 iterations
                      bagging_seed = 1) # seed for the bagging sampling
    
    pred2 <- predict(lgb2, data.matrix(test_dt %>% select(-userid,-overdue)))
    # invert back: the model predicts P(not overdue), so 1 - pred2 is P(overdue)
    pred20 <- as.data.frame(cbind(userid = test_dt$userid, probability = 1-pred2))
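
    To sanity-check the model before scoring, compute AUC on the held-out split and turn the probabilities into hard labels. A minimal sketch; the pROC package is my assumption here, any AUC routine will do:

    library(pROC)                          # assumption: not part of the original post
    valid_prob <- predict(lgb2, bia3)      # probability that the inverted label equals 1
    auc(roc(bia4, valid_prob))             # consistent with the inverted labels in bia4
    pred20$overdue_hat <- as.integer(pred20$probability > 0.5) # hard 0/1 overdue calls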
    

    2.2 Multiclass classification

    The main parameter differences between multiclass and binary classification are:

    1. You must set num_class (the number of label classes)
    params <- list(objective = "multiclass", metric = "multi_error", num_class = 3)
    
    2. The output format of predict() can be customized
    # A (30x3) matrix with the predictions, use parameter reshape
    # class1 class2 class3
    #   obs1   obs1   obs1
    #   obs2   obs2   obs2
    #   ....   ....   ....
    my_preds <- predict(model, test[, 1:4], reshape = TRUE)
    
    # We can also get the predicted scores before the Sigmoid/Softmax application
    my_preds <- predict(model, test[, 1:4], rawscore = TRUE, reshape = TRUE)
    
    # We can also get the leaf index
    my_preds <- predict(model, test[, 1:4], predleaf = TRUE, reshape = TRUE)
    

    Below is the official lightgbm classification demo on the iris dataset, for hands-on reference.

    require(lightgbm)
    # We load the default iris dataset shipped with R
    data(iris)
    
    # We must convert factors to numeric
    # They must be starting from number 0 to use multiclass
    # For instance: 0, 1, 2, 3, 4, 5...
    iris$Species <- as.numeric(as.factor(iris$Species)) - 1
    
    # We cut the data set into 80% train and 20% validation
    # The 10 last samples of each class are for validation
    
    train <- as.matrix(iris[c(1:40, 51:90, 101:140), ])
    test <- as.matrix(iris[c(41:50, 91:100, 141:150), ])
    dtrain <- lgb.Dataset(data = train[, 1:4], label = train[, 5])
    dtest <- lgb.Dataset.create.valid(dtrain, data = test[, 1:4], label = test[, 5])
    valids <- list(test = dtest)
    
    # Method 1 of training
    params <- list(objective = "multiclass", metric = "multi_error", num_class = 3)
    model <- lgb.train(params = params,
                       data = dtrain,
                       nrounds = 100,
                       valids = valids,
                       min_data = 1,
                       learning_rate = 1,
                       early_stopping_rounds = 10)
    
    # probabilities for each class, one column per class:
    my_preds <- predict(model, test[, 1:4], reshape = TRUE)
                [,1]       [,2]       [,3]
     [1,] 0.82590130 0.08704935 0.08704935
     [2,] 0.82590130 0.08704935 0.08704935
     [3,] 0.82590130 0.08704935 0.08704935
     [4,] 0.82590130 0.08704935 0.08704935
     [5,] 0.82590130 0.08704935 0.08704935
     [6,] 0.82590130 0.08704935 0.08704935
    # We can also get the predicted scores before the Sigmoid/Softmax application
    my_preds <- predict(model, test[, 1:4], rawscore = TRUE, reshape = TRUE)
           [,1]  [,2]  [,3]
     [1,]  1.50 -0.75 -0.75
     [2,]  1.50 -0.75 -0.75
     [3,]  1.50 -0.75 -0.75
     [4,]  1.50 -0.75 -0.75
     [5,]  1.50 -0.75 -0.75
     [6,]  1.50 -0.75 -0.75
    # We can also get the leaf index
    my_preds <- predict(model, test[, 1:4], predleaf = TRUE)
          [,1] [,2] [,3]
     [7,]    0    0    0
     [8,]    0    0    0
     [9,]    0    0    0
    [10,]    0    0    0
    [11,]    1    6    0
    [12,]    2    6    0
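
    One step the demo leaves out is collapsing the probability matrix into a single predicted class per observation. A small sketch: max.col() picks the most probable column, and subtracting 1 restores the 0-based labels used in training:

    prob_mat <- predict(model, test[, 1:4], reshape = TRUE)
    pred_class <- max.col(prob_mat) - 1               # 0-based class per row
    table(predicted = pred_class, actual = test[, 5]) # confusion table on the validation set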
    

    3. Parameter tuning

    ## For faster speed
    # Use bagging by setting bagging_fraction and bagging_freq
    # Use feature sub-sampling by setting feature_fraction
    # Use small max_bin
    # Use save_binary to speed up data loading in future learning
    # Use parallel learning, refer to parallel learning guide.
    
    ## For better accuracy
    # Use large max_bin (may be slower)
    # Use small learning_rate with large num_iterations
    # Use large num_leaves (may cause over-fitting)
    # Use bigger training data
    # Try dart
    
    ## Deal with over-fitting
    # Use small max_bin
    # Use small num_leaves
    # Use min_data_in_leaf and min_sum_hessian_in_leaf
    # Use bagging by setting bagging_fraction and bagging_freq
    # Use feature sub-sampling by setting feature_fraction
    # Use bigger training data
    # Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
    # Try max_depth to avoid growing deep trees
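    
    ## Pulling several of the knobs above together, a hedged example of what a
    ## tuned parameter list might look like; every value is illustrative,
    ## not a recommendation from this post:
    params_tuned <- list(objective = "binary",
                         metric = "auc",
                         learning_rate = 0.05,    # small rate; pair with a large nrounds
                         num_leaves = 31,
                         max_depth = 6,           # cap tree depth against over-fitting
                         min_data_in_leaf = 20,
                         feature_fraction = 0.8,  # feature sub-sampling
                         bagging_fraction = 0.7,  # row sub-sampling
                         bagging_freq = 5,
                         lambda_l1 = 0.1,         # L1/L2 regularization
                         lambda_l2 = 0.1)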
    

    References

    lightgbm R-package github
    lightgbm demos
