82-mlr3+OpenML实例

作者: wonphen | 来源:发表于2020-10-10 12:22 被阅读0次

82-mlr3+OpenML实例
SQL C语言基本操作
Python-数据类型及其操作方法
HTML基础-03
Python 类属性、实例属性、类方法、实例方法
STL算法之常用拷贝和替换
Vue 基础
类中的方法
C语言100例
AWS云计算助手级架构师认证之EC2-实例购买类型

任务：预测美国职业棒球大联盟球队得分。

1、从OpenML上获取数据

library(pacman)
p_load(mlr3, mlr3learners, mlr3pipelines, mlr3measures, OpenML)

moneyball <- getOMLDataSet(data.id = 41021L)

## Downloading from 'http://www.openml.org/api/v1/data/41021' to '/tmp/RtmpwQuBr4/cache/datasets/41021/description.xml'.

## Downloading from 'https://www.openml.org/data/v1/download/18626236/Moneyball.arff' to '/tmp/RtmpwQuBr4/cache/datasets/41021/dataset.arff'

## Loading required package: readr

moneyball

## 
## Data Set 'Moneyball' :: (Version = 2, OpenML ID = 41021)
##   Default Target Attribute: RS

str(moneyball$data)

## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : num  2012 2012 2012 2012 2012 ...
##  $ RS          : num  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : num  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : num  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 1 2 ...
##  $ RankSeason  : Factor w/ 8 levels "1","2","3","4",..: NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: Factor w/ 5 levels "1","2","3","4",..: NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : Factor w/ 8 levels "158","159","160",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

DataExplorer::profile_missing(moneyball$data)

##         feature num_missing pct_missing
## 1          Team           0   0.0000000
## 2        League           0   0.0000000
## 3          Year           0   0.0000000
## 4            RS           0   0.0000000
## 5            RA           0   0.0000000
## 6             W           0   0.0000000
## 7           OBP           0   0.0000000
## 8           SLG           0   0.0000000
## 9            BA           0   0.0000000
## 10     Playoffs           0   0.0000000
## 11   RankSeason         988   0.8019481
## 12 RankPlayoffs         988   0.8019481
## 13            G           0   0.0000000
## 14         OOBP         812   0.6590909
## 15         OSLG         812   0.6590909

RankSeason和RankPlayoffs各有988个缺失值，OOBG和OSLG各有812个缺失值。

summary(moneyball$data)

##       Team     League        Year            RS               RA               W              OBP        
##  BAL    : 47   AL:616   Min.   :1962   Min.   : 463.0   Min.   : 472.0   Min.   : 40.0   Min.   :0.2770  
##  BOS    : 47   NL:616   1st Qu.:1977   1st Qu.: 652.0   1st Qu.: 649.8   1st Qu.: 73.0   1st Qu.:0.3170  
##  CHC    : 47            Median :1989   Median : 711.0   Median : 709.0   Median : 81.0   Median :0.3260  
##  CHW    : 47            Mean   :1989   Mean   : 715.1   Mean   : 715.1   Mean   : 80.9   Mean   :0.3263  
##  CIN    : 47            3rd Qu.:2002   3rd Qu.: 775.0   3rd Qu.: 774.2   3rd Qu.: 89.0   3rd Qu.:0.3370  
##  CLE    : 47            Max.   :2012   Max.   :1009.0   Max.   :1103.0   Max.   :116.0   Max.   :0.3730  
##  (Other):950                                                                                             
##       SLG               BA         Playoffs   RankSeason  RankPlayoffs       G            OOBP       
##  Min.   :0.3010   Min.   :0.2140   0:988    2      : 53   1   : 47     162    :954   Min.   :0.2940  
##  1st Qu.:0.3750   1st Qu.:0.2510   1:244    1      : 52   2   : 47     161    :139   1st Qu.:0.3210  
##  Median :0.3960   Median :0.2600            3      : 44   3   : 80     163    : 93   Median :0.3310  
##  Mean   :0.3973   Mean   :0.2593            4      : 44   4   : 68     160    : 23   Mean   :0.3323  
##  3rd Qu.:0.4210   3rd Qu.:0.2680            5      : 21   5   :  2     159    : 10   3rd Qu.:0.3430  
##  Max.   :0.4910   Max.   :0.2940            (Other): 30   NA's:988     164    : 10   Max.   :0.3840  
##                                             NA's   :988                (Other):  3   NA's   :812     
##       OSLG       
##  Min.   :0.3460  
##  1st Qu.:0.4010  
##  Median :0.4190  
##  Mean   :0.4197  
##  3rd Qu.:0.4380  
##  Max.   :0.4990  
##  NA's   :812

RankSeason(排名季节)，RankPlayoffs（排名季后赛）、OOBP（对手基准百分比）、OSLG（对手拍击率）都存在缺失值。

对待缺失值，我们一般有三种方法，1、删除有缺失值的行；2、删除有缺失值的列；3、基于数据基础分布，使用模型来估算缺失值进行插补。

在机器学习中，可以尝试所有选项，然后看看那种方法最能解决潜在的问题。

2、缺失值插补

# 插补数值型变量
imp.num <- po("imputehist", param_vals = list(affect_columns = selector_type("numeric")))
# 插补因子型变量
imp.fac <- po("imputeoor", param_vals = list(affect_columns = selector_type("factor")))

graph <- imp.num %>>% imp.fac
plot(graph)

缺失值插补流程图

变量G代表玩过的游戏，将其转换为数值型。

moneyball$data$G <- as.numeric(moneyball$data$G)
str(moneyball$data)

## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : num  2012 2012 2012 2012 2012 ...
##  $ RS          : num  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : num  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : num  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 1 2 ...
##  $ RankSeason  : Factor w/ 8 levels "1","2","3","4",..: NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: Factor w/ 5 levels "1","2","3","4",..: NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

3、创建回归任务和学习器

tsk <- TaskRegr$new(id = "moneyball", backend = moneyball$data, target = "RS")
# 查看缺失值情况
tsk$missings()

##           RS           BA            G       League          OBP         OOBP         OSLG     Playoffs 
##            0            0            0            0            0          812          812            0 
##           RA RankPlayoffs   RankSeason          SLG         Team            W         Year 
##            0          988          988            0            0            0            0

# 使用随机森林模型
test.lrn <- LearnerRegrRanger$new()
test.lrn$properties

## [1] "importance" "oob_error"  "weights"

输出中没有显示处理缺失值，所以随机森林模型不能处理缺失值。

4、向学习器中加入新的方法

polrn <- PipeOpLearner$new(mlr_learners$get("regr.ranger"))
# 设置树的数量和重要性
polrn$param_set$values <- list(num.trees = 1000, importance = "permutation")

lrn <- GraphLearner$new(graph = graph %>>% polrn)

plot(graph %>>% polrn)

加入新方法后的流程图

5、训练和预测

# 拆分训练集和测试集
dtrain <- sample(tsk$nrow, 0.85 * tsk$nrow)
dtest <- setdiff(seq_len(tsk$nrow), dtrain)
# 训练
lrn$train(tsk, row_ids = dtrain)
# 预测
pred <- lrn$predict(tsk, row_ids = dtest)
head(data.frame(truth = pred$data$truth,
                response = pred$data$response))

## truth  response
## <dbl>  <dbl>
## 776  749.5148            
## 804  812.0065            
## 619  599.2882            
## 708  709.7483            
## 654  671.5518            
## 619  612.6973

可以看到，预测已经非常的相近。

6、模型评估与重采样

cv10 <- rsmp("cv", folds = 10L)
rs <- resample(tsk, lrn, cv10)

使用平均绝对误差作为性能指标：

scores <- rs$score(msrs(c("regr.mae", "regr.mse")))
scores

##               task   task_id            learner                       learner_id         resampling
##  1: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##  2: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##  3: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##  4: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##  5: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##  6: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##  7: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##  8: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##  9: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 10: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
##     resampling_id iteration           prediction regr.mae regr.mse
##  1:            cv         1 <PredictionRegr[18]> 20.09583 623.2042
##  2:            cv         2 <PredictionRegr[18]> 19.05409 578.1255
##  3:            cv         3 <PredictionRegr[18]> 19.61329 677.1250
##  4:            cv         4 <PredictionRegr[18]> 22.31534 702.1785
##  5:            cv         5 <PredictionRegr[18]> 17.19854 520.0973
##  6:            cv         6 <PredictionRegr[18]> 19.64526 629.9101
##  7:            cv         7 <PredictionRegr[18]> 20.43314 678.2737
##  8:            cv         8 <PredictionRegr[18]> 18.29452 544.5972
##  9:            cv         9 <PredictionRegr[18]> 19.01164 600.7885
## 10:            cv        10 <PredictionRegr[18]> 19.58444 567.4449

查看RS变量的平均值：

mean(moneyball$data$RS)

## [1] 715.082

查看平均错误：

rs$aggregate(msr("regr.mae"))

## regr.mae 
## 19.52461

每个团队每个赛季约715次跑步，但错误只有约19，所以看起来还不错。

7、验证缺失值插补是否导致性能提升

不插补数值型变量。

graph2 <- as_graph(imp.fac)
# 选取所有没有缺失值的列
fea.names <- colnames(moneyball$data[!sapply(moneyball$data, anyNA)])
fea.names

##  [1] "Team"     "League"   "Year"     "RS"       "RA"       "W"        "OBP"      "SLG"      "BA"      
## [10] "Playoffs" "G"

# 将因子型变量加进去
fea.names <- c(fea.names[fea.names %in% tsk$feature_names], "RankSeason", "RankPlayoffs")
fea.names

##  [1] "Team"         "League"       "Year"         "RA"           "W"            "OBP"          "SLG"         
##  [8] "BA"           "Playoffs"     "G"            "RankSeason"   "RankPlayoffs"

na.select <- po("select")
na.select$param_set$values$selector = selector_name(fea.names)

graph2 = graph2 %>>% na.select
graph2$plot()

新的预处理流程图

最后同样使用十折交叉验证法计算误差：

lrn2 <- GraphLearner$new(graph2 %>>% polrn)

rs2 <- resample(tsk, lrn2, cv10)

rs2$aggregate(msr("regr.mae"))

## regr.mae 
## 19.00884

结果比插补后还略低，说明插补并没有什么帮助。

从大到小查看变量重要性排序：

sort(lrn$model$regr.ranger$model$variable.importance, decreasing = T)

##          SLG          OBP           BA            W           RA         Year   RankSeason RankPlayoffs 
##  3383.089511  2608.974531   871.585381   699.783819   490.893857   242.122994   133.036311   106.005472 
##     Playoffs         OOBP       League         OSLG         Team            G 
##    94.836392    31.602422    26.851604    24.386422    18.048813     4.417069

变量OOBP和OSLG不在列表中，说明它们的重要性很低，当然对它们进行缺失值插补不会发挥多大的作用了。

82-mlr3+OpenML实例
任务：预测美国职业棒球大联盟球队得分。 1、从OpenML上获取数据 RankSeason和RankPlayoff...
SQL C语言基本操作
相关API 打开实例关闭实例获取错误消息操作表实例创建实例插入实例修改实例删除实例回调查询非回调
Python-数据类型及其操作方法
数字类型代码实例: 字符串类型代码实例: 列表代码实例: 元组代码实例字典：代码实例集合代码实例：
HTML基础-03
HTML 标题实例 HTML 段落实例 HTML 链接实例 HTML 图像实例
Python 类属性、实例属性、类方法、实例方法
1、实例属性实例属性，就是赋给由类创建的实例的属性，实例属性属于它所属的实例，不同实例之间的实例属性可以不同。 ...
STL算法之常用拷贝和替换
copy API 实例 replace API 实例 replace_if API 实例 swap API 实例
Vue 基础
Vue 实例 1. Vue实例 2. 实例属性 3. 实例方法/数据 4. 实例方法/事件 5. 实例方法/生命周...
类中的方法
1.实例方法的调用方式实例对象.实例方法（）类对象.实例方法(实例对象) 例如： class Student ...
C语言100例
C 练习实例01C 练习实例02C 练习实例03C 练习实例04C 练习实例05C 练习实例06 C 练习实例07...
AWS云计算助手级架构师认证之EC2-实例购买类型
在购买EC2实例的时候，这里有三种购买类型需要理解：按需实例，预留实例，计划实例，竞价实例。按需实例： ...