任务:预测美国职业棒球大联盟球队得分。
1、从OpenML上获取数据
library(pacman)
p_load(mlr3, mlr3learners, mlr3pipelines, mlr3measures, OpenML)
moneyball <- getOMLDataSet(data.id = 41021L)
## Downloading from 'http://www.openml.org/api/v1/data/41021' to '/tmp/RtmpwQuBr4/cache/datasets/41021/description.xml'.
## Downloading from 'https://www.openml.org/data/v1/download/18626236/Moneyball.arff' to '/tmp/RtmpwQuBr4/cache/datasets/41021/dataset.arff'
## Loading required package: readr
moneyball
##
## Data Set 'Moneyball' :: (Version = 2, OpenML ID = 41021)
## Default Target Attribute: RS
str(moneyball$data)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
## $ Year : num 2012 2012 2012 2012 2012 ...
## $ RS : num 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : num 688 600 705 806 759 676 588 845 890 670 ...
## $ W : num 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 1 2 ...
## $ RankSeason : Factor w/ 8 levels "1","2","3","4",..: NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: Factor w/ 5 levels "1","2","3","4",..: NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : Factor w/ 8 levels "158","159","160",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
DataExplorer::profile_missing(moneyball$data)
## feature num_missing pct_missing
## 1 Team 0 0.0000000
## 2 League 0 0.0000000
## 3 Year 0 0.0000000
## 4 RS 0 0.0000000
## 5 RA 0 0.0000000
## 6 W 0 0.0000000
## 7 OBP 0 0.0000000
## 8 SLG 0 0.0000000
## 9 BA 0 0.0000000
## 10 Playoffs 0 0.0000000
## 11 RankSeason 988 0.8019481
## 12 RankPlayoffs 988 0.8019481
## 13 G 0 0.0000000
## 14 OOBP 812 0.6590909
## 15 OSLG 812 0.6590909
RankSeason和RankPlayoffs各有988个缺失值,OOBG和OSLG各有812个缺失值。
summary(moneyball$data)
## Team League Year RS RA W OBP
## BAL : 47 AL:616 Min. :1962 Min. : 463.0 Min. : 472.0 Min. : 40.0 Min. :0.2770
## BOS : 47 NL:616 1st Qu.:1977 1st Qu.: 652.0 1st Qu.: 649.8 1st Qu.: 73.0 1st Qu.:0.3170
## CHC : 47 Median :1989 Median : 711.0 Median : 709.0 Median : 81.0 Median :0.3260
## CHW : 47 Mean :1989 Mean : 715.1 Mean : 715.1 Mean : 80.9 Mean :0.3263
## CIN : 47 3rd Qu.:2002 3rd Qu.: 775.0 3rd Qu.: 774.2 3rd Qu.: 89.0 3rd Qu.:0.3370
## CLE : 47 Max. :2012 Max. :1009.0 Max. :1103.0 Max. :116.0 Max. :0.3730
## (Other):950
## SLG BA Playoffs RankSeason RankPlayoffs G OOBP
## Min. :0.3010 Min. :0.2140 0:988 2 : 53 1 : 47 162 :954 Min. :0.2940
## 1st Qu.:0.3750 1st Qu.:0.2510 1:244 1 : 52 2 : 47 161 :139 1st Qu.:0.3210
## Median :0.3960 Median :0.2600 3 : 44 3 : 80 163 : 93 Median :0.3310
## Mean :0.3973 Mean :0.2593 4 : 44 4 : 68 160 : 23 Mean :0.3323
## 3rd Qu.:0.4210 3rd Qu.:0.2680 5 : 21 5 : 2 159 : 10 3rd Qu.:0.3430
## Max. :0.4910 Max. :0.2940 (Other): 30 NA's:988 164 : 10 Max. :0.3840
## NA's :988 (Other): 3 NA's :812
## OSLG
## Min. :0.3460
## 1st Qu.:0.4010
## Median :0.4190
## Mean :0.4197
## 3rd Qu.:0.4380
## Max. :0.4990
## NA's :812
RankSeason(排名季节),RankPlayoffs(排名季后赛)、OOBP(对手基准百分比)、OSLG(对手拍击率)都存在缺失值。
对待缺失值,我们一般有三种方法,1、删除有缺失值的行;2、删除有缺失值的列;3、基于数据基础分布,使用模型来估算缺失值进行插补。
在机器学习中,可以尝试所有选项,然后看看那种方法最能解决潜在的问题。
2、缺失值插补
# 插补数值型变量
imp.num <- po("imputehist", param_vals = list(affect_columns = selector_type("numeric")))
# 插补因子型变量
imp.fac <- po("imputeoor", param_vals = list(affect_columns = selector_type("factor")))
graph <- imp.num %>>% imp.fac
plot(graph)

变量G代表玩过的游戏,将其转换为数值型。
moneyball$data$G <- as.numeric(moneyball$data$G)
str(moneyball$data)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
## $ Year : num 2012 2012 2012 2012 2012 ...
## $ RS : num 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : num 688 600 705 806 759 676 588 845 890 670 ...
## $ W : num 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 1 2 ...
## $ RankSeason : Factor w/ 8 levels "1","2","3","4",..: NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: Factor w/ 5 levels "1","2","3","4",..: NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : num 5 5 5 5 5 5 5 5 5 5 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
3、创建回归任务和学习器
tsk <- TaskRegr$new(id = "moneyball", backend = moneyball$data, target = "RS")
# 查看缺失值情况
tsk$missings()
## RS BA G League OBP OOBP OSLG Playoffs
## 0 0 0 0 0 812 812 0
## RA RankPlayoffs RankSeason SLG Team W Year
## 0 988 988 0 0 0 0
# 使用随机森林模型
test.lrn <- LearnerRegrRanger$new()
test.lrn$properties
## [1] "importance" "oob_error" "weights"
输出中没有显示处理缺失值,所以随机森林模型不能处理缺失值。
4、向学习器中加入新的方法
polrn <- PipeOpLearner$new(mlr_learners$get("regr.ranger"))
# 设置树的数量和重要性
polrn$param_set$values <- list(num.trees = 1000, importance = "permutation")
lrn <- GraphLearner$new(graph = graph %>>% polrn)
plot(graph %>>% polrn)

5、训练和预测
# 拆分训练集和测试集
dtrain <- sample(tsk$nrow, 0.85 * tsk$nrow)
dtest <- setdiff(seq_len(tsk$nrow), dtrain)
# 训练
lrn$train(tsk, row_ids = dtrain)
# 预测
pred <- lrn$predict(tsk, row_ids = dtest)
head(data.frame(truth = pred$data$truth,
response = pred$data$response))
## truth response
## <dbl> <dbl>
## 776 749.5148
## 804 812.0065
## 619 599.2882
## 708 709.7483
## 654 671.5518
## 619 612.6973
可以看到,预测已经非常的相近。
6、模型评估与重采样
cv10 <- rsmp("cv", folds = 10L)
rs <- resample(tsk, lrn, cv10)
使用平均绝对误差作为性能指标:
scores <- rs$score(msrs(c("regr.mae", "regr.mse")))
scores
## task task_id learner learner_id resampling
## 1: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 2: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 3: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 4: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 5: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 6: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 7: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 8: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 9: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## 10: <TaskRegr[42]> moneyball <GraphLearner[31]> imputehist.imputeoor.regr.ranger <ResamplingCV[19]>
## resampling_id iteration prediction regr.mae regr.mse
## 1: cv 1 <PredictionRegr[18]> 20.09583 623.2042
## 2: cv 2 <PredictionRegr[18]> 19.05409 578.1255
## 3: cv 3 <PredictionRegr[18]> 19.61329 677.1250
## 4: cv 4 <PredictionRegr[18]> 22.31534 702.1785
## 5: cv 5 <PredictionRegr[18]> 17.19854 520.0973
## 6: cv 6 <PredictionRegr[18]> 19.64526 629.9101
## 7: cv 7 <PredictionRegr[18]> 20.43314 678.2737
## 8: cv 8 <PredictionRegr[18]> 18.29452 544.5972
## 9: cv 9 <PredictionRegr[18]> 19.01164 600.7885
## 10: cv 10 <PredictionRegr[18]> 19.58444 567.4449
查看RS变量的平均值:
mean(moneyball$data$RS)
## [1] 715.082
查看平均错误:
rs$aggregate(msr("regr.mae"))
## regr.mae
## 19.52461
每个团队每个赛季约715次跑步,但错误只有约19,所以看起来还不错。
7、验证缺失值插补是否导致性能提升
不插补数值型变量。
graph2 <- as_graph(imp.fac)
# 选取所有没有缺失值的列
fea.names <- colnames(moneyball$data[!sapply(moneyball$data, anyNA)])
fea.names
## [1] "Team" "League" "Year" "RS" "RA" "W" "OBP" "SLG" "BA"
## [10] "Playoffs" "G"
# 将因子型变量加进去
fea.names <- c(fea.names[fea.names %in% tsk$feature_names], "RankSeason", "RankPlayoffs")
fea.names
## [1] "Team" "League" "Year" "RA" "W" "OBP" "SLG"
## [8] "BA" "Playoffs" "G" "RankSeason" "RankPlayoffs"
na.select <- po("select")
na.select$param_set$values$selector = selector_name(fea.names)
graph2 = graph2 %>>% na.select
graph2$plot()

最后同样使用十折交叉验证法计算误差:
lrn2 <- GraphLearner$new(graph2 %>>% polrn)
rs2 <- resample(tsk, lrn2, cv10)
rs2$aggregate(msr("regr.mae"))
## regr.mae
## 19.00884
结果比插补后还略低,说明插补并没有什么帮助。
从大到小查看变量重要性排序:
sort(lrn$model$regr.ranger$model$variable.importance, decreasing = T)
## SLG OBP BA W RA Year RankSeason RankPlayoffs
## 3383.089511 2608.974531 871.585381 699.783819 490.893857 242.122994 133.036311 106.005472
## Playoffs OOBP League OSLG Team G
## 94.836392 31.602422 26.851604 24.386422 18.048813 4.417069
变量OOBP和OSLG不在列表中,说明它们的重要性很低,当然对它们进行缺失值插补不会发挥多大的作用了。
网友评论