45 - Machine Learning with R: Neural Networks and Deep Learning

Author: wonphen | Published 2020-03-12 12:31

    Study notes on Mastering Machine Learning with R, Second Edition

    1. Introduction to Neural Networks

    The term "neural network" is quite broad and covers a family of related methods. We focus on feed-forward networks trained with backpropagation.
    The strength of neural network models is that they can capture highly complex relationships between the input variables (features) and the response, especially when those relationships are strongly nonlinear. Building and evaluating a neural network requires no underlying distributional assumptions, and it works for both quantitative and qualitative responses.
    The result of a neural network is a black box: there is no equation with coefficients to inspect and share with business partners, and the results are all but uninterpretable. Another criticism is that it is unclear how the results change when the initial random inputs change. Finally, training a neural network is expensive in both time and computation.
    Commonly used activation functions: sigmoid, rectifier (ReLU), maxout, and the hyperbolic tangent (tanh).
    Plot the sigmoid function with R:

    > library(pacman)
    > p_load(ggplot2, dplyr, hrbrthemes)
    > sigmoid <- function(x) {
    +     1/(1 + exp(-x))
    + }
    > 
    > x <- seq(-5, 5, 0.1)
    > df <- tibble(sigmoid.x = sigmoid(x), index = 1:length(x))
    > ggplot(df, aes(index, sigmoid.x)) + geom_point() + theme_ft_rc()
    
    Figure: the sigmoid function

    The tanh() function (hyperbolic tangent) is a variant of the sigmoid; its output ranges from -1 to 1.

    > tibble(x = x, sigmoid.x = sigmoid(x), tanh.x = tanh(x)) %>% ggplot(aes(x)) + 
    +     geom_line(aes(y = sigmoid.x, color = "sigmoid"), size = 1) + 
    +     geom_line(aes(y = tanh.x, color = "tanh"), size = 1) + 
    +     theme_ft_rc() + theme(legend.position = "top", legend.title = element_blank())
    
    Figure: the sigmoid and tanh functions

    2. A Brief Introduction to Deep Learning

    Deep learning is a branch of machine learning built on neural networks. Its distinguishing feature is that it uses machine learning techniques (typically unsupervised) to construct new features from the input variables.

    3. Data Understanding and Preparation

    > library(pacman)
    > p_load(MASS)
    > 
    > data("shuttle")
    > str(shuttle)
    
    ## 'data.frame':    256 obs. of  7 variables:
    ##  $ stability: Factor w/ 2 levels "stab","xstab": 2 2 2 2 2 2 2 2 2 2 ...
    ##  $ error    : Factor w/ 4 levels "LX","MM","SS",..: 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ sign     : Factor w/ 2 levels "nn","pp": 2 2 2 2 2 2 1 1 1 1 ...
    ##  $ wind     : Factor w/ 2 levels "head","tail": 1 1 1 2 2 2 1 1 1 2 ...
    ##  $ magn     : Factor w/ 4 levels "Light","Medium",..: 1 2 4 1 2 4 1 2 4 1 ...
    ##  $ vis      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ use      : Factor w/ 2 levels "auto","noauto": 1 1 1 1 1 1 1 1 1 1 ...
    

    The data set contains 256 observations of 7 variables. All variables are categorical; the response use has two levels, auto and noauto.
     stability: whether positioning is stable (stab / xstab)
     error: size of the error (LX / MM / SS / XL)
     sign: sign of the error, positive or negative (pp / nn)
     wind: wind direction (head / tail)
     magn: wind strength (Light / Medium / Strong / Out of Range)
     vis: visibility (yes / no)

    > table(shuttle$use)
    
    ## 
    ##   auto noauto 
    ##    145    111
    

    Autolanding was chosen in 57% of the decisions. The table() function works perfectly for comparing two variables, but once more variables are added, structable() from the vcd package is a better choice:

    > p_load(vcd)
    > tab1 <- structable(wind + magn ~ use, shuttle)
    > print(tab1)
    
    ##        wind  head                    tail                  
    ##        magn Light Medium Out Strong Light Medium Out Strong
    ## use                                                        
    ## auto           19     19  16     18    19     19  16     19
    ## noauto         13     13  16     14    13     13  16     13
    

    In the table we can see that with a headwind (head) of light strength (Light), autolanding (auto) occurred 19 times and non-autolanding (noauto) occurred 13 times.
    The mosaic() function draws the table produced by structable() as a plot and also reports the p-value of a chi-square test:

    > mosaic(tab1, shade = T)
    
    Figure: mosaic plot of use by wind and magn

    The area of each tile is proportional to the corresponding cell count. The p-value is not significant, so the features are unrelated to the response; that is, wind strength (magn) does not help us predict whether autolanding is used.
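
    To double-check the independence suggested by the mosaic shading, here is a quick chi-square test of magn against use (a minimal sketch using base R's chisq.test; not from the book):

    > # Sketch: chi-square test of independence between wind strength and use;
    > # a large p-value agrees with the non-significant mosaic shading
    > chisq.test(table(shuttle$magn, shuttle$use))

    A second mosaic looks at use by error and vis: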

    > mosaic(use ~ error + vis, shuttle)
    
    Figure: mosaic plot of use by error and vis

    Data preparation for neural networks is very important, because every covariate and the response must be numeric. All variables in this data set are categorical, so we use the caret package to quickly build dummy variables as input features:

    > p_load(caret)
    > dummies <- dummyVars(use ~ ., shuttle, fullRank = T)
    > dummies
    
    ## Dummy Variable Object
    ## 
    ## Formula: use ~ .
    ## <environment: 0x000002699e075128>
    ## 7 variables, 7 factors
    ## Variables and levels will be separated by '.'
    ## A full rank encoding is used
    

    Convert to a data frame:

    > shuttle.2 <- as.data.frame(predict(dummies, newdata = shuttle))
    > names(shuttle.2)
    
    ##  [1] "stability.xstab" "error.MM"        "error.SS"        "error.XL"       
    ##  [5] "sign.pp"         "wind.tail"       "magn.Medium"     "magn.Out"       
    ##  [9] "magn.Strong"     "vis.yes"
    
    > str(shuttle.2)
    
    ## 'data.frame':    256 obs. of  10 variables:
    ##  $ stability.xstab: num  1 1 1 1 1 1 1 1 1 1 ...
    ##  $ error.MM       : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ error.SS       : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ error.XL       : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ sign.pp        : num  1 1 1 1 1 1 0 0 0 0 ...
    ##  $ wind.tail      : num  0 0 0 1 1 1 0 0 0 1 ...
    ##  $ magn.Medium    : num  0 1 0 0 1 0 0 1 0 0 ...
    ##  $ magn.Out       : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ magn.Strong    : num  0 0 1 0 0 1 0 0 1 0 ...
    ##  $ vis.yes        : num  0 0 0 0 0 0 0 0 0 0 ...
    

    We now have an input feature space of 10 variables. For stability, 0 means stab and 1 means xstab. The baseline for error is LX, and three dummy variables represent the other categories.
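
    As an aside, base R's model.matrix() produces the same full-rank dummy coding via its default treatment contrasts (a sketch, not the book's approach; column names lack caret's '.' separator):

    > # Sketch: base-R equivalent of caret's fullRank dummy coding;
    > # drop the intercept column to keep only the dummies
    > mm <- model.matrix(use ~ ., data = shuttle)[, -1]
    > head(colnames(mm))
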
    The response variable can be built with ifelse():

    > shuttle.2$use <- ifelse(shuttle$use == "auto", 1, 0)
    > table(shuttle.2$use)
    
    ## 
    ##   0   1 
    ## 111 145
    

    Split into training and test sets:

    > set.seed(123)
    > train.index <- createDataPartition(shuttle.2$use, p = 0.7, list = F)
    > str(train.index)
    
    ##  int [1:180, 1] 1 4 5 6 7 8 9 11 13 14 ...
    ##  - attr(*, "dimnames")=List of 2
    ##   ..$ : NULL
    ##   ..$ : chr "Resample1"
    
    > shuttle.train <- shuttle.2[train.index, ]
    > shuttle.test <- shuttle.2[-train.index, ]
    > 
    > dim(shuttle.train)
    
    ## [1] 180  11
    
    > dim(shuttle.test)
    
    ## [1] 76 11
    

    4. Model Building and Evaluation

    Previously, we used y ~ . to specify every variable in the data set (except the response) as an input, but neuralnet does not allow that notation. The way around this restriction is the as.formula() function: first build an object holding the variable names, then use it to paste the names onto the right-hand side of the formula.

    > p_load(neuralnet)
    > 
    > n <- names(shuttle.train)
    > form <- as.formula(paste("use ~", paste(n[!n %in% "use"], collapse = "+")))
    > print(form)
    
    ## use ~ stability.xstab + error.MM + error.SS + error.XL + sign.pp + 
    ##     wind.tail + magn.Medium + magn.Out + magn.Strong + vis.yes
    ## <environment: 0x000002699e075128>
    

    Build the model:

    > fit <- neuralnet(form, data = shuttle.train, err.fct = "ce", linear.output = F)
    

    Arguments:
     hidden: the number of hidden neurons in each layer; up to three hidden layers can be specified, with a default of 1 (see the sketch after this list)
     act.fct: the activation function; the default is the logistic function, and tanh is also available
     err.fct: the error function; the default is sse, but because we have a binary outcome we set it to ce for cross-entropy
     linear.output: a logical argument controlling whether to ignore act.fct; the default is TRUE, but for our data it must be FALSE
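
    For instance, a minimal sketch of a deeper network with two hidden layers and the tanh activation (these hyperparameters are illustrative only, not the book's settings, and the model is not evaluated here):

    > # Sketch: two hidden layers of 4 and 2 neurons with tanh activation;
    > # settings are illustrative, not tuned
    > fit2 <- neuralnet(form, data = shuttle.train, hidden = c(4, 2),
    +     act.fct = "tanh", err.fct = "ce", linear.output = FALSE)

    Returning to the fitted model, examine the result matrix: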

    > fit$result.matrix
    
    ##                                      [,1]
    ## error                         0.013651024
    ## reached.threshold             0.009868817
    ## steps                       670.000000000
    ## Intercept.to.1layhid1         5.136942014
    ## stability.xstab.to.1layhid1  -2.485264957
    ## error.MM.to.1layhid1          1.032588807
    ## error.SS.to.1layhid1          2.543705586
    ## error.XL.to.1layhid1          0.030906433
    ## sign.pp.to.1layhid1           0.840732458
    ## wind.tail.to.1layhid1         0.721638821
    ## magn.Medium.to.1layhid1       0.034567106
    ## magn.Out.to.1layhid1         -2.436662220
    ## magn.Strong.to.1layhid1      -0.099174792
    ## vis.yes.to.1layhid1          -7.556133035
    ## Intercept.to.use            -28.580429411
    ## 1layhid1.to.use              66.014874838
    

    The error is 0.013651024. The value of steps is the number of training iterations needed to reach the threshold, that is, until the absolute partial derivatives of the error function fall below the threshold (0.01 by default). The neuron with the highest weight is error.SS.to.1layhid1, at 2.543705586.
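
    A generalized weight expresses the contribution of the i-th covariate to the log-odds of the prediction. Formally (following the neuralnet documentation; notation ours):

    $$\tilde{w}_i = \frac{\partial}{\partial x_i}\,\log\!\left(\frac{\hat{y}(x)}{1-\hat{y}(x)}\right)$$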

    View the generalized weights:

    > head(fit$generalized.weights[[1]])
    
    ##         [,1]      [,2]      [,3]       [,4]      [,5]      [,6]       [,7]
    ## 1  -4.701598 1.9534405  4.812155 0.05846846 1.5904887 1.3651886 0.06539368
    ## 4  -2.355740 0.9787731  2.411135 0.02929567 0.7969158 0.6840290 0.03276556
    ## 5  -2.277955 0.9464547  2.331521 0.02832835 0.7706022 0.6614428 0.03168367
    ## 6  -2.593462 1.0775429  2.654447 0.03225195 0.8773340 0.7530556 0.03607199
    ## 7 -10.097313 4.1952759 10.334750 0.12556887 3.4157882 2.9319260 0.14044172
    ## 8  -9.798060 4.0709409 10.028460 0.12184740 3.3145548 2.8450328 0.13627946
    ##        [,8]        [,9]      [,10]
    ## 1 -4.609652 -0.18761781 -14.294612
    ## 4 -2.309670 -0.09400607  -7.162328
    ## 5 -2.233406 -0.09090206  -6.925833
    ## 6 -2.542743 -0.10349240  -7.885092
    ## 7 -9.899846 -0.40293446 -30.699599
    ## 8 -9.606445 -0.39099273 -29.789758
    

    Visualize:

    > plot(fit)
    
    Figure: network weights plot

    From this plot we can read the intercepts (bias terms) and the weight of each variable.
    Examine the generalized weights:

    > par(mfrow = c(1, 2))
    > # gwplot() kept failing with: Error in plot.window(...): 'ylim' values cannot be infinite
    > # gwplot(fit, selected.covariate = "vis.yes")
    > # gwplot(fit, selected.covariate = "wind.tail")
    > # Workaround: plot the generalized weights against each covariate by hand.
    > # Columns of generalized.weights follow the covariate order of the formula,
    > # so wind.tail is column 6 and vis.yes is column 10.
    > gw <- fit$generalized.weights[[1]]
    > plot(fit$covariate[, "vis.yes"], gw[, 10], main = "vis.yes", xlab = "", ylab = "")
    > plot(fit$covariate[, "wind.tail"], gw[, 6], main = "wind.tail", xlab = "", ylab = "")
    
    Figure: generalized weights plots

    The connection weights for wind.tail sit low overall. The generalized weights for vis.yes are highly asymmetric, whereas those for wind.tail are distributed very evenly, which suggests that wind.tail carries essentially no predictive power.

    Let's see how the model performs, starting with the training set:

    > results.train <- compute(fit, shuttle.train[, 1:10])
    > pred.train <- results.train$net.result
    > print(pred.train)
    
    ##             [,1]
    ## 1   1.000000e+00
    ## 4   1.000000e+00
    ## 5   1.000000e+00
    ## 6   1.000000e+00
    ## ---------------------
    ## 251 6.119581e-04
    ## 252 1.484097e-10
    ## 253 2.765947e-08
    ## 255 3.022915e-07
    ## 256 7.566422e-08
    
    > pred.train <- ifelse(pred.train < 0.5, 0, 1)
    > table(pred.train, shuttle.train$use)
    
    ##           
    ## pred.train   0   1
    ##          0  73   0
    ##          1   0 107
    

    The neural network reached 100% accuracy on the training set! Now let's see how it does on the test set:

    > result.test <- compute(fit, shuttle.test[, 1:10])
    > pred.test <- result.test$net.result
    > pred.test <- ifelse(pred.test < 0.5, 0, 1)
    > table(pred.test, shuttle.test$use)
    
    ##          
    ## pred.test  0  1
    ##         0 38  2
    ##         1  0 36
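
    For a fuller set of metrics (sensitivity, specificity, kappa, and so on), caret's confusionMatrix() can be applied to the same predictions (a sketch; caret was loaded earlier):

    > # Sketch: richer test-set metrics from caret
    > confusionMatrix(factor(pred.test), factor(shuttle.test$use), positive = "1")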
    

    There are two errors on the test set. Let's find out which ones:

    > which(pred.test == 0 & shuttle.test$use == 1)
    
    ## [1] 58 59
    

    Rows 58 and 59 of the test set were misclassified.
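
    To see what those two observations look like (a quick sketch):

    > # Sketch: inspect the two misclassified test observations
    > shuttle.test[c(58, 59), ]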

    5. A Deep Learning Example

    5.1 Installing H2O

    1. If the h2o package has been installed before, remove it first:

    > if ("package:h2o" %in% search()) {
    +     detach("package:h2o", unload = TRUE)
    + }
    
    ## [1] "A shutdown has been triggered. "
    
    > if ("package:h2o" %in% rownames(installed.packages())) {
    +     remove.packages("h2o")
    + }
    

    2. Download and install the packages h2o depends on:

    > library(pacman)
    > p_load(methods, statmod, stats, graphics, RCurl, jsonlite, tools, utils)
    

    3. Install and load the h2o package:

    > p_load(h2o)
    

    5.2 Uploading Data to the H2O Platform

    > path <- "./data_set/data-master/bank_DL.csv"
    

    Connect to the H2O platform and launch an instance on the cluster:

    > # nthreads = -1 lets the instance use all CPUs on the cluster
    > local.h2o <- h2o.init(nthreads = -1)
    
    ##  Connection successful!
    ## 
    ## R is connected to the H2O cluster: 
    ##     H2O cluster uptime:         5 hours 57 minutes 
    ##     H2O cluster timezone:       Asia/Shanghai 
    ##     H2O data parsing timezone:  UTC 
    ##     H2O cluster version:        3.28.0.4 
    ##     H2O cluster version age:    15 days  
    ##     H2O cluster name:           H2O_started_from_R_Admin_wzk082 
    ##     H2O cluster total nodes:    1 
    ##     H2O cluster total memory:   1.57 GB 
    ##     H2O cluster total cores:    4 
    ##     H2O cluster allowed cores:  4 
    ##     H2O cluster healthy:        TRUE 
    ##     H2O Connection ip:          localhost 
    ##     H2O Connection port:        54321 
    ##     H2O Connection proxy:       NA 
    ##     H2O Internal Security:      FALSE 
    ##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    ##     R Version:                  R version 3.6.2 (2019-12-12)
    

    The service is now running; you can also see it from a browser:


    Figure: the H2O web UI

    Upload the data file to the H2O platform:

    > # Other options: h2o.importFolder, h2o.importURL, h2o.importHDFS
    > bank <- h2o.uploadFile(path = path)
    
    > class(bank)
    
    ## [1] "H2OFrame"
    
    > # On the H2O platform, many R functions produce output that differs
    > # from what we have seen before.
    > str(bank)
    
    ## Class 'H2OFrame' <environment: 0x00000269a00bfc50> 
    ##  - attr(*, "op")= chr "Parse"
    ##  - attr(*, "id")= chr "bank_DL_sid_8edb_2"
    ##  - attr(*, "eval")= logi FALSE
    ##  - attr(*, "nrow")= int 4521
    ##  - attr(*, "ncol")= int 64
    ##  - attr(*, "types")=List of 64
    ##   ..$ : chr "real"
    ## --------------------------------------------------------
    ##   ..$ previous_2         : num  0 0 0 0 0 0 1 0 0 1
    ##   ..$ previous_3         : num  0 0 0 0 0 1 0 0 0 0
    ##   ..$ previous_4         : num  0 1 0 0 0 0 0 0 0 0
    ##   ..$ previous_5         : num  0 0 0 0 0 0 0 0 0 0
    ##   ..$ poutcome_failure   : num  0 1 1 0 0 1 0 0 0 1
    ##   ..$ poutcome_other     : num  0 0 0 0 0 0 1 0 0 0
    ##   ..$ poutcome_success   : num  0 0 0 0 0 0 0 0 0 0
    ##   ..$ poutcome_unknown   : num  1 0 0 1 1 0 0 1 1 0
    ##   ..$ y                  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1
    

    Check the distribution of the response variable:

    > h2o.table(bank$y)
    
    ##     y Count
    ## 1  no  4000
    ## 2 yes   521
    

    We can see that 521 customers responded "yes" to the bank's marketing campaign, while the other 4,000 responded "no". The response variable is somewhat unbalanced.

    5.3 Splitting into Training and Test Sets

    > # Create a single vector of uniform random numbers for the split
    > rand <- h2o.runif(bank, seed = 123)
    > 
    > train <- bank[rand <= 0.7, ] %>% h2o.assign(key = "train")
    > test <- bank[rand > 0.7, ] %>% h2o.assign(key = "test")
    
    > # Check whether the split is balanced
    > h2o.table(train[, 64])
    
    ##     y Count
    ## 1  no  2783
    ## 2 yes   396
    
    > h2o.table(test[, 64])
    
    ##     y Count
    ## 1  no  1217
    ## 2 yes   125
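
    As an aside, h2o.splitFrame() performs the same kind of split in a single call (a sketch, not the book's approach):

    > # Sketch: split bank into roughly 70/30 train/test in one call
    > splits <- h2o.splitFrame(bank, ratios = 0.7, seed = 123)
    > train2 <- splits[[1]]
    > test2 <- splits[[2]]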
    

    5.4 Building the Model

    We tune the hyperparameters with random search, which saves time compared to a full grid search. The hyperparameters to examine: the tanh activation function with and without dropout, three different hidden-layer configurations (neuron combinations), two dropout ratios, and two learning rates.

    > # Build the list of hyperparameters for the random search
    > hyper.params <- list(activation = c("Tanh", "TanhWithDropout"), 
    +     hidden = list(c(20, 20), c(30, 30), c(30, 30, 30)), 
    +     input_dropout_ratio = c(0, 0.05), rate = c(0.01, 0.25))
    
    > # Build the list of search criteria; strategy = "RandomDiscrete" requests a
    > # random search (for a full grid search, set it to "Cartesian")
    > search.criteria <- list(
    +   strategy = "RandomDiscrete", max_runtime_secs = 420,
    +   max_models = 100, seed = 123, stopping_rounds = 5,
    +   # Stopping criterion: the 5 best models are within 1% error of one another
    +   stopping_tolerance = 0.01
    + )
    
    > random.search <- h2o.grid(
    +   # deep learning algorithm
    +   algorithm = "deeplearning",
    +   grid_id = "random.search",
    +   # training data
    +   training_frame = train,
    +   # validation data
    +   validation_frame = test,
    +   # input features
    +   x = 1:63,
    +   # response variable
    +   y = 64,
    +   epochs = 1,
    +   stopping_metric = "misclassification",
    +   hyper_params = hyper.params,
    +   search_criteria = search.criteria
    + )
    

    Examine the results for the five best-performing models:

    > grid <- h2o.getGrid("random.search", sort_by = "auc", decreasing = T)
    > grid
    
    ## H2O Grid Details
    ## ================
    ## 
    ## Grid ID: random.search 
    ## Used hyper parameters: 
    ##   -  activation 
    ##   -  hidden 
    ##   -  input_dropout_ratio 
    ##   -  rate 
    ## Number of models: 24 
    ## Number of failed models: 0 
    ## 
    ## Hyper-Parameter Search Summary: ordered by decreasing auc
    ##        activation       hidden input_dropout_ratio rate              model_ids
    ## 1 TanhWithDropout     [30, 30]                 0.0 0.01 random.search_model_17
    ## 2 TanhWithDropout     [30, 30]                0.05 0.01 random.search_model_16
    ## 3 TanhWithDropout     [30, 30]                 0.0 0.25  random.search_model_6
    ## 4 TanhWithDropout [30, 30, 30]                0.05 0.01 random.search_model_23
    ## 5            Tanh [30, 30, 30]                0.05 0.01 random.search_model_14
    ##                  auc
    ## 1 0.8635497124075596
    ## 2 0.8588824979457683
    ## 3 0.8580049301561217
    ## 4   0.85473459326212
    ## 5 0.8506162695152013
    ## 
    ## ---
    ##         activation       hidden input_dropout_ratio rate
    ## 19            Tanh     [20, 20]                0.05 0.25
    ## 20 TanhWithDropout     [20, 20]                 0.0 0.25
    ## 21            Tanh [30, 30, 30]                0.05 0.25
    ## 22            Tanh     [30, 30]                0.05 0.01
    ## 23 TanhWithDropout [30, 30, 30]                 0.0 0.01
    ## 24 TanhWithDropout [30, 30, 30]                0.05 0.25
    ##                 model_ids                auc
    ## 19 random.search_model_18 0.8081840591618734
    ## 20  random.search_model_8  0.802872637633525
    ## 21  random.search_model_3  0.798695152013147
    ## 22  random.search_model_2 0.7974330320460148
    ## 23 random.search_model_22 0.7543467543138865
    ## 24 random.search_model_11  0.716936729663106
    

    So model 17 is the winner: it uses the tanh activation function with dropout, two hidden layers of 30 neurons each, a dropout ratio of 0.0, and a learning rate of 0.01; its AUC is roughly 0.864.
    Check the model's performance on the test set via the confusion matrix:

    > best.model <- h2o.getModel(grid@model_ids[[1]])
    
    > h2o.confusionMatrix(best.model, valid = T)
    
    ## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.123302996428301:
    ##          no yes    Error       Rate
    ## no     1133  84 0.069022   =84/1217
    ## yes      60  65 0.480000    =60/125
    ## Totals 1193 149 0.107303  =144/1342
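
    The validation AUC can also be pulled directly from the model object (a sketch):

    > # Sketch: validation AUC of the best model
    > h2o.auc(best.model, valid = TRUE)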
    

    Although the overall error rate is under 11%, the model makes far too many errors on the yes label, misclassifying almost half of the actual yes cases. This suggests that the unbalanced classes may be a problem.

    5.5 Building a Model with Cross-Validation

    > dlmodel <- h2o.deeplearning(x = 1:63, y = 64, training_frame = train, 
    +     hidden = c(30, 30), 
    +     epochs = 3, nfolds = 5, fold_assignment = "Stratified", balance_classes = T, 
    +     activation = "TanhWithDropout", seed = 123, adaptive_rate = F, input_dropout_ratio = 0, 
    +     stopping_metric = "misclassification", variable_importances = T)
    
    > dlmodel
    
    ## Model Details:
    ## ==============
    ## 
    ## H2OBinomialModel: deeplearning
    ## Model ID:  DeepLearning_model_R_1583821038674_163 
    ## Status of Neuron Layers: predicting y, 2-class classification, bernoulli distribution, CrossEntropy loss,
    ## 2,912 weights/biases, 22.9 KB, 18,957 training samples, mini-batch size 1
    ##   layer units        type dropout       l1       l2 mean_rate rate_rms
    ## 1     1    63       Input  0.00 %       NA       NA        NA       NA
    ## 2     2    30 TanhDropout 50.00 % 0.000000 0.000000  0.004907 0.000000
    ## 3     3    30 TanhDropout 50.00 % 0.000000 0.000000  0.004907 0.000000
    ## 4     4     2     Softmax      NA 0.000000 0.000000  0.004907 0.000000
    ##   momentum mean_weight weight_rms mean_bias bias_rms
    ## 1       NA          NA         NA        NA       NA
    ## 2 0.000000    0.025206   0.814655  0.147166 0.735583
    ## 3 0.000000   -0.005851   0.337015 -0.084604 0.411204
    ## 4 0.000000    0.053744   0.769235 -0.004516 0.303754
    ## 
    ## 
    ## H2OBinomialMetrics: deeplearning
    ## ** Reported on training data. **
    ## ** Metrics reported on full training frame **
    ## 
    ## MSE:  0.1826312
    ## RMSE:  0.4273537
    ## LogLoss:  0.5439753
    ## Mean Per-Class Error:  0.1502497
    ## AUC:  0.9159916
    ## AUCPR:  0.8807761
    ## Gini:  0.8319832
    ## 
    ## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
    ##          no  yes    Error       Rate
    ## no     2108  675 0.242544  =675/2783
    ## yes     161 2617 0.057955  =161/2778
    ## Totals 2269 3292 0.150333  =836/5561
    ## 
    ## Maximum Metrics: Maximum metrics at their respective thresholds
    ##                         metric threshold       value idx
    ## 1                       max f1  0.105880    0.862273 298
    ## 2                       max f2  0.034591    0.918540 352
    ## 3                 max f0point5  0.300881    0.845340 190
    ## 4                 max accuracy  0.125034    0.850207 286
    ## 5                max precision  0.595273    0.942308   4
    ## 6                   max recall  0.004675    1.000000 393
    ## 7              max specificity  0.600539    0.999641   0
    ## 8             max absolute_mcc  0.105880    0.711645 298
    ## 9   max min_per_class_accuracy  0.245084    0.838733 218
    ## 10 max mean_per_class_accuracy  0.125034    0.850278 286
    ## 11                     max tns  0.600539 2782.000000   0
    ## 12                     max fns  0.600539 2764.000000   0
    ## 13                     max fps  0.003311 2783.000000 399
    ## 14                     max tps  0.004675 2778.000000 393
    ## 15                     max tnr  0.600539    0.999641   0
    ## 16                     max fnr  0.600539    0.994960   0
    ## 17                     max fpr  0.003311    1.000000 399
    ## 18                     max tpr  0.004675    1.000000 393
    ## 
    ## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
    ## 
    ## H2OBinomialMetrics: deeplearning
    ## ** Reported on cross-validation data. **
    ## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
    ## 
    ## MSE:  0.08705731
    ## RMSE:  0.2950548
    ## LogLoss:  0.2875898
    ## Mean Per-Class Error:  0.2193127
    ## AUC:  0.872492
    ## AUCPR:  0.4780927
    ## Gini:  0.744984
    ## 
    ## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
    ##          no yes    Error       Rate
    ## no     2497 286 0.102767  =286/2783
    ## yes     133 263 0.335859   =133/396
    ## Totals 2630 549 0.131802  =419/3179
    ## 
    ## Maximum Metrics: Maximum metrics at their respective thresholds
    ##                         metric threshold       value idx
    ## 1                       max f1  0.424191    0.556614 172
    ## 2                       max f2  0.122184    0.657076 295
    ## 3                 max f0point5  0.424191    0.507330 172
    ## 4                 max accuracy  0.717398    0.884869  63
    ## 5                max precision  0.957903    0.800000   3
    ## 6                   max recall  0.000684    1.000000 398
    ## 7              max specificity  0.972282    0.999281   0
    ## 8             max absolute_mcc  0.424191    0.490448 172
    ## 9   max min_per_class_accuracy  0.188988    0.804887 266
    ## 10 max mean_per_class_accuracy  0.122184    0.809987 295
    ## 11                     max tns  0.972282 2781.000000   0
    ## 12                     max fns  0.972282  395.000000   0
    ## 13                     max fps  0.000346 2783.000000 399
    ## 14                     max tps  0.000684  396.000000 398
    ## 15                     max tnr  0.972282    0.999281   0
    ## 16                     max fnr  0.972282    0.997475   0
    ## 17                     max fpr  0.000346    1.000000 399
    ## 18                     max tpr  0.000684    1.000000 398
    ## 
    ## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
    ## Cross-Validation Metrics Summary: 
    ##                 mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
    ## accuracy   0.8826488 0.019553455  0.9073783  0.8575949 0.88621795 0.89269054
    ## auc        0.8777314 0.022875382  0.8861917 0.89000493  0.8754107 0.89764535
    ## aucpr     0.47863057 0.055181786 0.52034676  0.4284161 0.52908653 0.50504553
    ## err       0.11735118 0.019553455 0.09262166 0.14240506 0.11378205 0.10730949
    ## err_count       74.6   12.381437       59.0       90.0       71.0       69.0
    ##           cv_5_valid
    ## accuracy  0.86936235
    ## auc       0.83940434
    ## aucpr     0.41025785
    ## err       0.13063763
    ## err_count       84.0
    ## 
    ## ---
    ##                   mean          sd cv_1_valid cv_2_valid  cv_3_valid
    ## pr_auc      0.47863057 0.055181786 0.52034676  0.4284161  0.52908653
    ## precision    0.5355106    0.077891  0.5882353 0.42741936   0.5903614
    ## r2           0.2003508  0.08083034 0.18793476 0.22176513 0.086192444
    ## recall       0.6443926 0.075244226  0.5633803  0.7361111   0.5697674
    ## rmse        0.29464537  0.02022303 0.28359163 0.28028414  0.32952103
    ## specificity  0.9167673 0.031265505 0.95053005  0.8732143    0.936803
    ##             cv_4_valid cv_5_valid
    ## pr_auc      0.50504553 0.41025785
    ## precision    0.5940594  0.4774775
    ## r2          0.31192234 0.19393937
    ## recall       0.6818182  0.6708861
    ## rmse        0.28509894  0.2947311
    ## specificity  0.9261261  0.8971631
    

    Now look at the performance on the test set:

    > perf <- h2o.performance(dlmodel, test)
    
    > perf
    
    ## H2OBinomialMetrics: deeplearning
    ## 
    ## MSE:  0.07173268
    ## RMSE:  0.2678296
    ## LogLoss:  0.2341893
    ## Mean Per-Class Error:  0.2004174
    ## AUC:  0.8735283
    ## AUCPR:  0.3951697
    ## Gini:  0.7470567
    ## 
    ## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
    ##          no yes    Error       Rate
    ## no     1031 186 0.152835  =186/1217
    ## yes      31  94 0.248000    =31/125
    ## Totals 1062 280 0.161699  =217/1342
    ## 
    ## Maximum Metrics: Maximum metrics at their respective thresholds
    ##                         metric threshold       value idx
    ## 1                       max f1  0.257779    0.464198 175
    ## 2                       max f2  0.183122    0.609168 210
    ## 3                 max f0point5  0.575237    0.496183  25
    ## 4                 max accuracy  0.576611    0.915052  23
    ## 5                max precision  0.604630    1.000000   0
    ## 6                   max recall  0.003262    1.000000 399
    ## 7              max specificity  0.604630    1.000000   0
    ## 8             max absolute_mcc  0.257779    0.428554 175
    ## 9   max min_per_class_accuracy  0.183122    0.808000 210
    ## 10 max mean_per_class_accuracy  0.117068    0.816250 242
    ## 11                     max tns  0.604630 1217.000000   0
    ## 12                     max fns  0.604630  124.000000   0
    ## 13                     max fps  0.003262 1217.000000 399
    ## 14                     max tps  0.003262  125.000000 399
    ## 15                     max tnr  0.604630    1.000000   0
    ## 16                     max fnr  0.604630    0.992000   0
    ## 17                     max fpr  0.003262    1.000000 399
    ## 18                     max tpr  0.003262    1.000000 399
    ## 
    ## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
    

    For comparison, here is the cross-validation confusion matrix (reported on the training data):

    Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
             no yes    Error       Rate
    no     2497 286 0.102767  =286/2783
    yes     133 263 0.335859   =133/396
    Totals 2630 549 0.131802  =419/3179
    

    And the confusion matrix on the test set:

    Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
             no yes    Error       Rate
    no     1031 186 0.152835  =186/1217
    yes      31  94 0.248000    =31/125
    Totals 1062 280 0.161699  =217/1342
    

    The overall error rate went up while the false negative rate came down, so more tuning work is needed.
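
    To score new data with this model, h2o.predict() returns the predicted class along with per-class probabilities (a sketch):

    > # Sketch: predictions on the test frame
    > preds <- h2o.predict(dlmodel, test)
    > head(preds)
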
    Finally, we can compute variable importance. In the table below, the variables are sorted by importance, but the ordering is subject to sampling variation: change the random seed and it may well change. Here are the top five and bottom six variables by importance:

    > dlmodel@model$variable_importances
    
    ## Variable Importances: 
    ##           variable relative_importance scaled_importance percentage
    ## 1         duration            1.000000          1.000000   0.105319
    ## 2 poutcome_success            0.738116          0.738116   0.077738
    ## 3        month_oct            0.415810          0.415810   0.043793
    ## 4        month_mar            0.282554          0.282554   0.029758
    ## 5 poutcome_unknown            0.263573          0.263573   0.027759
    ## 
    ## ---
    ##             variable relative_importance scaled_importance percentage
    ## 58 contact_telephone            0.072925          0.072925   0.007680
    ## 59    job_unemployed            0.071896          0.071896   0.007572
    ## 60        campaign_3            0.070856          0.070856   0.007463
    ## 61     job_housemaid            0.066491          0.066491   0.007003
    ## 62        campaign_6            0.065220          0.065220   0.006869
    ## 63       campaign_10            0.065091          0.065091   0.006855
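
    When you are finished, the local H2O cluster can be shut down (a sketch; prompt = FALSE skips the confirmation):

    > # Sketch: shut down the H2O instance started by h2o.init()
    > h2o.shutdown(prompt = FALSE)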
    
