美文网首页R for statistics
数据分析:二分类变量的logistic regression计算

数据分析:二分类变量的logistic regression计算

作者: 生信学习者2 | 来源:发表于2021-01-24 17:23 被阅读0次

介绍

离散型的二分类变量(预测值只有两种,如0或1)在没有做转化是无法进行回归分析(一般回归分析适用于连续型变量),我们可以借助广义线性回归模型logistic回归对二分类变量进行转化(对预测值使用了逻辑函数,也即是预测值的范围是0到1),最终实现回归分析。更多知识分享请到 https://zouhua.top/

logistic regression基础

logistic 函数:


logistic方程和logistic曲线:逻辑曲线在z=0时,十分敏感,在z>>0或z<<0处,都不敏感,将预测值限定为(0,1)


加载数据

# install.packages("ISLR") # 使用该包的数据

data <- ISLR::Default
head(data)
 default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

训练模型

  • 构建训练集和测试集

  • 训练模型

set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
train <- data[sample, ]
test <- data[-sample, ]

fit <- glm(default ~ student+balance+income, family="binomial", data=train)
summary(fit)
Call:
glm(formula = default ~ student + balance + income, family = "binomial", 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5586  -0.1353  -0.0519  -0.0177   3.7973  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.148e+01  6.234e-01 -18.412   <2e-16 ***
studentYes  -4.933e-01  2.857e-01  -1.726   0.0843 .  
balance      5.988e-03  2.938e-04  20.384   <2e-16 ***
income       7.857e-06  9.965e-06   0.788   0.4304    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2021.1  on 6963  degrees of freedom
Residual deviance: 1065.4  on 6960  degrees of freedom
AIC: 1073.4

Number of Fisher Scoring iterations: 8

ROC曲线

  • 获取预测值: type的值可以是“link”, “response”, “terms” (需要了解下有何区别)

    • link:
    • response:因变量分类变量预测为"Yes|No"的概率值
    • terms:自变量预测的概率值(待定)
predicted <- predict(fit, test, type="response")
head(predicted)
           4            6            7           15           17           18 
3.259679e-04 1.648925e-03 1.762607e-03 9.694038e-03 1.536866e-05 1.709650e-04

Notes:每个样本对应的预测值是数字而不是我以为的 default的 ”Yes|No“
Notes:测试样本的default预测为”Yes|No“的概率值(待定)

  • 获取ROC曲线:通过pROC的roc和auc函数分别获取roc对象和auc值
library(ggplot2)
library(pROC)
rocobj <- roc(test$default, predicted)
auc <- round(auc(test$default, predicted),4)
  • 可视化结果

  • legacy.axes 控制specificity(x 轴)是否升序排列

  • geom_abline 添加对角线

  • annotate 添加AUC结果

  • coord_cartesian 设置坐标轴范围

  • family="serif" 设置新罗马字体

ggroc(rocobj, color = "red", linetype = 1, size = 1, alpha = 1, legacy.axes = T)+
                geom_abline(intercept = 0, slope = 1, color="grey", size = 1, linetype=1)+
              labs(x = "False Positive Rate (1 - Specificity)",
                   y = "True Positive Rate (Sensivity or Recall)")+
              annotate("text",x = .75, y = .25,label=paste("AUC =", auc),
                       size = 5, family="serif")+
              coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))+
              theme_bw()+
              theme(panel.background = element_rect(fill = 'transparent'),
                    axis.ticks.length = unit(0.4, "lines"), 
                    axis.ticks = element_line(color='black'),
                    axis.line = element_line(size=.5, colour = "black"),
                    axis.title = element_text(colour='black', size=12,face = "bold"),
                    axis.text = element_text(colour='black',size=10,face = "bold"),
                    text = element_text(size=8, color="black", family="serif"))

R Information

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ISLR_1.2            forcats_0.5.0       stringr_1.4.0       purrr_0.3.4         readr_1.4.0         tidyr_1.1.2        
 [7] tidyverse_1.3.0     xgboost_1.3.1.1     mlbench_2.1-1       survminer_0.4.8     ggpubr_0.4.0        survcomp_1.40.0    
[13] prodlim_2019.11.13  survival_3.2-7      caretEnsemble_2.0.1 pROC_1.16.2         caret_6.0-86        ggplot2_3.3.3      
[19] lattice_0.20-41     data.table_1.13.6   tibble_3.0.4        dplyr_1.0.2        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1         backports_1.2.0      plyr_1.8.6           splines_4.0.3        digest_0.6.27       
  [6] SuppDists_1.1-9.5    foreach_1.5.1        htmltools_0.5.0      fansi_0.4.1          magrittr_1.5        
 [11] openxlsx_4.2.3       recipes_0.1.15       modelr_0.1.8         gower_0.2.2          colorspace_2.0-0    
 [16] rvest_0.3.6          haven_2.3.1          xfun_0.19            crayon_1.3.4         jsonlite_1.7.1      
 [21] libcoin_1.0-7        zoo_1.8-8            iterators_1.0.13     glue_1.4.2           gtable_0.3.0        
 [26] ipred_0.9-9          questionr_0.7.3      car_3.0-10           kernlab_0.9-29       abind_1.4-5         
 [31] scales_1.1.1         mvtnorm_1.1-1        DBI_1.1.0            rstatix_0.6.0        miniUI_0.1.1.1      
 [36] Rcpp_1.0.5           xtable_1.8-4         Cubist_0.2.3         foreign_0.8-80       km.ci_0.5-2         
 [41] Formula_1.2-4        stats4_4.0.3         lava_1.6.8.1         httr_1.4.2           ellipsis_0.3.1      
 [46] pkgconfig_2.0.3      farver_2.0.3         nnet_7.3-14          dbplyr_2.0.0         utf8_1.1.4          
 [51] tidyselect_1.1.0     labeling_0.4.2       rlang_0.4.8          reshape2_1.4.4       later_1.1.0.1       
 [56] munsell_0.5.0        cellranger_1.1.0     tools_4.0.3          cli_2.1.0            generics_0.1.0      
 [61] broom_0.7.3          evaluate_0.14        fastmap_1.0.1        yaml_2.2.1           bootstrap_2019.6    
 [66] ModelMetrics_1.2.2.2 knitr_1.30           fs_1.5.0             zip_2.1.1            survMisc_0.5.5      
 [71] caTools_1.18.0       randomForest_4.6-14  pbapply_1.4-3        nlme_3.1-150         mime_0.9            
 [76] xml2_1.3.2           compiler_4.0.3       rstudioapi_0.12      curl_4.3             e1071_1.7-4         
 [81] ggsignif_0.6.0       reprex_0.3.0         klaR_0.6-15          stringi_1.5.3        highr_0.8           
 [86] Matrix_1.2-18        gbm_2.1.8            ggsci_2.9            survivalROC_1.0.3    KMsurv_0.1-5        
 [91] vctrs_0.3.4          pillar_1.4.6         lifecycle_0.2.0      combinat_0.0-8       cowplot_1.1.1       
 [96] bitops_1.0-6         httpuv_1.5.4         R6_2.5.0             promises_1.1.1       KernSmooth_2.23-18  
[101] gridExtra_2.3        C50_0.1.3.1          rio_0.5.16           codetools_0.2-18     MASS_7.3-53         
[106] assertthat_0.2.1     withr_2.3.0          parallel_4.0.3       hms_0.5.3            grid_4.0.3          
[111] rpart_4.1-15         labelled_2.7.0       timeDate_3043.102    class_7.3-17         rmarkdown_2.5       
[116] inum_1.0-1           carData_3.0-4        partykit_1.2-11      shiny_1.5.0          lubridate_1.7.9     
[121] rmeta_3.0 

参考

  1. logistic回归介绍

  2. logistic回归模型基础

  3. How to Plot a ROC curve using ggplot2

参考文章如引起任何侵权问题,可以与我联系,谢谢。

相关文章

网友评论

    本文标题:数据分析:二分类变量的logistic regression计算

    本文链接:https://www.haomeiwen.com/subject/lsvmzktx.html