AUC and ROC

Author: Hayley笔记 | Published 2021-05-15 18:55

AUC: Area Under the Curve

AUROC: Area Under the Receiver Operating Characteristic curve

1. Overview of the ROC curve

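The ROC curve is a visual tool for evaluating classification models. It is plotted with both axes limited to the range 0 to 1: the x-axis is the false positive rate (FPR, the fraction of actual negatives that are wrongly predicted as positive), and the y-axis is the true positive rate (TPR, the fraction of actual positives that are correctly predicted as positive). In general, the further the curve bows toward the upper-left corner, the better the model performs; the closer it lies to the diagonal, the closer the model is to random guessing.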

2. AUC

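AUC is the area under the ROC curve and summarizes the curve as a single number. Because both axes of the ROC curve run from 0 to 1, the AUC is a portion of the 1x1 unit square, so its value lies between 0 and 1.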
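To make the "area" concrete, here is a minimal sketch that computes AUC with the trapezoidal rule from a handful of (FPR, TPR) points. The points are made up for illustration and are not taken from the data used later in this post.

```r
# Hypothetical (FPR, TPR) points of an ROC curve, sorted by increasing FPR
fpr <- c(0, 0, 0, 1/3, 1/3, 2/3, 1)
tpr <- c(0, 1/3, 2/3, 2/3, 1, 1, 1)

# Trapezoidal rule: for each pair of consecutive points,
# area = width along the x-axis * average height
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc

# For comparison, the diagonal from (0, 0) to (1, 1) has an area of 0.5
diag_auc <- sum(diff(c(0, 1)) * (c(0) + c(1)) / 2)
diag_auc
```

A curve that hugs the upper-left corner pushes this sum toward 1, while a curve near the diagonal stays close to 0.5.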

3. Drawing the ROC curve

3.1 Basic concepts

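Predicted probability and threshold:
The output of a classification model includes a probability between 0 and 1 that represents how likely the sample is to belong to a given class. A threshold then splits the samples: those with a probability above the threshold are classified as positive, and those below it as negative.
TPR and FPR: The x-axis of the ROC curve is FPR and the y-axis is TPR. FPR is the probability that a negative sample is wrongly predicted as positive; TPR is the probability that a positive sample is correctly predicted as positive.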
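As a small illustration of these definitions, the sketch below computes TPR and FPR at a single threshold. The labels, probabilities, and the threshold of 0.5 are invented for the example.

```r
# Toy data (made up for illustration): true labels and predicted probabilities
labels <- c(1, 1, 0, 1, 0, 0)
probs  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
threshold <- 0.5

# Samples with a probability above the threshold are called positive
predicted <- as.integer(probs > threshold)

tp <- sum(predicted == 1 & labels == 1)  # true positives
fp <- sum(predicted == 1 & labels == 0)  # false positives
fn <- sum(predicted == 0 & labels == 1)  # false negatives
tn <- sum(predicted == 0 & labels == 0)  # true negatives

tpr <- tp / (tp + fn)  # fraction of actual positives predicted as positive
fpr <- fp / (fp + tn)  # fraction of actual negatives predicted as positive
c(TPR = tpr, FPR = fpr)
```

Changing `threshold` moves this single (FPR, TPR) point; the ROC curve is the set of such points over all thresholds.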

3.2 Steps for drawing the ROC curve

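Sort all samples by predicted probability in descending order.
Move the threshold from 1 down to 0 and compute the (FPR, TPR) pair at each threshold.
Plot the pairs in a Cartesian coordinate system.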
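The steps above can be sketched by hand in base R. This is only an illustrative sketch: `roc_points` is a hypothetical helper written for this example, the data are invented, and using each observed probability (plus `-Inf`) as a threshold is one assumed way of sweeping from high to low.

```r
# Hypothetical helper: one (FPR, TPR) pair per threshold
roc_points <- function(labels, probs) {
  # Step 1: sort the distinct probabilities in descending order and use them
  # as thresholds; -Inf is appended so that the last point calls everything positive
  thresholds <- c(sort(unique(probs), decreasing = TRUE), -Inf)
  # Step 2: compute TPR and FPR at each threshold
  tpr <- sapply(thresholds, function(t) sum(probs > t & labels == 1) / sum(labels == 1))
  fpr <- sapply(thresholds, function(t) sum(probs > t & labels == 0) / sum(labels == 0))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Toy data (made up for illustration)
labels <- c(1, 1, 0, 1, 0, 0)
probs  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
pts <- roc_points(labels, probs)
pts

# Step 3: plot the pairs, with the diagonal as a reference line
plot(pts$fpr, pts$tpr, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(0, 1, lty = 2)
```

The curve starts at (0, 0), where the threshold is so high that no sample is called positive, and ends at (1, 1), where every sample is called positive.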

4. ROC and AUC in R

# install.packages("pROC")
# install.packages("randomForest")
library(pROC) 
library(randomForest) #Random Forest is a way to classify samples and we can change the threshold that we use to make those decisions.
set.seed(420) # this will make my results match yours
num.samples <- 100
weight <- sort(rnorm(n=num.samples, mean=172, sd=29))
obese <- ifelse(test=(runif(n=num.samples) < (rank(weight)/num.samples)), 
                yes=1, no=0)
obese
plot(x=weight, y=obese)

## fit a logistic regression to the data...
glm.fit=glm(obese ~ weight, family=binomial)
lines(weight, glm.fit$fitted.values)

draw ROC and AUC using pROC

#######################################
##
## draw ROC and AUC using pROC
##
#######################################
## NOTE: By default, the graphs come out looking terrible
## The problem is that ROC graphs should be square, since the x and y axes
## both go from 0 to 1. However, the window in which I draw them isn't square
## so extra whitespace is added to pad the sides.
roc(obese, glm.fit$fitted.values, plot=TRUE)
## Now let's configure R so that it prints the graph as a square.
##
par(pty = "s") ## pty sets the aspect ratio of the plot region. Two options:
##                "s" - creates a square plotting region
##                "m" - (the default) creates a maximal plotting region
roc(obese, glm.fit$fitted.values, plot=TRUE)
## NOTE: By default, roc() uses specificity on the x-axis and the values range
## from 1 to 0. This makes the graph look like what we would expect, but the
## x-axis itself might induce a headache. To use 1-specificity (i.e. the 
## False Positive Rate) on the x-axis, set "legacy.axes" to TRUE.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE)
## If you want to rename the x and y axes...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage")
## We can also change the color of the ROC line, and make it wider...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4)
## If we want to find out the optimal threshold we can store the 
## data used to make the ROC graph in a variable...
roc.info <- roc(obese, glm.fit$fitted.values, legacy.axes=TRUE)
str(roc.info)
## and then extract just the information that we want from that variable.
roc.df <- data.frame(
  tpp=roc.info$sensitivities*100, ## tpp = true positive percentage
  fpp=(1 - roc.info$specificities)*100, ## fpp = false positive percentage
  thresholds=roc.info$thresholds)
head(roc.df) ## head() will show us the values for the upper right-hand corner
## of the ROC graph, when the threshold is so low 
## (negative infinity) that every single sample is called "obese".
## Thus TPP = 100% and FPP = 100%
tail(roc.df) ## tail() will show us the values for the lower left-hand corner
## of the ROC graph, when the threshold is so high (infinity) 
## that every single sample is called "not obese". 
## Thus, TPP = 0% and FPP = 0%
## now let's look at the thresholds between TPP 60% and 80%...
roc.df[roc.df$tpp > 60 & roc.df$tpp < 80,]
## We can calculate the area under the curve...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)
## ...and the partial area under the curve.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE, print.auc.x=45, partial.auc=c(100, 90), auc.polygon = TRUE, auc.polygon.col = "#377eb822")
#######################################
##
## Now let's fit the data with a random forest...
##
#######################################
rf.model <- randomForest(factor(obese) ~ weight)
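## ROC for random forest
## NOTE: rf.model$votes[,1] is the fraction of out-of-bag votes for the first
## factor level ("0", i.e. not obese); roc() picks the comparison direction automatically.
roc(obese, rf.model$votes[,1], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#4daf4a", lwd=4, print.auc=TRUE)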
#######################################
##
## Now layer logistic regression and random forest ROC graphs..
##
#######################################
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)
plot.roc(obese, rf.model$votes[,1], percent=TRUE, col="#4daf4a", lwd=4, print.auc=TRUE, add=TRUE, print.auc.y=40)
legend("bottomright", legend=c("Logistic Regression", "Random Forest"), col=c("#377eb8", "#4daf4a"), lwd=4)
#######################################
##
## Now that we're done with our ROC fun, let's reset the par() variables.
## Here we set pty back to its default value, "m".
##
#######################################
par(pty = "m")

References:
https://www.bilibili.com/video/BV1SK4y1K7v3
https://www.youtube.com/watch?v=qcvAqAH60Yw