AUC :曲线下面积(Area Under the Curve)
AUROC :接受者操作特征曲线下面积(Area Under the Receiver Operating Characteristic curve)
1. ROC曲线概述
ROC曲线是一种评价分类模型的可视化工具。ROC的图形是横纵坐标限定在0-1范围内的曲线,横坐标是假正率FPR(错误的判断为正确的概率),纵坐标是真正率TPR(正确的判断为正确的概率)。通常我们认为,曲线的凸起程度越高,模型性能越好,而曲线越接近于对角线,模型的准确性越低。
2. AUC
AUC表示ROC曲线下方的面积,是对ROC曲线的量化。由于ROC曲线的横纵坐标都是0-1,因此AUC是1x1方格中的一部分,其大小在0-1之间。

3. ROC曲线的绘制
3.1 基础概念
- 预测概率和阈值:
分类模型的输出结果中包含一个0-1的概率值,该概率值代表着对应的样本被预测为某类别的可能性。然后再通过阈值来进行划分,概率大于阈值的被判断为正,概率小于阈值的被判断为负。 - TPR和FPR:ROC曲线的横坐标为FPR,纵坐标为TPR,FPR是错误的预测为正的概率,TPR是错误的预测为正的概率。

3.2 ROC曲线绘制步骤
- 将全部样本按概率递减排序
- 阈值从1至0变更,计算各阈值下对应的(FPR,TPR)数值对。
- 将数值对绘于直角坐标系中。



4. ROC and AUC in R
# install.packages("pROC")
# install.packages("randomForest")
library(pROC)
library(randomForest) #Random Forest is a way to classify samples and we can change the threshold that we use to make those decisions.
set.seed(420) # this will make my results match yours
num.samples <- 100
weight <- sort(rnorm(n=num.samples, mean=172, sd=29))
obese <- ifelse(test=(runif(n=num.samples) < (rank(weight)/num.samples)),
yes=1, no=0)
obese
plot(x=weight, y=obese)

## fit a logistic regression to the data...
glm.fit=glm(obese ~ weight, family=binomial)
lines(weight, glm.fit$fitted.values)

draw ROC and AUC using pROC
#######################################
##
## draw ROC and AUC using pROC
##
#######################################
## NOTE: By default, the graphs come out looking terrible
## The problem is that ROC graphs should be square, since the x and y axes
## both go from 0 to 1. However, the window in which I draw them isn't square
## so extra whitespace is added to pad the sides.
roc(obese, glm.fit$fitted.values, plot=TRUE)
## Now let's configure R so that it prints the graph as a square.
##
par(pty = "s") ## pty sets the aspect ratio of the plot region. Two options:
## "s" - creates a square plotting region
## "m" - (the default) creates a maximal plotting region
roc(obese, glm.fit$fitted.values, plot=TRUE)
## NOTE: By default, roc() uses specificity on the x-axis and the values range
## from 1 to 0. This makes the graph look like what we would expect, but the
## x-axis itself might induce a headache. To use 1-specificity (i.e. the
## False Positive Rate) on the x-axis, set "legacy.axes" to TRUE.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE)
## If you want to rename the x and y axes...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage")
## We can also change the color of the ROC line, and make it wider...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#377eb8", lwd=4)
## If we want to find out the optimal threshold we can store the
## data used to make the ROC graph in a variable...
roc.info <- roc(obese, glm.fit$fitted.values, legacy.axes=TRUE)
str(roc.info)
## and then extract just the information that we want from that variable.
roc.df <- data.frame(
tpp=roc.info$sensitivities*100, ## tpp = true positive percentage
fpp=(1 - roc.info$specificities)*100, ## fpp = false positive precentage
thresholds=roc.info$thresholds)
head(roc.df) ## head() will show us the values for the upper right-hand corner
## of the ROC graph, when the threshold is so low
## (negative infinity) that every single sample is called "obese".
## Thus TPP = 100% and FPP = 100%
tail(roc.df) ## tail() will show us the values for the lower left-hand corner
## of the ROC graph, when the threshold is so high (infinity)
## that every single sample is called "not obese".
## Thus, TPP = 0% and FPP = 0%
## now let's look at the thresholds between TPP 60% and 80%...
roc.df[roc.df$tpp > 60 & roc.df$tpp < 80,]
## We can calculate the area under the curve...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)
## ...and the partial area under the curve.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#377eb8", lwd=4, print.auc=TRUE, print.auc.x=45, partial.auc=c(100, 90), auc.polygon = TRUE, auc.polygon.col = "#377eb822")
#######################################
##
## Now let's fit the data with a random forest...
##
#######################################
rf.model <- randomForest(factor(obese) ~ weight)
## ROC for random forest
roc(obese, rf.model$votes[,1], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#4daf4a", lwd=4, print.auc=TRUE)
#######################################
##
## Now layer logistic regression and random forest ROC graphs..
##
#######################################
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)
plot.roc(obese, rf.model$votes[,1], percent=TRUE, col="#4daf4a", lwd=4, print.auc=TRUE, add=TRUE, print.auc.y=40)
legend("bottomright", legend=c("Logisitic Regression", "Random Forest"), col=c("#377eb8", "#4daf4a"), lwd=4)
#######################################
##
## Now that we're done with our ROC fun, let's reset the par() variables.
## There are two ways to do it...
##
#######################################
par(pty = "m")
参考:
https://www.bilibili.com/video/BV1SK4y1K7v3
https://www.youtube.com/watch?v=qcvAqAH60Yw
网友评论