Using ROC Curves and Precision-Recall Curves for Classification in Python

Author: Pingouin | Published: 2020-10-20 22:56

    Ref: How to Use ROC Curves and Precision-Recall Curves for Classification in Python
    Ref: Recommended reading: a very clear answer on Zhihu

    Basic concepts

    ROC: receiver operating characteristic curve
    PRC: precision-recall curve

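    ROC curves and precision-recall curves are diagnostic tools that help interpret the probabilistic predictions of classification (mainly binary) predictive modeling problems.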

    ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
    Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.

    • ROC curves: appropriate when observations are balanced between the classes.
    • Precision-recall curves: appropriate for imbalanced datasets.
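    A small numeric sketch of why class balance matters. The rates and class sizes below are hypothetical, chosen only for illustration: the same classifier (same true positive rate and false positive rate, hence the same ROC point) can have very different precision once negatives heavily outnumber positives.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a fixed TPR/FPR and the given class sizes."""
    tp = tpr * n_pos  # expected number of true positives
    fp = fpr * n_neg  # expected number of false positives
    return tp / (tp + fp)

# Hypothetical classifier with TPR = 0.9 and FPR = 0.1 in both scenarios.
balanced = precision_at(0.9, 0.1, n_pos=1000, n_neg=1000)
imbalanced = precision_at(0.9, 0.1, n_pos=100, n_neg=10000)

print(f"balanced precision:   {balanced:.3f}")
print(f"imbalanced precision: {imbalanced:.3f}")
```

    The ROC point is identical in both cases, while precision drops sharply in the imbalanced one, which is the behaviour a precision-recall curve makes visible.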

    Predicting probability

    In a classification problem, we may decide to predict the class values directly. Alternatively, it can be more flexible to predict the probability of each class instead. Why? Doing so lets us choose, and even calibrate, the threshold used to interpret the predicted probabilities.
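    As a minimal sketch of what choosing a threshold means, the snippet below converts predicted probabilities into class labels at several thresholds. The probabilities and the helper name `to_labels` are made up for illustration.

```python
# Hypothetical predicted probabilities for the positive class.
probs = [0.05, 0.30, 0.45, 0.55, 0.80, 0.95]

def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it is >= threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

labels_default = to_labels(probs)        # usual 0.5 cut-off
labels_strict = to_labels(probs, 0.9)    # fewer positives predicted
labels_lenient = to_labels(probs, 0.25)  # more positives predicted

print(labels_default, labels_strict, labels_lenient)
```

    Raising the threshold makes the model more reluctant to predict the positive class; lowering it does the opposite.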

    There are two types of errors when making a prediction for a binary (two-class) classification problem:

    • FP: predict an event when there was no event
    • FN: predict no event when there was an event
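    The two error types, together with the two kinds of correct prediction, can be counted directly from labels. This is a plain-Python sketch with made-up labels; `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels encoded as 0/1."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # event predicted, event happened
        elif pred == 1 and truth == 0:
            fp += 1  # event predicted, no event happened
        elif pred == 0 and truth == 0:
            tn += 1  # no event predicted, no event happened
        else:
            fn += 1  # no event predicted, event happened
    return tp, fp, tn, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
```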

    A common way to compare models that predict probabilities for two classes is to use a ROC curve.

    ROC Curve

    sensitivity = true positive rate = TP / (TP + FN)

    false positive rate = FP / (FP + TN) = 1 - specificity
    specificity = TN / (FP + TN)
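    The rates above can be computed from the four confusion-matrix counts. A short sketch with hypothetical counts:

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR, specificity) from confusion-matrix counts."""
    tpr = tp / (tp + fn)          # sensitivity
    fpr = fp / (fp + tn)          # false alarm rate
    specificity = tn / (fp + tn)  # true negative rate
    return tpr, fpr, specificity

# Hypothetical counts: 50 actual positives, 100 actual negatives.
tpr, fpr, spec = rates(tp=40, fp=10, tn=90, fn=10)
print(tpr, fpr, spec)
```

    Note that FPR and specificity always sum to 1, since both share the denominator FP + TN.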

    accuracy = (TP + TN) / (TP+TN+FP+FN)
    In binary classification, especially when we are interested in the minority class, accuracy is not very useful. For example, if 90% of the samples are negative, a model that always predicts negative already reaches 90% accuracy.
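    A quick sketch of this pitfall, assuming a made-up dataset with 90 negatives and 10 positives and a model that always predicts the negative class:

```python
# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a "model" that always predicts negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)  # share of actual positives that were found

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

    Accuracy looks good even though not a single minority-class sample is detected.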

    precision = TP / (TP + FP)
    TP / retrieved set = P(Y = 1 | Ŷ = 1)

    recall = TP / (TP + FN)
    TP / golden set = P(Ŷ = 1 | Y = 1)
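    A sketch of both metrics using the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the source specifies.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts; 0.0 if a denominator is empty."""
    retrieved = tp + fp  # everything the model flagged as positive
    golden = tp + fn     # everything that is actually positive
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / golden if golden else 0.0
    return precision, recall

precision, recall = precision_recall(tp=30, fp=10, fn=20)
print(precision, recall)
```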

    Precision and recall trade off against each other.
    If we want to cover more samples, it is easier to make mistakes -> high recall -> low precision.
    If we have a conservative model -> low recall -> high precision.
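    The trade-off can be seen by moving the decision threshold on a fixed set of scores. This is an illustrative sketch with made-up labels and scores; a low threshold covers more samples, a high threshold is more conservative.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]

lenient = precision_recall_at(y_true, scores, threshold=0.3)
strict = precision_recall_at(y_true, scores, threshold=0.7)
print("threshold 0.3:", lenient)
print("threshold 0.7:", strict)
```

    On this toy data the lenient threshold finds every positive but with lower precision, while the strict threshold is always right when it fires but misses half of the positives.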

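    AUC: area under the curve. It can be used as a summary of model skill. In probabilistic terms, AUC is the probability that, for a randomly chosen pair of one positive and one negative sample, the positive sample receives a higher score than the negative one.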
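    The probabilistic reading of AUC can be checked by brute force over all positive/negative pairs. This is a sketch on made-up data; counting ties as half a win is the usual convention but is an assumption here, and the quadratic loop is only suitable for small examples.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tie: counted as half a win (assumed convention)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]

auc = pairwise_auc(y_true, scores)
print(auc)
```

    Here 13 of the 16 positive/negative pairs are ranked correctly, so the value is 13/16.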

    ROC: x-axis: false positive rate; y-axis: true positive rate, aka false alarm rate vs. hit rate.

    Smaller values on the x-axis of the plot indicate lower false positives and higher true negatives.
    Larger values on the y-axis of the plot indicate higher true positives and lower false negatives
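    A sketch of how the points of a ROC curve can be produced: sweep the threshold over the observed scores, record (FPR, TPR) at each step, and integrate with the trapezoid rule. The data and helper names are hypothetical; in practice a library routine such as scikit-learn's `roc_curve` is normally used instead.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs, starting at (0, 0), for thresholds from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]

points = roc_points(y_true, scores)
auc = trapezoid_auc(points)
print(points)
print(auc)
```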

    When we predict a positive outcome, the prediction is either correct (true positive) or not (false positive). There is a tension between these two options; the same holds for true negatives and false negatives.

    A skilful model will assign a higher probability to a randomly chosen real positive occurrence than a negative occurrence on average. This is what we mean when we say that the model has skill. Generally, skilful models are represented by curves that bow up to the top left of the plot.

    A no-skill classifier is one that cannot discriminate between the classes and would predict a random class or a constant class in all cases. A model with no skill is represented at the point (0.5, 0.5). A model with no skill at each threshold is represented by a diagonal line from the bottom left of the plot to the top right and has an AUC of 0.5.
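    A sketch of the no-skill case, assuming a classifier that returns the same score for every sample (labels are made up). Such a model can only reach the two corner points of the ROC plot, and the line joining them is the diagonal with area 0.5.

```python
y_true = [1, 0, 1, 0, 0, 1, 0, 0]  # hypothetical labels
scores = [0.5] * len(y_true)       # constant score: no discrimination

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    return fp / n_neg, tp / n_pos

# Only two operating points exist: nothing flagged, or everything flagged.
none_flagged = roc_point(y_true, scores, threshold=0.6)
all_flagged = roc_point(y_true, scores, threshold=0.5)

# Area under the straight line joining the two points (trapezoid rule).
auc = (all_flagged[0] - none_flagged[0]) * (all_flagged[1] + none_flagged[1]) / 2
print(none_flagged, all_flagged, auc)
```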

    A model with perfect skill is represented at a point (0,1). A model with perfect skill is represented by a line that travels from the bottom left of the plot to the top left and then across the top to the top right.

    An operator may plot the ROC curve for the final model and choose a threshold that gives a desirable balance between the false positives and false negatives.
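    One possible way to pick such a threshold is to maximise Youden's J statistic (TPR minus FPR). The source does not name a specific criterion, so this is an assumption used only for illustration, on made-up data; other criteria weigh false positives and false negatives by their actual costs.

```python
def best_threshold_by_youden(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR over observed scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_thr, best_j = None, -1.0
    for thr in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # keep the first threshold reaching the best J
            best_thr, best_j = thr, j
    return best_thr, best_j

# Hypothetical data: 5 negatives and 4 positives.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.45, 0.5, 0.7, 0.8, 0.9]

best_thr, best_j = best_threshold_by_youden(y_true, scores)
print(best_thr, best_j)
```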

    F1 score:

    F1 = 2 * recall * precision / (recall + precision)
    F1 balances recall and precision.
    recall -> risk -> sensitivity -> true positive rate, ideally 1
    precision -> cost -> (loosely) specificity -> false positive rate, ideally 0
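    A sketch of the F1 formula as the harmonic mean of precision and recall, with hypothetical input values. Returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.75, 0.6)  # both metrics reasonably high
lopsided = f1_score(1.0, 0.1)   # perfect precision, poor recall

print(f"balanced: {balanced:.3f}, lopsided: {lopsided:.3f}")
```

    Because it is a harmonic mean, F1 stays low whenever either precision or recall is low, so a model cannot score well by optimising only one of them.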

    Article link: https://www.haomeiwen.com/subject/wmsjmktx.html