基于R语言的申请评分卡

作者: 奔跑的蜈蚣 | 来源:发表于2017-12-08 00:39 被阅读782次

基于R语言的申请评分卡
申请评分卡目录
信用评分卡模型开发流程
风控(一) 评分卡开发
申请评分卡（3）——建模（R）
基于工业数据申请评分卡
【干货code】R建立评分卡详例
信贷评分卡
R语言信用评分卡建模
开篇：风控评分卡知识总结

1.引言

信贷行业中常见的评分卡包括：申请评分卡(Application)、行为评分卡(Behavior)、催收评分卡(Collection)以及反欺诈评分卡（Anti-Fraud），简称为A卡、B卡、C卡和F卡。
A卡，主要应用于贷前准入环节对新用户的信用评级。
B卡，主要应用于贷中管理环节对存量用户的行为预测。
C卡，主要应用于贷后催收环节对存量用户是否催收的预测管理。
F卡，主要应用于贷前准入环节对新用户可能存在的欺诈行为进行预测。
本文通过历史数据建立Logistic回归模型，预测用户出现违约的概率，从而建立申请评分卡模型。
本文数据来自“klaR”包中的German credit data。

2.数据导入与观察

加载要用到的数据，并进行初步数据观察：

> library(klaR)                     #数据集包
> library(VIM)                      #缺失值可视化
> library(party)                    #随机森林
> library(InformationValue)         #求IV值
> library(smbinning)                #最优分段
> library(ggplot2)                  #可视化
> library(gridExtra)                #可视化
> library(woe)                      #求woe值
> library(car)                      #检验多重共线性
> library(pROC)
> data(GermanCredit)
> str(GermanCredit)
略
> summary(GermanCredit)
略

该数据集包含了1000个样本，每个样本包括21个变量，变量含义如下：

变量解释

3.数据清洗

数据清洗主要工作包括缺失值和异常值处理。

3.1 缺失值处理

查看缺失值情况：

> aggr(x = GermanCredit,prop=T,numbers=T,combined=F)

图1

从以上结果可看出，本数据集不存在缺失值。

3.2 异常值处理

查看定量指标异常值情况：

> quan_index <-c("duration","amount","installment_rate","present_residence",
+        "age","number_credits","people_liable")
> quan_vars <- GermanCredit[,quan_index]
> boxplot(scale(quan_vars),col="lightgray")

图2

从图2可以看出，定量指标中存在异常值。下面，让我们具体来看一下：

> table(boxplot.stats(quan_variables$duration)$out)

45 47 48 54 60 72 
 5  1 48  2 13  1 
> table(boxplot.stats(quan_variables$amount)$out)

 7966  7980  8065  8072  8086  8133  8229  8318  8335  8358  8386  8471  8487  8588 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
 8613  8648  8858  8947  8978  9034  9055  9157  9271  9277  9283  9398  9436  9566 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
 9572  9629  9857  9960 10127 10144 10222 10297 10366 10477 10623 10722 10875 10961 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
10974 11054 11328 11560 11590 11760 11816 11938 11998 12169 12204 12389 12579 12612 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
12680 12749 12976 13756 14027 14179 14318 14421 14555 14782 14896 15653 15672 15857 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
15945 18424 
    1     1 
> table(boxplot.stats(quan_variables$age)$out)

65 66 67 68 70 74 75 
 5  5  3  3  1  4  2 
> table(boxplot.stats(quan_variables$number_credits)$out)

4 
6 
> table(boxplot.stats(quan_variables$people_liable)$out)

  2 
155

根据具体情况来看，定量指标中存在的异常值是基本符合实际情况的，而且数据集样本数量较少，因此不对异常值做处理。（本例比较特殊，实际工作中的情况肯定会比较复杂）

4.特征变量选择

本数据集包含了定量和定性两类指标，接下来我们用不同的方法，筛选出对违约状态影响最大的指标，作为构建模型的变量。
首先，根据简单随机抽样，将数据集划分为训练集和测试集：

> set.seed(1234)
> GermanCredit$credit_risk <- ifelse(GermanCredit$credit_risk=="good",0,1)
> sam <- sample(nrow(GermanCredit), 800, replace = F)
> train <- GermanCredit[sam, ]
> test <- GermanCredit[-sam, ]

4.1 定量指标

以下用随机森林法和Logistic回归方法，寻找对因变量影响最显著的自变量：

> # 提取定量指标
> quant_vars<-c("duration","amount","installment_rate","present_residence","age",
+               "number_credits","people_liable","credit_risk")
> quant_data<-GermanCredit[,quant_vars]   
> # 随机森林法
> fit1 <- cforest(credit_risk~.,data = quant_data,controls = cforest_unbiased(mtry = 2, ntree = 50))
> # 调整变量间的相关系数，获取自变量的重要性
> sort(varimp(fit1,conditional = T),decreasing=T)
         duration               age            amount     people_liable  installment_rate 
     0.0046574142      0.0032983348      0.0028402490      0.0009693390      0.0007194029 
   number_credits present_residence 
     0.0005876221     -0.0001082808 
> # 调整样本变量不平衡性，获取自变量的重要性
> sort(varimpAUC(fit1),decreasing = T)
         duration            amount               age  installment_rate     people_liable 
     0.0185098322      0.0118307629      0.0089927377      0.0040057809      0.0026379584 
   number_credits present_residence 
     0.0011849564     -0.0009655443
> # Logistic回归
> fit2 <- glm(credit_risk~.,data = quant_data,family = binomial())
> fit2 <- step(fit2,trace = 0)
> summary(fit2)

Call:
glm(formula = credit_risk ~ duration + installment_rate + age, 
    family = binomial(), data = quant_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4943  -0.8464  -0.6890   1.2015   2.2806  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.427935   0.362767  -3.936 8.28e-05 ***
duration          0.037535   0.006553   5.728 1.02e-08 ***
installment_rate  0.158810   0.074329   2.137  0.03263 *  
age              -0.021756   0.007686  -2.831  0.00464 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
---
略

综合以上两种方法的结果，我们筛选出了对违约状态影响最显著的四个指标：duration(3)、age(3)、amount(2)和installment_rate(2)。

4.2 定性指标

通过R中的informationvalue包，计算各指标的IV值，得到各定性指标间的重要性度量，选取其中的high predictive指标：

> # 提取定性指标
> qualt_vars<-c("status","credit_history","purpose","savings","employment_duration",
+                "personal_status_sex","other_debtors","property",
+  "other_installment_plans","housing","job","telephone","foreign_worker","credit_risk")
> qualt_data<-train[,qualt_vars]
> # 求指标iv值
> all_iv <- data.frame(vars=qualt_vars,iv=numeric(length(qualt_vars)),
+                      strength=character(length(qualt_vars)),stringsAsFactors = F)
> for (i in qualt_vars){
+   all_iv[all_iv$vars==i,]$iv <- IV(X=qualt_data[,i], Y=qualt_data$credit_risk)
+   all_iv[all_iv$vars==i,]$strength <- attr(IV(X=qualt_data[,i], Y=qualt_data$credit_risk),"howgood")
+   }
> (all_iv<-all_iv[order(-all_iv$iv),] )
                      vars         iv            strength
1                   status 0.62161414   Highly Predictive
2           credit_history 0.28847143   Highly Predictive
4                  savings 0.21358882   Highly Predictive
3                  purpose 0.21326499   Highly Predictive
8                 property 0.09828292 Somewhat Predictive
5      employment_duration 0.08935235 Somewhat Predictive
10                 housing 0.07040316 Somewhat Predictive
6      personal_status_sex 0.06871949 Somewhat Predictive
9  other_installment_plans 0.04371291 Somewhat Predictive
7            other_debtors 0.04350575 Somewhat Predictive
13          foreign_worker 0.04262905 Somewhat Predictive
12               telephone 0.01669733      Not Predictive
11                     job 0.01002765      Not Predictive
14             credit_risk 0.00000000      Not Predictive

根据以上结果，我们选择status、credit_history、savings和purpose四个high predictive指标构建模型。
综上，我们共选择了8个变量作为入模变量。

> # 入模指标
> quant_model_vars <- c("duration","amount","installment_rate","age")
> qualt_model_vars <- c("status","credit_history","savings","purpose")

5.WOE计算

5.1 变量分箱

5.1.1 定量指标

计算定量指标的WOE之前，需要先对定量指标进行分段。下面，优先采用最优分段，其原理是基于条件推理树（conditional inference trees, Ctree）的递归分割算法，核心算法用函数ctree()表示。

> # 对duration进行最优分段
> result1<-smbinning(df=train,y="credit_risk",x="duration",p=0.05)
> 查看分段效果
> smbinning.plot(result1,option="WoE",sub="Duration")
> breaks1 <- c(0,7,30,Inf)
> train$cut_duration <- cut(train$duration,breaks = breaks1)

图3

从上图可以看出，woe值相差较大，分段效果不错。以下针对amount、age采用相同分段方法。

> #对amount进行最优分段
> result2<-smbinning(df=train,y="credit_risk",x="amount")
> smbinning.plot(result2,option="WoE",sub="Amount")
> breaks2 <-  c(0,1372,3913,Inf)
> train$cut_amount <- cut(train$amount,breaks = breaks2)
> result3<-smbinning(df=train,y="credit_risk",x="age")
> smbinning.plot(result3,option="WoE",sub="Age")
> breaks3 <- c(0,25,Inf)
> train$cut_age <- cut(train$age,breaks = breaks3)
> train$cut_rate <- cut(train$installment_rate,4)
> rate_woe <- woe(Data = train,Independent ="cut_rate",Continuous = F,Dependent = "credit_risk",
+                 C_Bin = 4,Bad = 0,Good = 1)
> ggplot(rate_woe, aes(x = BIN, y = WOE)) + 
+   geom_bar(stat = "identity",fill = "blue", colour = "grey60",
+            size = 0.2, alpha = 0.2)+
+   labs(title = "等距分段")+
+   theme(plot.title = element_text(hjust = 0.5))

installment_rate只有1、2、3、4四个值，四个值对应的woe值差别较大，且具有单调性，采用等距分段。

图4

5.1.2 定性指标

接下来我们需要对定性指标做必要的降维处理，方便计算其WOE值。首先，我们查看一下入模的定性指标的概况：

> discrete_data<-train[,qual_model_vars]
> summary(discrete_data)
                                        status                                        credit_history
 ... < 100 DM                              :211   no credits taken/all credits paid back duly: 34   
 0 <= ... < 200 DM                         :214   all credits at this bank paid back duly    : 41   
 ... >= 200 DM / salary for at least 1 year: 50   existing credits paid back duly till now   :422   
 no checking account                       :325   delay in paying off in the past            : 66   
                                                  critical account/other credits existing    :237   
                                                                                                    
                                                                                                    
                       savings                   purpose   
 ... < 100 DM              :478   domestic appliances:216  
 100 <= ... < 500 DM       : 84   car (new)          :189  
 500 <= ... < 1000 DM      : 48   radio/television   :145  
 ... >= 1000 DM            : 41   car (used)         : 83  
 unknown/no savings account:149   others             : 83  
                                  retraining         : 39  
                                  (Other)            : 45

由以上概况可知，定性指标status、credit_history、和savings的维数最高为5维，最低为4维，维数适中，可以不进行处理。
定性指标purpose的维数多于7维，明显高于其他定性指标。为了避免“维数灾难”，我们根据三条准则进行降维：1.维度间属性相似；2.合并后woe有明显变化；3.单个维度样本量不应过小。

> # 未进行降维前
> purpose_woe1 <- woe(Data = train,Independent ="purpose",Continuous = F,Dependent = "credit_risk",
+     C_Bin = 10,Bad = 0,Good = 1)
> ggplot(purpose_woe1, aes(x = BIN, y = WOE)) + 
+   geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+
+   labs(title = "Purpose")+
+   theme(plot.title = element_text(hjust = 0.5))

图5

> # 类似属性合并
> train <- within(train,{
+   cut_purpose <- NA
+   cut_purpose[purpose=="car (new)"] <- "car(new/used)"
+   cut_purpose[purpose=="car (used)"] <- "car(new/used)"
+   cut_purpose[purpose=="furniture/equipment"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="radio/television"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="domestic appliances"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="repairs"] <- "repairs/business/others"
+   cut_purpose[purpose=="education"] <- "education/retraining"
+   cut_purpose[purpose=="retraining"] <- "education/retraining"
+   cut_purpose[purpose=="business"] <- "repairs/business/others"
+   cut_purpose[purpose=="others"] <- "repairs/business/others"})
> purpose_woe2 <- woe(Data = train,Independent ="cut_purpose",Continuous = F,Dependent = "credit_risk",
+                       C_Bin = 10,Bad = 0,Good = 1)
>   ggplot(purpose_woe2, aes(x = BIN, y = WOE)) + 
+     geom_bar(stat = "identity",fill = "blue", colour = "grey60",
+              size = 0.2, alpha = 0.2)+
+     labs(title = "Purpose")+
+     theme(plot.title = element_text(hjust = 0.5))+
+     theme(axis.text.x = element_text(vjust = 0.2, hjust = 0.2, angle = 10)

图6

5.2 WOE计算

用klaR包中的woe()函数获取入模变量的woe值。

> newtrain <- cbind(discrete_data[,-4],train[,c(21:26)])
> str(newtrain)
'data.frame':   800 obs. of  9 variables:
 $ status        : Factor w/ 4 levels "... < 100 DM",..: 4 4 4 1 4 4 2 3 3 4 ...
 $ credit_history: Factor w/ 5 levels "no credits taken/all credits paid back duly",..: 5 5 3 3 5 4 5 3 3 3 ...
 $ savings       : Factor w/ 5 levels "... < 100 DM",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ credit_risk   : num  1 1 0 1 0 0 1 1 0 0 ...
 $ cut_duration  : Factor w/ 3 levels "(0,7]","(7,30]",..: 3 2 2 3 2 3 2 3 2 3 ...
 $ cut_amount    : Factor w/ 3 levels "(0,1.37e+03]",..: 3 2 2 2 2 3 3 3 1 3 ...
 $ cut_age       : Factor w/ 2 levels "(0,25]","(25,Inf]": 1 2 2 1 2 1 2 2 1 2 ...
 $ cut_rate      : Factor w/ 4 levels "(0.997,1.75]",..: 4 3 4 4 4 2 4 4 3 4 ...
 $ cut_purpose   : chr  "car(new/used)" "car(new/used)" "furniture/equipment/radio/television/domestic appliances" "furniture/equipment/radio/television/domestic appliances" ...
> newtrain$credit_risk <- as.factor(newtrain$credit_risk)
> newtrain$cut_purpose <- as.factor(newtrain$cut_purpose)
>  # 获取woe值
> woemodel<-klaR::woe(credit_risk~.,data = newtrain,zeroadj=0.5,applyontrain=TRUE)
> traindata <- predict(woemodel, newtrain, replace = TRUE)  
> str(traindata)
'data.frame':   800 obs. of  9 variables:
 $ credit_risk       : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 2 1 1 ...
 $ woe.status        : num  1.127 1.127 1.127 -0.772 1.127 ...
 $ woe.credit_history: num  0.6989 0.6989 -0.0638 -0.0638 0.6989 ...
 $ woe.savings       : num  -0.292 -0.292 -0.292 -0.292 -0.292 ...
 $ woe.cut_duration  : num  -0.8058 0.0808 0.0808 -0.8058 0.0808 ...
 $ woe.cut_amount    : num  -0.55 0.419 0.419 0.419 0.419 ...
 $ woe.cut_age       : num  -0.612 0.161 0.161 -0.612 0.161 ...
 $ woe.cut_rate      : num  -0.15884 -0.00808 -0.15884 -0.15884 -0.15884 ...
 $ woe.cut_purpose   : num  -0.0199 -0.0199 0.1853 0.1853 0.1853 ...

至此，我们已经获得了入模变量对应的woe值。值的注意的是，我们之前将好客户设定为0，坏客户设定为1，所以woe值越大，代表客户违约的概率越大，但traindata中的woe实际是按照好客户为1，坏客户为0计算的，所以与之前变量分箱中计算的woe正好相反。
下面正式开始构建模型，并转换为标准评分卡。

6 模型构建与验证

6.1 构建逻辑回归模型

> # 用获得的woe数据进行逻辑回归
> trainmodel<-glm(credit_risk~.,data=traindata,family = binomial())
> summary(trainmodel)

Call:
glm(formula = credit_risk ~ ., family = binomial(), data = traindata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7274  -0.7140  -0.4202   0.7805   2.5005  

Coefficients:
                   Estimate Std. Error z value             Pr(>|z|)    
(Intercept)         -0.9024     0.0924  -9.766 < 0.0000000000000002 ***
woe.status          -0.7947     0.1205  -6.594      0.0000000000427 ***
woe.credit_history  -0.8322     0.1712  -4.861      0.0000011658520 ***
woe.savings         -0.8522     0.2102  -4.055      0.0000501789230 ***
woe.cut_duration    -0.7785     0.1875  -4.151      0.0000330526950 ***
woe.cut_amount      -0.7891     0.2259  -3.493             0.000478 ***
woe.cut_age         -0.9719     0.2803  -3.468             0.000525 ***
woe.cut_rate        -1.7358     0.5309  -3.269             0.001078 ** 
woe.cut_purpose     -1.0409     0.3958  -2.630             0.008541 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 963.44  on 799  degrees of freedom
Residual deviance: 754.15  on 791  degrees of freedom
AIC: 772.15

Number of Fisher Scoring iterations: 5

从以上逻辑回归的结果来看，各个变量都通过了显著性检验。同时，为防止多重共线性问题的出现，我们对模型进行VIF检验：

> vif(trainmodel)
        woe.status woe.credit_history        woe.savings   woe.cut_duration 
          1.043140           1.024758           1.039515           1.048207 
    woe.cut_amount        woe.cut_age       woe.cut_rate    woe.cut_purpose 
          1.088700           1.038714           1.055447           1.036517

从结果可知，所有变量VIF均小于4，可以判断模型中不存在多重共线性问题。

6.2 转换为标准评分卡

根据信用评分卡模型的建立，我们可以得到：

其中，woe=ln(odds)，odds为good_rate/bad_rate，beita为回归系数，altha为截距，n为变量个数，offset为偏移量（视风险偏好而定），比例因子factor。
假定odds=50，对应的评分为600，在此基础上评分值增加20分（1个PDO），可以使odds翻番，则可以得出：

6.2.1 转换为标准评分卡

获取基础分以及训练集中各变量的分数：

> # 设定alpha为比例因子factor，beta为风险偏移量offset
> alpha_beta <- function(basepoints,baseodds,pdo){
+   alpha <- pdo/log(2)
+   beta <- basepoints-alpha*log(baseodds)
+   return(list(alpha=alpha,beta=beta))
+ }
> # 指定0dds=50时，基础分为600分，比率翻番的分数为20，计算评分卡的系数alpha和beta
> (x <- alpha_beta(600,50,20))
$alpha
[1] 28.8539

$beta
[1] 487.1229
> # 获得模型系数
> (coefficients <- trainmodel$coefficients)
       (Intercept)         woe.status woe.credit_history        woe.savings 
        -0.9023751         -0.7946682         -0.8321739         -0.8522369 
  woe.cut_duration     woe.cut_amount        woe.cut_age       woe.cut_rate 
        -0.7785099         -0.7891044         -0.9718816         -1.7357656 
   woe.cut_purpose 
        -1.0409225 
> #构造计算分值函数：
> vars_score<-function(i){
+   score = -round(x$alpha*coefficients[i]*traindata[,names(coefficients[i])])
+   return(score)
+ }
> # 计算基础分值
> (basepoint <- round(x$beta-x$alpha*coefficients[1]))
(Intercept) 
        513 
> # 1.status_score
> status_score <- vars_score(2)
> colnames(status_score)<-"status_score" 
> # 2.credit_history_score
> credit_history_score <- vars_score(3)
> colnames(credit_history_score)<-"credit_history_score"
> # 3.savings_score
> savings_score <- vars_score(4)
> colnames(savings_score)<-"savings_score"
> # 4.duration_score
> duration_score <- vars_score(5)
> colnames(duration_score)<-"duration_score"
> # 5.amount_score
> amount_score <- vars_score(6)
> colnames(amount_score)<-"amount_score"
> # 6.age_score
> age_score <- vars_score(7)
> colnames(age_score)<-"age_score"
> # 7.rate_score
> rate_score <- vars_score(8)
> colnames(rate_score)<-"rate_score"
> # 8.purpose_score
> purpose_score <- vars_score(9)
> colnames(purpose_score)<-"purpose_score"

6.2.2 输出标准评分卡

输出CSV格式的标准评分卡：

> # 基础分
> a <- c("","basepoint",513)
> b <- matrix(r1,nrow = 1)
> colnames(b)<-c("Variable","Basepoint","Score")
> #2.duration的分值
> duration_Cutpoint <- as.matrix(newtrain$cut_duration,stringsAsFactors=F)
> duration_scoreCard<-cbind(as.matrix(c("Duration","",""),ncol=1),
+                           unique(cbind(duration_Cutpoint,duration_score)))
> #3.amount的分值
> amount_Cutpoint <- as.matrix(newtrain$cut_amount,stringsAsFactors=F)
> amount_scoreCard<-cbind(as.matrix(c("Amount","",""),ncol=1),
+                         unique(cbind(amount_Cutpoint,amount_score)))
> #4.age的分值
> age_Cutpoint <- as.matrix(newtrain$cut_age,stringsAsFactors=F)
> age_scoreCard<-cbind(as.matrix(c("Age",""),ncol=1),
+                      unique(cbind(age_Cutpoint,age_score)))
> #5.installment_rate的分值
> rate_Cutpoint <- as.matrix(newtrain$cut_rate,stringsAsFactors=F)
> rate_scoreCard<-cbind(as.matrix(c("Installment_rate","","",""),ncol=1),
+                       unique(cbind(rate_Cutpoint,rate_score)))
> #6.status的分值
> status <- as.matrix(newtrain$status,stringsAsFactors=F)
> status_scoreCard<-cbind(as.matrix(c("Status","","",""),ncol=1),
+                         unique(cbind(status,status_score)))
> #7.credit_history的分值
> credit_history <- as.matrix(newtrain$credit_history,stringsAsFactors=F)
> credit_history_scoreCard<-cbind(as.matrix(c("Credit_history","","","",""),ncol=1),
+                                 unique(cbind(credit_history,credit_history_score)))
> #8.savings的分值
> savings <- as.matrix(newtrain$savings,stringsAsFactors=F)
> savings_scoreCard<-cbind(as.matrix(c("Savings","","","",""),ncol=1),
+                          unique(cbind(savings,savings_score)))
> #9.purpose的分值
> purpose <- as.matrix(newtrain$cut_purpose,stringsAsFactors=F)
> purpose_scoreCard<-cbind(as.matrix(c("Purpose","","",""),ncol=1),
+                          unique(cbind(purpose,purpose_score)))
> scoreCard_CSV<-rbind(m1,duration_scoreCard,amount_scoreCard,age_scoreCard,
+                      rate_scoreCard,status_scoreCard,credit_history_scoreCard,
+                      savings_scoreCard,purpose_scoreCard)
> scoreCard_CSV<-rbind(b,duration_scoreCard,amount_scoreCard,age_scoreCard,
+                      rate_scoreCard,status_scoreCard,credit_history_scoreCard,
+                      savings_scoreCard,purpose_scoreCard)
> #输出标准评分卡到文件中
> write.csv(scoreCard_CSV,"C:/Users/Administrator/Desktop/ScoreCard.CSV")

7 模型验证

对测试集中的样本做同样的降维处理：

> # 对duration分段
> breaks1 <- c(0,7,30,Inf)
> test$cut_duration <- cut(test$duration,breaks = breaks1)
> # 对amount分段
> breaks2 <-  c(0,1372,3913,Inf)
> test$cut_amount <- cut(test$amount,breaks = breaks2)
> # 对age分段
> breaks3 <- c(0,25,Inf)
> test$cut_age <- cut(test$age,breaks = breaks3)
> # 对installment_rate分段
> test$cut_rate <- cut(test$installment_rate,4)
> # 对purpose分段
> test <- within(test,{
+   cut_purpose <- NA
+   cut_purpose[purpose=="car (new)"] <- "car(new/used)"
+   cut_purpose[purpose=="car (used)"] <- "car(new/used)"
+   cut_purpose[purpose=="furniture/equipment"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="radio/television"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="domestic appliances"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="repairs"] <- "repairs/business/others"
+   cut_purpose[purpose=="education"] <- "education/retraining"
+   cut_purpose[purpose=="retraining"] <- "education/retraining"
+   cut_purpose[purpose=="business"] <- "repairs/business/others"
+   cut_purpose[purpose=="others"] <- "repairs/business/others"}) 
> newtest <- cbind(test[,qualt_model_vars][-4],test[,c(21:26)])
> newtest$credit_risk <- as.factor(newtest$credit_risk)
> newtest$cut_purpose <- as.factor(newtest$cut_purpose)
> # 将newtest中的各个变量转换为对应的woe值
> woemodel_test<-klaR::woe(credit_risk~.,data = newtest,zeroadj=0.5,applyontrain=TRUE)
> # 获得woe数据框
> testdata <- predict(woemodel_test, newtest, replace = TRUE)  
> # 测试集验证
> prob <- predict(trainmodel,testdata,type="response")
> logit.pred <- ifelse(prob>0.5,1,0)  #阈值简单设为0.5
> (Freq <- table(logit.pred,testdata$credit_risk))
          
logit.pred   0   1
         0 121  35
         1  11  33 
> # 准确率
> (ACC <- sum(diag(Freq))/sum(Freq))
[1] 0.77
> # AUC、Gini系数
> modelroc <- roc(testdata$credit_risk,prob)
> plot(modelroc, print.auc=TRUE, auc.polygon=T, grid=c(0.1, 0.2),
+      grid.col=c("green", "red"), max.auc.polygon=TRUE,
+      auc.polygon.col="skyblue", print.thres=T)
> modelauc<- auc(modelroc)
> (Gini <- 2*modelauc-1)
[1] 0.6106283

图7

从以上结果可知，模型准确率ACC为0.77，AUC为0.805，Gini系数为0.61，整体效果尚可。

8 总结

本文通过对Germancredit数据的挖掘分析，从数据清洗、变量筛选、WOE计算、建模分析到模型验证，创建了一个简单的申请评分卡。
本文用到的数据集比较简单，在实操中，数据清洗应该会占用更多时间和精力。
本文仅进行了一次样本抽样，在实操中，应进行K折交叉检验，提升模型准确度。
开发的模型是基于某一时间的特定样本的，随着时间的推移和信贷政策的变化，样本会发生变化，从而造成模型的区分能力和稳定性变差。一般需要定期对模型的使用情况进行检测并报告模型区分能力和稳定性的变化情况，必要时应采取包括修正模型或重建模型等措施。这是后面需要认真学习的地方！