
Building an Application Scorecard on the LendingClub Dataset

Author: 70fa0b237415 | Published 2020-12-15 21:41

1. Introduction to the Data Source

This article builds an application scorecard using public loan data from the US company Lending Club.

2. Data Acquisition and Preprocessing

This article uses Lending Club's 2019 Q1 application data. Download link: https://www.lendingclub.com/info/statistics.action
Baidu Netdisk mirror:
Link: https://pan.baidu.com/s/1sKLLLyyO8rxR4oHzZwJItQ
Extraction code: a551


2.1 Reading the Data

First, import the packages needed for the subsequent analysis and apply some global settings. Note that variable_bin_methods (imported as varbin_meth) and variable_encode are local helper modules accompanying the reference book listed at the end, while feature_selector is Will Koehrsen's open-source FeatureSelector package.

import os,sys
sys.path.append("./lendingclub_data")
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import time,datetime
import variable_bin_methods as varbin_meth
import variable_encode as var_encode
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc,confusion_matrix,recall_score,precision_score,accuracy_score
from sklearn.linear_model import LogisticRegression
from feature_selector import FeatureSelector
from imblearn.over_sampling import SMOTE
import missingno as msno
import matplotlib
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore") ## suppress warnings

import seaborn as sns
# set the seaborn style
sns.set(style='white', font_scale=1.2)
%matplotlib inline
plt.rcParams["font.sans-serif"] = "SimHei"
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['axes.unicode_minus']=False
# display every expression result in a cell, not only the last
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Place the required files and data in a folder named "lendingclub_data" under the current directory. The first row of the raw LendingClub CSV is a notice line rather than a header, so it is skipped with header=1:

file_path = "./lendingclub_data/data"
file_name = "LoanStats_2019Q1.csv"
df_1 = pd.read_csv(os.path.join(file_path, file_name), header=1, low_memory=False)
df_1.head()

The dataset has 115,677 samples and 144 fields. Based on the variable descriptions, the variables can be divided into pre-loan (application-time) variables and post-loan variables.

2.2 Defining Good and Bad Samples

The label variable in this dataset is loan_status; its value counts can be checked with:

df_1.groupby(by = ["loan_status"])[["int_rate"]].count()

Samples more than 15 days past due are defined as bad here, which gives the following good/bad definitions:

  1. Bad samples: more than 15 days overdue, i.e. loan_status is 'Late (16-30 days)', 'Late (31-120 days)', or 'Charged Off'
  2. Indeterminate samples: less than 15 days overdue as of the modeling cutoff, i.e. loan_status is 'In Grace Period'
  3. Good samples: loan_status is 'Current' or 'Fully Paid'

Map the label variable according to this logic, converting the string labels to numeric values.

df_1.rename(columns={"loan_status":"target"}, inplace=True)
print(df_1.shape)
# drop rows where target is null
df_1 = df_1.loc[~(df_1["target"].isnull())]
print(df_1.shape)
# target variable mapping
def target_mapping(lst):
    # 'Late (16-30 days)', 'Late (31-120 days)', 'Charged Off' -> 1 (bad)
    # 'In Grace Period' -> 2 (indeterminate)
    # 'Current', 'Fully Paid' -> 0 (good)
    mapping = {}
    for elem in lst:
        if elem in ['Late (16-30 days)', 'Late (31-120 days)', 'Charged Off']:
            mapping[elem] = 1
        elif elem in ['In Grace Period']:
            mapping[elem] = 2
        elif elem in ['Current','Fully Paid']:
            mapping[elem] = 0
        else:
            mapping[elem] = 3  # any other status
    return mapping

df_1["target"] = df_1["target"].map(target_mapping(df_1["target"].unique()))
df_1["target"].unique()

df_1 = df_1.loc[df_1["target"]<=1]
print(df_1.shape)

After the mapping, target can take values in {0, 1, 2, 3}. Dropping the indeterminate samples and keeping only good and bad ones (target 0 or 1) leaves 115,348 samples.
Compute the good-to-bad ratio:

sum(df_1["target"]==0) / sum(df_1["target"]==1)

The good-to-bad ratio is about 139:1, a severe class imbalance.

2.3 Data Cleaning and Preprocessing

2.3.1 Removing Post-Loan Variables

Post-loan variables are a form of data leakage for an application scorecard, since they are only observed after the loan is issued (total_pymnt, the total payment received to date, is one example). To keep the outcome from leaking into the model, these variables must be removed.

#1. drop post-loan variables
var_del = [ 'collection_recovery_fee','initial_list_status','last_credit_pull_d','last_pymnt_amnt',
       'last_pymnt_d','next_pymnt_d','out_prncp','out_prncp_inv','recoveries','total_pymnt',
       'total_pymnt_inv','total_rec_int','total_rec_late_fee','total_rec_prncp','settlement_percentage' ]
df_1 = df_1.drop(var_del, axis=1)

2.3.2 Removing LendingClub's Own Assessment Results

#2. drop LC's own credit assessment results; the interest rate is also set by LC
# (higher rates mean higher assessed risk), so it leaks information as well
var_del_1 = ['grade','sub_grade','int_rate']
df_1 = df_1.drop(var_del_1, axis=1)

2.3.3 Removing Variables with Many Missing Values

Variables with a missing rate of 95% or more are removed here.

def del_na(df, colname_1, rate):
    # df: DataFrame
    # colname_1: list of column names
    # rate: missing-rate threshold; variables at or above it are dropped
    na_cols = df[colname_1].isna().sum().sort_values(ascending=False)/float(df.shape[0])
    na_del = na_cols[na_cols >= rate]
    df = df.drop(na_del.index, axis=1)
    return df, na_del
df_1, na_del = del_na(df_1, list(df_1.columns), rate=0.95)
na_del

2.3.4 Removing Single-Value Variables

If a variable takes only one value, it has no predictive power for the target and should be removed.

def constant_del(df, cols):
    dele_list = []
    for col in cols:
        uniq_vals = list(df[col].unique())
        if pd.isnull(uniq_vals).any():
            # NaN plus exactly one real value is effectively constant
            if len(uniq_vals) == 2:
                dele_list.append(col)
                print("{} has only one distinct value and is dropped".format(col))
        elif len(df[col].unique())==1:
            dele_list.append(col)
            print("{} has only one distinct value and is dropped".format(col))
    df = df.drop(dele_list, axis=1)
    return df, dele_list

cols_name = list(df_1.columns)
cols_name.remove("target")
df_1, dele_list = constant_del(df_1, cols_name)

2.3.5 Removing Heavily Skewed Variables

Here a variable is considered unbalanced when a single value accounts for 90% or more of all samples; the check below applies this to variables with fewer than five distinct values.

def tail_del(df, cols, rate):
    dele_list = []
    len_1 = df.shape[0]
    for col in cols:
        if len(df[col].unique())<5:
            if df[col].value_counts().max()/len_1 >= rate:
                dele_list.append(col)
                print("{} is heavily skewed and is dropped".format(col))
    df = df.drop(dele_list, axis=1)
    return df, dele_list

cols_name_1 = list(df_1.columns)
cols_name_1.remove("target")
df_1, dele_list = tail_del(df_1, cols_name_1, rate=0.9)

The variable debt_settlement_flag is removed.

2.3.6 Removing Unhelpful Variables

Some variables duplicate the meaning of others, such as title and purpose, so one of them can go. emp_title (job title) has extremely high cardinality and is dropped directly; zip_code (postal code) is dropped as well.

len(df_1.emp_title.unique())
var_del_2 = ["emp_title", "zip_code", "title"]
df_1 = df_1.drop(var_del_2, axis=1) 

2.3.7 Cleaning Special Formats

Inspecting the data types shows a mix of strings, floats, and integers, with date fields also stored as strings. The variable revol_util (revolving line utilization rate) contains a '%' sign, which must be stripped before converting the column to float.

df_1["revol_util"] = df_1["revol_util"].str.replace("%","").astype("float")

2.3.8 Formatting Date Fields

var_date = ["issue_d","earliest_cr_line","sec_app_earliest_cr_line"]
def trans_format(time_string, from_format, to_format):
    # parse a date string such as 'Mar-2019' and return a datetime truncated to the month
    if pd.isnull(time_string):
        return np.nan
    else:
        time_struct = time.strptime(time_string, from_format)
        times = time.strftime(to_format, time_struct)
        times = datetime.datetime.strptime(times, "%Y-%m")
        return times
    
df_1['issue_d'] = df_1['issue_d'].apply(trans_format,args=('%b-%Y','%Y-%m',))
df_1['earliest_cr_line'] = df_1['earliest_cr_line'].apply(trans_format,args=('%b-%Y','%Y-%m',))
df_1['sec_app_earliest_cr_line'] = df_1['sec_app_earliest_cr_line'].apply(trans_format,args=('%b-%Y','%Y-%m',))
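
For example, parsing one of the LoanStats month strings:

trans_format('Mar-2019', '%b-%Y', '%Y-%m')
# returns datetime.datetime(2019, 3, 1, 0, 0)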

3. Feature Engineering

3.1 Converting Date Differences to Months

df_1['mth_interval']=df_1['issue_d']-df_1['earliest_cr_line']
df_1['sec_mth_interval']=df_1['issue_d']-df_1['sec_app_earliest_cr_line']

df_1['mth_interval'] = df_1['mth_interval'].apply(lambda x: round(x.days/30,0))
df_1['sec_mth_interval'] = df_1['sec_mth_interval'].apply(lambda x: round(x.days/30,0))
df_1['issue_m']=df_1['issue_d'].apply(lambda x: x.month)
##drop the original date variables
df_1 = df_1.drop(var_date, axis=1)

3.2 Ratio Features

# annual repayment as a share of annual income
index_1 = df_1["annual_inc"]==0
if sum(index_1)>0:
    df_1.loc[index_1,"annual_inc"] = 10  # placeholder income to avoid division by zero
df_1["pay_in_rate"] = df_1["installment"]*12/df_1["annual_inc"]
# cap the ratio: values in [1, 2) become 1, values >= 2 become 2
index_s1 = (df_1['pay_in_rate'] >=1) & (df_1['pay_in_rate'] <2) 
if sum(index_s1)>0:
    df_1.loc[index_s1,'pay_in_rate'] = 1
index_s2 = df_1['pay_in_rate'] >=2
if sum(index_s2)>0:
    df_1.loc[index_s2,'pay_in_rate'] = 2 
# open credit accounts over total accounts
df_1["credit_open_rate"] = df_1["open_acc"]/df_1["total_acc"]
# revolving balance over total current balance
df_1['revol_total_rate'] = df_1.revol_bal/df_1.tot_cur_bal
##total collections owed over this loan's installment, capped at 1
df_1['coll_loan_rate'] = df_1.tot_coll_amt/df_1.installment
index_s3 = df_1['coll_loan_rate'] >=1
if sum(index_s3)>0:
    df_1.loc[index_s3,'coll_loan_rate'] = 1
##satisfactory bankcard accounts over total bankcard accounts
df_1['good_bankcard_rate'] = df_1.num_bc_sats/df_1.num_bc_tl
##revolving accounts with balance > 0 over all revolving accounts
df_1['good_rev_accts_rate'] = df_1.num_rev_tl_bal_gt_0/df_1.num_rev_accts
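
One caveat, not handled in the original code: these divisions produce inf whenever a denominator such as total_acc or num_rev_accts is zero. A defensive step would be to convert infinities to missing values so that binning treats them like other NaNs:

df_1 = df_1.replace([np.inf, -np.inf], np.nan)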

3.3 Variable Binning

#separate categorical and continuous variables
def category_continue_separation(df,feature_names):
    categorical_var = []
    numerical_var = []
    if 'target' in feature_names:
        feature_names.remove('target')
    ##int/float dtypes are treated as continuous
    numerical_var = list(df[feature_names].select_dtypes(include=['int','float','int32','float32','int64','float64']).columns.values)
    categorical_var = [x for x in feature_names if x not in numerical_var]
    return categorical_var,numerical_var

categorical_var,numerical_var = category_continue_separation(df_1,list(df_1.columns))
# numeric variables with at most 10 distinct values are reclassified as categorical
for s in set(numerical_var):
    if len(df_1[s].unique())<=10:
        print('variable '+s+' has '+str(len(df_1[s].unique()))+' distinct values')
        categorical_var.append(s)
        numerical_var.remove(s)
        ##convert these reclassified variables to strings
        index_1 = df_1[s].isnull()
        if sum(index_1) > 0:
            df_1.loc[~index_1,s] = df_1.loc[~index_1,s].astype('str')
        else:
            df_1[s] = df_1[s].astype('str')

Split the data into training and test sets:

#train/test split, stratified on the target to preserve the good/bad ratio
data_train, data_test = train_test_split(df_1,  test_size=0.2,stratify=df_1.target,random_state=25)
sum(data_train.target==0)/data_train.target.sum()
sum(data_test.target==0)/data_test.target.sum()

Bin the continuous variables. The routine cont_var_bin comes from the helper module; a hypothetical stand-in sketch follows the code below.

# continuous variable binning
dict_cont_bin = {}
for i in numerical_var:
    dict_cont_bin[i],gain_value_save,gain_rate_save = varbin_meth.cont_var_bin(data_train[i], data_train.target, method=2, mmin=4, mmax=12,
                                 bin_rate=0.01, stop_limit=0.05, bin_min_num=20)
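
cont_var_bin itself is not shown here; judging by the argument names (an assumption, since the module ships with the reference book), method selects the split criterion, mmin/mmax bound the number of bins, and bin_min_num is the minimum sample count per bin. The real routine is supervised — it takes data_train.target — but as a rough, hypothetical stand-in, an unsupervised equal-frequency binning that produces a similar cut-point table could look like:

def simple_freq_bin(x, n_bins=8):
    # hypothetical stand-in: equal-frequency cut points for one continuous variable
    edges = pd.qcut(x.dropna(), q=n_bins, retbins=True, duplicates='drop')[1]
    return pd.DataFrame({'bin_low': edges[:-1], 'bin_up': edges[1:]})

# e.g. simple_freq_bin(data_train['annual_inc'])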

Bin the categorical variables:

dict_disc_bin = {}
del_key = []
for i in categorical_var:
    dict_disc_bin[i],gain_value_save , gain_rate_save ,del_key_1 = varbin_meth.disc_var_bin(data_train[i], data_train.target, method=2, mmin=4,
                                 mmax=10, stop_limit=0.05, bin_min_num=20)
    if len(del_key_1)>0 :
        del_key.extend(del_key_1)

#drop variables that ended up with a single bin
if len(del_key) > 0:
    for j in del_key:
        del dict_disc_bin[j]

Apply the bin mappings to the training and test data:

# map continuous variables to their bins (training set)
df_cont_bin_train = pd.DataFrame()
for i in dict_cont_bin.keys():
    df_cont_bin_train = pd.concat([ df_cont_bin_train , varbin_meth.cont_var_bin_map(data_train[i], dict_cont_bin[i]) ], axis = 1)
# map categorical variables to their bins (training set)
df_disc_bin_train = pd.DataFrame()
for i in dict_disc_bin.keys():
    df_disc_bin_train = pd.concat([ df_disc_bin_train , varbin_meth.disc_var_bin_map(data_train[i], dict_disc_bin[i]) ], axis = 1)

# map continuous variables to their bins (test set)
df_cont_bin_test = pd.DataFrame()
for i in dict_cont_bin.keys():
    df_cont_bin_test = pd.concat([ df_cont_bin_test , varbin_meth.cont_var_bin_map(data_test[i], dict_cont_bin[i]) ], axis = 1)
# map categorical variables to their bins (test set)
df_disc_bin_test = pd.DataFrame()
for i in dict_disc_bin.keys():
    df_disc_bin_test = pd.concat([ df_disc_bin_test , varbin_meth.disc_var_bin_map(data_test[i], dict_disc_bin[i]) ], axis = 1)

Assemble the binned training and test sets:

df_disc_bin_train["target"] = data_train["target"]
data_train_bin = pd.concat([df_cont_bin_train, df_disc_bin_train], axis=1)
df_disc_bin_test["target"] = data_test["target"]
data_test_bin = pd.concat([df_cont_bin_test, df_disc_bin_test], axis=1)

data_train_bin.reset_index(inplace=True, drop=True)
data_test_bin.reset_index(inplace=True, drop=True)
var_all_bin = list(data_train_bin.columns)
var_all_bin.remove("target")
print(len(var_all_bin))

After feature processing, 94 features remain.

3.4 WOE Encoding

data_path = file_path
##WOE-encode the training set
df_train_woe, dict_woe_map, dict_iv_values ,var_woe_name = var_encode.woe_encode(data_train_bin,data_path,var_all_bin, data_train_bin.target,'dict_woe_map', flag='train')
##WOE-encode the test set
df_test_woe, var_woe_name = var_encode.woe_encode(data_test_bin,data_path,var_all_bin, data_test_bin.target, 'dict_woe_map',flag='test')
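
var_encode.woe_encode is also part of the companion code. For reference, WOE for bin i is typically ln((bad_i/bad_total)/(good_i/good_total)), and IV = Σ (bad_i/bad_total − good_i/good_total) · WOE_i (sign conventions vary between sources). A minimal sketch of the computation, not the module's actual implementation:

def woe_iv(bin_col, target):
    # bin_col: a binned variable; target: 1 = bad, 0 = good
    ct = pd.crosstab(bin_col, target)
    good_dist = ct[0] / ct[0].sum()   # share of goods per bin
    bad_dist = ct[1] / ct[1].sum()    # share of bads per bin
    woe = np.log(bad_dist / good_dist)  # empty bins would need smoothing
    iv = ((bad_dist - good_dist) * woe).sum()
    return woe, iv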

4. Feature Selection

4.1 Preliminary Screening by IV

# IV-based screening
def iv_selection_func(bin_data, data_params, iv_low=0.02, iv_up=5, label='target'):
    # keep only variables whose IV falls in [iv_low, iv_up)
    selected_features = []
    for k, v in data_params.items():
        if iv_low <= v < iv_up and k in bin_data.columns:
            selected_features.append(k+'_woe')
        else:
            print('{0} has IV {1}, outside the threshold range; dropped'.format(k, v))
    selected_features.append(label)
    return bin_data[selected_features]
# preliminary IV screening: keep variables with IV >= 0.01
# (IV below roughly 0.02 is conventionally considered weak; a looser 0.01 is used here)
df_train_woe = iv_selection_func(df_train_woe,dict_iv_values,iv_low=0.01)

4.2 Correlation Screening

##correlation screening: when two variables have absolute Pearson correlation >= 0.8,
##drop the one with the smaller IV
sel_var = list(df_train_woe.columns) 
sel_var.remove('target')
###loop: if a variable is correlated with several others, drop only the smallest-IV
###one each round, until no pair exceeds 0.8
while True:
    pearson_corr = (np.abs(df_train_woe[sel_var].corr()) >= 0.8)
    # if only the diagonal is True, no correlated pairs remain
    if pearson_corr.sum().sum() <= len(sel_var):
        break
    del_var = []
    for i in sel_var:
        var_1 = list(pearson_corr.index[pearson_corr[i]].values)
        if len(var_1)>1 :
            df_temp = pd.DataFrame({'value':var_1,'var_iv':[ dict_iv_values[x.split(sep='_woe')[0]] for x in var_1 ]})
            del_var.extend(list(df_temp.value.loc[df_temp.var_iv == df_temp.var_iv.min(),].values))
    del_var1 = list(np.unique(del_var) )      
    ##remove the flagged variables
    sel_var = [s for s in sel_var if s not in del_var1]

4.3 Tree-Based Feature Selection

The feature_selector package (Will Koehrsen's FeatureSelector) is used here to complete the tree-model-based selection; see the inspection note after the code.

##feature selection
fs = FeatureSelector(data = df_train_woe[sel_var], labels = data_train_bin.target)
##flag all failing features in one pass
fs.identify_all(selection_params = {'missing_threshold': 0.9, 
                                     'correlation_threshold': 0.8, 
                                     'task': 'classification', 
                                     'eval_metric': 'binary_error',
                                     'max_depth':2,
                                     'cumulative_importance': 0.90})

df_train_woe = fs.remove(methods = 'all')
df_train_woe['target'] = data_train_bin.target
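
Assuming the standard feature_selector package, identify_all combines the missing-value, single-unique, collinearity, zero-importance, and low-importance checks (feature importances come from a LightGBM model), and records the features flagged by each check in fs.ops:

for method, feats in fs.ops.items():
    print(method, len(feats))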

5. Model Training

5.1 SMOTE Sample Generation

var_woe_name = list(df_train_woe.columns)
var_woe_name.remove('target')
# randomly draw roughly 20,000 good samples; SMOTE then only needs to synthesize
# bads against this subset rather than against all goods
df_temp_normal = df_train_woe[df_train_woe["target"]==0]
df_temp_normal.reset_index(drop=True, inplace=True)
index_1 = np.random.randint( low = 0,high = df_temp_normal.shape[0]-1,size=20000)
index_1 = np.unique(index_1)
index_1
df_temp =  df_temp_normal.loc[index_1]
index_2 = [x for x in range(df_temp_normal.shape[0]) if x not in index_1 ]
df_temp_other = df_temp_normal.loc[index_2]
df_temp = pd.concat([df_temp,df_train_woe[df_train_woe.target==1]],axis=0,ignore_index=True)

##generate synthetic samples from the randomly drawn subset
sm_sample_1 = SMOTE(random_state=10,sampling_strategy=1,k_neighbors=5)
x_train, y_train = sm_sample_1.fit_resample(df_temp[var_woe_name], df_temp.target)
x_train.shape

##merge the remaining good samples back in
x_train = np.vstack([x_train, np.array(df_temp_other[var_woe_name])])
y_train = np.hstack([y_train, np.array(df_temp_other.target)])

# good/bad ratio after resampling
sum(y_train==0)/sum(y_train)

##drop test-set rows that contain missing WOE values
del_list = []
for s in var_woe_name:
    index_s = df_test_woe[s].isnull()
    if sum(index_s)> 0:
        del_list.extend(list(df_test_woe.index[index_s]))
if len(del_list)>0:
    list_1 = [x for x in list(df_test_woe.index) if x not in del_list ]
    df_test_woe = df_test_woe.loc[list_1]

    x_test = df_test_woe[var_woe_name]
    x_test = np.array(x_test)
    y_test = np.array(df_test_woe.target.loc[list_1])
else:
    x_test = df_test_woe[var_woe_name]
    x_test = np.array(x_test)
    y_test = np.array(df_test_woe.target)

5.2 Model Construction

##hyperparameters to optimize
lr_param = {'C': [0.01, 0.1, 0.2, 0.5, 1, 1.5, 2],
            'class_weight': [{1: 1, 0: 1},  {1: 2, 0: 1}, {1: 3, 0: 1}, {1: 5, 0: 1}]}
##set up the grid search
lr_gsearch = GridSearchCV(
    estimator=LogisticRegression(random_state=0, fit_intercept=True, penalty='l2', solver='saga'),
    param_grid=lr_param, cv=3, scoring='f1', n_jobs=-1, verbose=2)
##run the hyperparameter search
lr_gsearch.fit(x_train, y_train)
print('logistic model best_score_ is {0},and best_params_ is {1}'.format(lr_gsearch.best_score_,
                                                                         lr_gsearch.best_params_))

##initialize the logistic model with the best parameters
LR_model = LogisticRegression(C=lr_gsearch.best_params_['C'], penalty='l2', solver='saga',
                                class_weight=lr_gsearch.best_params_['class_weight'])

5.3 Model Evaluation

5.3.1 Computing the Confusion Matrix, Recall, and Precision

LR_model_fit = LR_model.fit(x_train, y_train)

##model evaluation on the test set
y_pred = LR_model_fit.predict(x_test)
##confusion matrix, recall and precision
cnf_matrix = confusion_matrix(y_test, y_pred)
recall_value = recall_score(y_test, y_pred)
precision_value = precision_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print("Confusion matrix:")
print(cnf_matrix)
print('recall: {0}, precision: {1}'.format(recall_value,precision_value)) 

5.3.2 Computing the Gini Coefficient (AR), KS, and AUC

##predicted probabilities for the positive class
y_score_test = LR_model_fit.predict_proba(x_test)[:, 1]
##compute AR (Gini), KS and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_score_test)
roc_auc = auc(fpr, tpr)
ks = max(tpr - fpr)
ar = 2*roc_auc-1 
print('Gini coefficient (AR): {0}, KS: {1}, AUC: {2}'.format(ar,ks,roc_auc)) 
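
As a quick sanity check of the formulas: an AUC of, say, 0.70 would give AR = 2 × 0.70 − 1 = 0.40, while KS is simply the largest vertical gap between the cumulative TPR and FPR curves.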

5.3.3 Plotting the KS Curve

#KS curve
plt.figure(figsize=(10,6))
fontsize_1 = 12
plt.plot(np.linspace(0,1,len(tpr)),tpr,'--',color='black', label='tpr')
plt.plot(np.linspace(0,1,len(tpr)),fpr,':',color='black', label='fpr')
plt.plot(np.linspace(0,1,len(tpr)),tpr - fpr,'-',color='g',label="KS")
plt.grid()
plt.xticks( fontsize=fontsize_1)
plt.yticks( fontsize=fontsize_1)
plt.xlabel('probability group',fontsize=fontsize_1)
plt.ylabel('cumulative share (%)',fontsize=fontsize_1)
plt.legend(loc="upper left",fontsize=fontsize_1)

5.3.4 Plotting the ROC Curve

#ROC curve
plt.figure(figsize=(10,6))
lw = 2
fontsize_1 = 12
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='black', lw=lw, linestyle='--', label="random guess")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks( fontsize=fontsize_1)
plt.yticks( fontsize=fontsize_1)
plt.xlabel('FPR',fontsize=fontsize_1)
plt.ylabel('TPR',fontsize=fontsize_1)
plt.title('ROC',fontsize=fontsize_1)
plt.legend(loc="lower right",fontsize=fontsize_1)

6. Scorecard Generation

6.1 Extracting the Logistic Model's Weights and Intercept

#save the model parameters for score calculation
var_woe_name.append('intercept')
##extract the coefficients
weight_value = list(LR_model_fit.coef_.flatten())
##append the intercept
weight_value.extend(list(LR_model_fit.intercept_))
dict_params = dict(zip(var_woe_name,weight_value))

#predicted probabilities on the training and test sets
y_score_train = LR_model_fit.predict_proba(x_train)[:, 1]
y_score_test = LR_model_fit.predict_proba(x_test)[:, 1]

pd.DataFrame(dict_params,index=["weight_value"]).T

6.2 Generating the Scorecard

def score_params_cal(base_point, odds, PDO):
    ##given a reference score at given odds and the PDO (points to double the odds),
    ##derive the scaling parameters A and B
    B = PDO/np.log(2)  
    A = base_point + B*np.log(odds)
    return A, B 
def myfunc(x):
    return str(x[0])+'_'+str(x[1])
##build the scorecard
def create_score(dict_woe_map,dict_params,dict_cont_bin,dict_disc_bin):
    ##assume odds of 1:60 correspond to a reference score of 600 with PDO = 20;
    ##this gives the scaling parameters B = 28.85 and A = 481.86
    params_A,params_B = score_params_cal(base_point=600, odds=1/60, PDO=20)
    # base score
    base_points = round(params_A - params_B * dict_params['intercept'])
    df_score = pd.DataFrame()
    dict_bin_score = {}
    for k in dict_params.keys():
        if k !='intercept':
            df_temp =  pd.DataFrame([dict_woe_map[k.split(sep='_woe')[0]]]).T
            df_temp.reset_index(inplace=True)
            df_temp.columns = ['bin','woe_val']
            ##score per bin: -B * WOE * coefficient
            df_temp['score'] = round(-params_B*df_temp.woe_val*dict_params[k])
            dict_bin_score[k.split(sep='_BIN')[0]] = dict(zip(df_temp['bin'],df_temp['score']))
            ##continuous variables
            if k.split(sep='_BIN')[0] in dict_cont_bin.keys():
                df_1 = dict_cont_bin[k.split(sep='_BIN')[0]]
                df_1['var_name'] = df_1[['bin_low', 'bin_up']].apply(myfunc,axis=1)
                df_1 = df_1[['total', 'var_name']]
                df_temp = pd.merge(df_temp , df_1,on='bin')
                df_temp['var_name_raw'] = k.split(sep='_BIN')[0]
                df_score = pd.concat([df_score,df_temp],axis=0)
            ##categorical variables
            elif k.split(sep='_BIN')[0] in dict_disc_bin.keys():
                df_temp = pd.merge(df_temp , dict_disc_bin[k.split(sep='_BIN')[0]],on='bin')
                df_temp['var_name_raw'] = k.split(sep='_BIN')[0]
                df_score = pd.concat([df_score,df_temp],axis=0)

    df_score['score_base'] =  base_points 
    return df_score,dict_bin_score,params_A,params_B,base_points

df_score,dict_bin_score,params_A,params_B,score_base = create_score(dict_woe_map,dict_params,dict_cont_bin,dict_disc_bin)
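
A quick check of the scaling parameters under the assumption stated in the code comment (a score of 600 at odds of 1:60, PDO = 20):

params_A, params_B = score_params_cal(base_point=600, odds=1/60, PDO=20)
# params_B = 20 / ln(2) ≈ 28.85
# params_A = 600 + 28.85 * ln(1/60) ≈ 481.86
# a sample's total score is then A - B*(intercept + Σ coef_i · WOE_i)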
##compute per-sample scores
def cal_score(df_1,dict_bin_score,dict_cont_bin,dict_disc_bin,base_points):
    ##map the raw data to bins, then map bins to scores via dict_bin_score;
    ##the final score is the base score plus the per-variable scores
    df_1.reset_index(drop=True,inplace = True)
    df_all_score = pd.DataFrame()
    ##continuous variables
    for i in dict_cont_bin.keys():
        if i in dict_bin_score.keys():
            df_all_score = pd.concat([ df_all_score , varbin_meth.cont_var_bin_map(df_1[i], dict_cont_bin[i]).map(dict_bin_score[i]) ], axis = 1)
    ##categorical variables
    for i in dict_disc_bin.keys():
        if i in dict_bin_score.keys():
            df_all_score = pd.concat([ df_all_score ,varbin_meth.disc_var_bin_map(df_1[i], dict_disc_bin[i]).map(dict_bin_score[i]) ], axis = 1)
    
    df_all_score.columns = [x.split(sep='_BIN')[0] for x in list(df_all_score.columns)]
    df_all_score['base_score'] = base_points    
    df_all_score['score'] = df_all_score.apply(sum,axis=1)
    df_all_score['target'] = df_1.target
    return df_all_score

##score all samples (train and test combined)
df_all = pd.concat([data_train,data_test],axis = 0)
df_all_score = cal_score(df_all,dict_bin_score,dict_cont_bin,dict_disc_bin,score_base)
df_all_score.score.max()
df_all_score.score.min()

6.3 Computing Metrics by Score Band

The maximum sample score is 981 and the minimum is 335. The score range is therefore set to [300, 900] in 50-point bands, with scores above 900 capped at 900. The grouping code and results follow.

##simple score-band statistics
df_all_score.loc[df_all_score.score > 900, 'score'] = 900  # cap scores at 900
good_total = sum(df_all_score.target == 0)
bad_total = sum(df_all_score.target == 1)
score_bin = np.arange(300,950,50)
bin_rate = []
bad_rate = []
ks = []
good_num = []
bad_num = []
score_bin_list = []
for i in range(len(score_bin)-1):
    ##select the samples in this score band
    if score_bin[i+1] == 900:
        index_1 = (df_all_score.score >= score_bin[i]) & (df_all_score.score <= score_bin[i+1]) 
    else:
        index_1 = (df_all_score.score >= score_bin[i]) & (df_all_score.score < score_bin[i+1]) 
    df_temp = df_all_score.loc[index_1,['target','score']]
    # band label
    score_bin_list.append("{}_{}".format(score_bin[i], score_bin[i+1]))
    ##per-band counts
    good_num.append(sum(df_temp.target==0))
    bad_num.append(sum(df_temp.target==1))
    ##share of all samples falling in this band
    bin_rate.append("{:.2f}%".format(df_temp.shape[0]/df_all_score.shape[0]*100))
    ##bad rate within the band
    bad_rate.append("{:.2f}%".format(df_temp.target.sum()/df_temp.shape[0]*100))
    ##KS at this band's upper cutoff: cumulative bad share minus cumulative good share
    ks.append(sum(bad_num[0:i+1])/bad_total - sum(good_num[0:i+1])/good_total )


df_result = pd.DataFrame({'score_bin':score_bin_list, 'good_num':good_num,'bad_num':bad_num,'bin_rate':bin_rate,
                         'bad_rate':bad_rate,'ks':ks}) 
df_result

References:

1. 《Python金融大数据风控建模实战:基于机器学习》 (Python Financial Big Data Risk-Control Modeling in Practice: Based on Machine Learning)
