1. Data Source Introduction
This article builds an application scorecard from the public loan data released by the US company Lending Club.
2. Data Acquisition and Preprocessing
This article uses Lending Club's 2019 Q1 application data. Download link: https://www.lendingclub.com/info/statistics.action
Baidu Netdisk download:
Link: https://pan.baidu.com/s/1sKLLLyyO8rxR4oHzZwJItQ
Extraction code: a551
2.1 Reading the Data
First, import the packages needed for the subsequent analysis and apply the related settings:
import os,sys
sys.path.append("./lendingclub_data")
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import time,datetime
import variable_bin_methods as varbin_meth
import variable_encode as var_encode
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc,confusion_matrix,recall_score,precision_score,accuracy_score
from sklearn.linear_model import LogisticRegression
from feature_selector import FeatureSelector
from imblearn.over_sampling import SMOTE
import missingno as msno
import matplotlib
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")  # suppress warnings
import seaborn as sns
# set the seaborn style
sns.set(style='white', font_scale=1.2)
%matplotlib inline
plt.rcParams["font.sans-serif"] = "SimHei"
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['axes.unicode_minus']=False
# allow multiple outputs per notebook cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Place the required files and data in the "lendingclub_data" folder under the current directory:
file_path = "./lendingclub_data/data"
file_name = "LoanStats_2019Q1.csv"
df_1 = pd.read_csv(os.path.join(file_path, file_name), header=1, low_memory=False)
df_1.head()
The dataset contains 115,677 samples and 144 fields. According to the variable descriptions, the variables can be divided into pre-loan (application) variables and post-loan variables.
2.2 Defining Good and Bad Samples
The variable loan_status is the label variable; its values are distributed as follows:
df_1.groupby(by=["loan_status"])[["int_rate"]].count()
Samples more than 15 days past due are defined as bad, which gives the following definition:
- Bad samples: more than 15 days past due, i.e. loan_status is 'Late (16-30 days)', 'Late (31-120 days)', or 'Charged Off'
- Indeterminate samples: fewer than 15 days past due as of the modeling date, i.e. loan_status is 'In Grace Period'
- Good samples: loan_status is 'Current' or 'Fully Paid'
Map the label variable accordingly, converting the string values to numeric ones:
df_1.rename(columns={"loan_status": "target"}, inplace=True)
print(df_1.shape)
# drop rows where target is null
df_1 = df_1.loc[~(df_1["target"].isnull())]
print(df_1.shape)
# map the target variable
def target_mapping(lst):
    # 'Late (16-30 days)', 'Late (31-120 days)', 'Charged Off' -> 1 (bad)
    # 'In Grace Period' -> 2 (indeterminate)
    # 'Current', 'Fully Paid' -> 0 (good)
    mapping = {}
    for elem in lst:
        if elem in ['Late (16-30 days)', 'Late (31-120 days)', 'Charged Off']:
            mapping[elem] = 1
        elif elem in ['In Grace Period']:
            mapping[elem] = 2
        elif elem in ['Current', 'Fully Paid']:
            mapping[elem] = 0
        else:
            mapping[elem] = 3
    return mapping
df_1["target"] = df_1["target"].map(target_mapping(df_1["target"].unique()))
df_1["target"].unique()
df_1 = df_1.loc[df_1["target"] <= 1]
print(df_1.shape)
After mapping, the target can take values in {0, 1, 2, 3}. Dropping the indeterminate samples and keeping only good and bad samples (labels 0 and 1) leaves 115,348 samples.
Compute the ratio of good to bad samples:
sum(df_1["target"]==0) / sum(df_1["target"]==1)
The good-to-bad ratio is about 139:1, so the class imbalance is severe (this is addressed with SMOTE in section 5.1).
2.3 Data Cleaning and Preprocessing
2.3.1 Removing Post-loan Variables
Post-loan variables are a form of data leakage for an application scorecard, since they are not available at application time; to keep leakage from inflating the model's apparent performance, they must be removed.
# 1. drop post-loan variables
var_del = [ 'collection_recovery_fee','initial_list_status','last_credit_pull_d','last_pymnt_amnt',
'last_pymnt_d','next_pymnt_d','out_prncp','out_prncp_inv','recoveries','total_pymnt',
'total_pymnt_inv','total_rec_int','total_rec_late_fee','total_rec_prncp','settlement_percentage' ]
df_1 = df_1.drop(var_del, axis=1)
2.3.2 Removing Lending Club's Own Assessment Results
# 2. drop LC's credit-assessment outputs; int_rate is also an LC output (the higher the rate, the higher the assessed risk), so it leaks information as well
var_del_1 = ['grade','sub_grade','int_rate']
df_1 = df_1.drop(var_del_1, axis=1)
2.3.3 Removing Variables with Many Missing Values
Here we drop variables whose missing rate is 95% or higher:
def del_na(df, colname_1, rate):
    # df: DataFrame
    # colname_1: list of column names
    # rate: missing-rate threshold; variables with a missing rate >= rate are dropped
    na_cols = df[colname_1].isna().sum().sort_values(ascending=False) / float(df.shape[0])
    na_del = na_cols[na_cols >= rate]
    df = df.drop(na_del.index, axis=1)
    return df, na_del
df_1, na_del = del_na(df_1, list(df_1.columns), rate=0.95)
na_del
2.3.4 Removing Single-valued Variables
If a variable takes only one value, it has no predictive power for the target and should be removed:
def constant_del(df, cols):
    dele_list = []
    for col in cols:
        uniq_vals = list(df[col].unique())
        if pd.isnull(uniq_vals).any():
            # one real value plus NaN still counts as single-valued
            if len(uniq_vals) == 2:
                dele_list.append(col)
                print("{} has only one value and is dropped".format(col))
        elif len(df[col].unique()) == 1:
            dele_list.append(col)
            print("{} has only one value and is dropped".format(col))
    df = df.drop(dele_list, axis=1)
    return df, dele_list
cols_name = list(df_1.columns)
cols_name.remove("target")
df_1, dele_list = constant_del(df_1, cols_name)
2.3.5 Removing Unevenly Distributed Variables
A variable is treated as uneven here if a single value accounts for 90% or more of all samples:
def tail_del(df, cols, rate):
    dele_list = []
    len_1 = df.shape[0]
    for col in cols:
        if len(df[col].unique()) < 5:
            if df[col].value_counts().max() / len_1 >= rate:
                dele_list.append(col)
                print("{} is unevenly distributed and is dropped".format(col))
    df = df.drop(dele_list, axis=1)
    return df, dele_list
cols_name_1 = list(df_1.columns)
cols_name_1.remove("target")
df_1, dele_list = tail_del(df_1, cols_name_1, rate=0.9)
The variable debt_settlement_flag is dropped.
2.3.6 Removing Uninformative Variables
Some variables duplicate the meaning of others and should be removed, e.g. title and purpose. emp_title is the borrower's job title and has far too many distinct values to be useful, so it is dropped directly; zip_code is postal-code information and is likewise dropped:
len(df_1.emp_title.unique())
var_del_2 = ["emp_title", "zip_code", "title"]
df_1 = df_1.drop(var_del_2, axis=1)
2.3.7 Cleaning Special Formats
Inspecting the dtypes shows a mix of string, float, and integer columns, with date fields stored as strings. The variable revol_util (revolving line utilization rate) contains a % sign, so we strip it and convert the column to float:
df_1["revol_util"] = df_1["revol_util"].str.replace("%","").astype("float")
2.3.8 Formatting Date Variables
var_date = ["issue_d", "earliest_cr_line", "sec_app_earliest_cr_line"]
def trans_format(time_string, from_format, to_format):
    if pd.isnull(time_string):
        return np.nan
    else:
        time_struct = time.strptime(time_string, from_format)
        times = time.strftime(to_format, time_struct)
        times = datetime.datetime.strptime(times, "%Y-%m")
        return times
df_1['issue_d'] = df_1['issue_d'].apply(trans_format,args=('%b-%Y','%Y-%m',))
df_1['earliest_cr_line'] = df_1['earliest_cr_line'].apply(trans_format,args=('%b-%Y','%Y-%m',))
df_1['sec_app_earliest_cr_line'] = df_1['sec_app_earliest_cr_line'].apply(trans_format,args=('%b-%Y','%Y-%m',))
3. Feature Engineering
3.1 Converting Date Differences into Months
df_1['mth_interval']=df_1['issue_d']-df_1['earliest_cr_line']
df_1['sec_mth_interval']=df_1['issue_d']-df_1['sec_app_earliest_cr_line']
df_1['mth_interval'] = df_1['mth_interval'].apply(lambda x: round(x.days/30,0))
df_1['sec_mth_interval'] = df_1['sec_mth_interval'].apply(lambda x: round(x.days/30,0))
df_1['issue_m']=df_1['issue_d'].apply(lambda x: x.month)
## drop the raw date variables
df_1 = df_1.drop(var_date, axis=1)
3.2 Ratio Features
# annual repayment as a share of annual income
index_1 = df_1["annual_inc"] == 0
if sum(index_1) > 0:
    # avoid division by zero for zero-income records
    df_1.loc[index_1, "annual_inc"] = 10
df_1["pay_in_rate"] = df_1["installment"] * 12 / df_1["annual_inc"]
index_s1 = (df_1['pay_in_rate'] >= 1) & (df_1['pay_in_rate'] < 2)
if sum(index_s1) > 0:
    df_1.loc[index_s1, 'pay_in_rate'] = 1
index_s2 = df_1['pay_in_rate'] >= 2
if sum(index_s2) > 0:
    df_1.loc[index_s2, 'pay_in_rate'] = 2
# ratio of open credit accounts to total accounts
df_1["credit_open_rate"] = df_1["open_acc"] / df_1["total_acc"]
# ratio of revolving balance to total current balance
df_1['revol_total_rate'] = df_1.revol_bal / df_1.tot_cur_bal
## ratio of total amounts owed to collections to this installment
df_1['coll_loan_rate'] = df_1.tot_coll_amt / df_1.installment
index_s3 = df_1['coll_loan_rate'] >= 1
if sum(index_s3) > 0:
    df_1.loc[index_s3, 'coll_loan_rate'] = 1
## ratio of bankcard accounts in good standing to all bankcard accounts
df_1['good_bankcard_rate'] = df_1.num_bc_sats / df_1.num_bc_tl
## ratio of revolving accounts with balance > 0 to all revolving accounts
df_1['good_rev_accts_rate'] = df_1.num_rev_tl_bal_gt_0 / df_1.num_rev_accts
3.3 Variable Binning
# separate categorical and continuous variables
def category_continue_separation(df, feature_names):
    categorical_var = []
    numerical_var = []
    if 'target' in feature_names:
        feature_names.remove('target')
    ## treat int/float dtypes as continuous
    numerical_var = list(df[feature_names].select_dtypes(
        include=['int', 'float', 'int32', 'float32', 'int64', 'float64']).columns.values)
    categorical_var = [x for x in feature_names if x not in numerical_var]
    return categorical_var, numerical_var
categorical_var, numerical_var = category_continue_separation(df_1, list(df_1.columns))
for s in set(numerical_var):
    if len(df_1[s].unique()) <= 10:
        print('variable ' + s + ' has ' + str(len(df_1[s].unique())) + ' distinct values')
        categorical_var.append(s)
        numerical_var.remove(s)
        ## also cast these reclassified numeric variables to strings
        index_1 = df_1[s].isnull()
        if sum(index_1) > 0:
            df_1.loc[~index_1, s] = df_1.loc[~index_1, s].astype('str')
        else:
            df_1[s] = df_1[s].astype('str')
Split the data into training and test sets (stratify keeps the good/bad ratio identical across the two splits, which the two ratio checks below confirm):
# train/test split
data_train, data_test = train_test_split(df_1, test_size=0.2,stratify=df_1.target,random_state=25)
sum(data_train.target==0)/data_train.target.sum()
sum(data_test.target==0)/data_test.target.sum()
Bin the continuous variables (varbin_meth is the binning module from the book's companion code, imported at the top):
# bin the continuous variables
dict_cont_bin = {}
for i in numerical_var:
    dict_cont_bin[i], gain_value_save, gain_rate_save = varbin_meth.cont_var_bin(
        data_train[i], data_train.target, method=2, mmin=4, mmax=12,
        bin_rate=0.01, stop_limit=0.05, bin_min_num=20)
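Since varbin_meth's internals are not shown here, a rough illustration may help: a continuous-variable binning step learns cut points on the training set only and then reuses them to map any dataset. The sketch below uses pd.qcut as a hypothetical stand-in, not the module's actual method=2 algorithm, and the column name in the usage comment is just an example:
import numpy as np
import pandas as pd

def simple_cont_bin(train_col, n_bins=5):
    # learn quantile cut points on the training column only
    _, bin_edges = pd.qcut(train_col, q=n_bins, retbins=True, duplicates='drop')
    bin_edges[0], bin_edges[-1] = -np.inf, np.inf  # open-ended outer bins
    return bin_edges

def simple_cont_bin_map(col, bin_edges):
    # map any column (train or test) onto the learned bins
    return pd.cut(col, bins=bin_edges)

# usage sketch (illustrative):
# edges = simple_cont_bin(data_train['annual_inc'])
# train_bins = simple_cont_bin_map(data_train['annual_inc'], edges)
# test_bins = simple_cont_bin_map(data_test['annual_inc'], edges)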
Bin the discrete variables:
dict_disc_bin = {}
del_key = []
for i in categorical_var:
    dict_disc_bin[i], gain_value_save, gain_rate_save, del_key_1 = varbin_meth.disc_var_bin(
        data_train[i], data_train.target, method=2, mmin=4,
        mmax=10, stop_limit=0.05, bin_min_num=20)
    if len(del_key_1) > 0:
        del_key.extend(del_key_1)
# drop variables that ended up with only one bin
if len(del_key) > 0:
    for j in del_key:
        del dict_disc_bin[j]
Map the training and test data onto the learned bins:
# map continuous variables to their bins (training set)
df_cont_bin_train = pd.DataFrame()
for i in dict_cont_bin.keys():
    df_cont_bin_train = pd.concat([df_cont_bin_train, varbin_meth.cont_var_bin_map(data_train[i], dict_cont_bin[i])], axis=1)
# map discrete variables to their bins (training set)
df_disc_bin_train = pd.DataFrame()
for i in dict_disc_bin.keys():
    df_disc_bin_train = pd.concat([df_disc_bin_train, varbin_meth.disc_var_bin_map(data_train[i], dict_disc_bin[i])], axis=1)
# map continuous variables to their bins (test set)
df_cont_bin_test = pd.DataFrame()
for i in dict_cont_bin.keys():
    df_cont_bin_test = pd.concat([df_cont_bin_test, varbin_meth.cont_var_bin_map(data_test[i], dict_cont_bin[i])], axis=1)
# map discrete variables to their bins (test set)
df_disc_bin_test = pd.DataFrame()
for i in dict_disc_bin.keys():
    df_disc_bin_test = pd.concat([df_disc_bin_test, varbin_meth.disc_var_bin_map(data_test[i], dict_disc_bin[i])], axis=1)
Assemble the binned training and test sets:
df_disc_bin_train["target"] = data_train["target"]
data_train_bin = pd.concat([df_cont_bin_train, df_disc_bin_train], axis=1)
df_disc_bin_test["target"] = data_test["target"]
data_test_bin = pd.concat([df_cont_bin_test, df_disc_bin_test], axis=1)
data_train_bin.reset_index(inplace=True, drop=True)
data_test_bin.reset_index(inplace=True, drop=True)
var_all_bin = list(data_train_bin.columns)
var_all_bin.remove("target")
print(len(var_all_bin))
After feature processing, 94 features remain.
3.4 WOE Encoding
data_path = file_path
## WOE-encode the training set
df_train_woe, dict_woe_map, dict_iv_values, var_woe_name = var_encode.woe_encode(
    data_train_bin, data_path, var_all_bin, data_train_bin.target, 'dict_woe_map', flag='train')
## WOE-encode the test set
df_test_woe, var_woe_name = var_encode.woe_encode(
    data_test_bin, data_path, var_all_bin, data_test_bin.target, 'dict_woe_map', flag='test')
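var_encode.woe_encode likewise comes from the book's companion code. For reference, the standard WOE/IV computation it is presumably built on: for each bin i, WOE_i = ln((bad_i/bad_total) / (good_i/good_total)), and IV = Σ_i (bad_i/bad_total − good_i/good_total) · WOE_i. A minimal pandas sketch under that sign convention (the function name, smoothing constant, and column in the usage comment are hypothetical):
import numpy as np
import pandas as pd

def woe_iv(bin_col, target, eps=0.5):
    # counts of good (target==0) and bad (target==1) samples per bin;
    # eps smooths empty bins to avoid log(0)
    ct = pd.crosstab(bin_col, target)
    good = ct[0] + eps
    bad = ct[1] + eps
    good_dist = good / good.sum()
    bad_dist = bad / bad.sum()
    woe = np.log(bad_dist / good_dist)  # under this convention, positive WOE = riskier bin
    iv = ((bad_dist - good_dist) * woe).sum()
    return woe, iv

# usage sketch on one binned training column (column name is hypothetical):
# woe_map, iv = woe_iv(data_train_bin['annual_inc_BIN'], data_train_bin['target'])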
4. Feature Selection
4.1 Initial Screening by IV
# IV screening
def iv_selection_func(bin_data, data_params, iv_low=0.02, iv_up=5, label='target'):
    # quick IV check: drop variables whose IV is too small
    selected_features = []
    for k, v in data_params.items():
        if iv_low <= v < iv_up and k in bin_data.columns:
            selected_features.append(k + '_woe')
        else:
            print('variable {0} has IV {1}, below the threshold; dropped'.format(k, v))
    selected_features.append(label)
    return bin_data[selected_features]
# initial IV screening: keep variables with IV >= 0.01
df_train_woe = iv_selection_func(df_train_woe, dict_iv_values, iv_low=0.01)
4.2 Correlation Screening
## correlation analysis: if the absolute Pearson correlation of two variables exceeds 0.8, drop the one with the smaller IV
sel_var = list(df_train_woe.columns)
sel_var.remove('target')
### loop: when a variable is correlated (>0.8) with several others, drop only the one with the smallest IV each round, until no pair exceeds 0.8
while True:
    pearson_corr = (np.abs(df_train_woe[sel_var].corr()) >= 0.8)
    # once no off-diagonal pair exceeds 0.8, only the diagonal remains True
    if pearson_corr.sum().sum() <= len(sel_var):
        break
    del_var = []
    for i in sel_var:
        var_1 = list(pearson_corr.index[pearson_corr[i]].values)
        if len(var_1) > 1:
            df_temp = pd.DataFrame({'value': var_1,
                                    'var_iv': [dict_iv_values[x.split(sep='_woe')[0]] for x in var_1]})
            del_var.extend(list(df_temp.value.loc[df_temp.var_iv == df_temp.var_iv.min(), ].values))
    del_var1 = list(np.unique(del_var))
    ## drop the highly correlated (>0.8) variables
    sel_var = [s for s in sel_var if s not in del_var1]
4.3 Variable Selection with Tree Models
Here the feature_selector package is used for the tree-model-based variable selection:
## feature selection
fs = FeatureSelector(data=df_train_woe[sel_var], labels=data_train_bin.target)
## remove all features that fail any criterion in one pass
fs.identify_all(selection_params = {'missing_threshold': 0.9,
'correlation_threshold': 0.8,
'task': 'classification',
'eval_metric': 'binary_error',
'max_depth':2,
'cumulative_importance': 0.90})
df_train_woe = fs.remove(methods = 'all')
df_train_woe['target'] = data_train_bin.target
5. Model Training
5.1 SMOTE Sample Generation
var_woe_name = list(df_train_woe.columns)
var_woe_name.remove('target')
df_temp_normal = df_train_woe[df_train_woe["target"] == 0]
df_temp_normal.reset_index(drop=True, inplace=True)
# randomly draw a subset of good samples to pair with the bad samples for SMOTE
index_1 = np.random.randint(low=0, high=df_temp_normal.shape[0] - 1, size=20000)
index_1 = np.unique(index_1)
index_1
df_temp = df_temp_normal.loc[index_1]
index_2 = [x for x in range(df_temp_normal.shape[0]) if x not in index_1]
df_temp_other = df_temp_normal.loc[index_2]
df_temp = pd.concat([df_temp, df_train_woe[df_train_woe.target == 1]], axis=0, ignore_index=True)
## generate synthetic samples from the randomly drawn subset
sm_sample_1 = SMOTE(random_state=10, sampling_strategy=1, k_neighbors=5)
x_train, y_train = sm_sample_1.fit_resample(df_temp[var_woe_name], df_temp.target)
x_train.shape
## merge back the remaining good samples
x_train = np.vstack([x_train, np.array(df_temp_other[var_woe_name])])
y_train = np.hstack([y_train, np.array(df_temp_other.target)])
sum(y_train == 0) / sum(y_train)
# drop test rows with missing WOE values
del_list = []
for s in var_woe_name:
    index_s = df_test_woe[s].isnull()
    if sum(index_s) > 0:
        del_list.extend(list(df_test_woe.index[index_s]))
if len(del_list) > 0:
    list_1 = [x for x in list(df_test_woe.index) if x not in del_list]
    df_test_woe = df_test_woe.loc[list_1]
    x_test = df_test_woe[var_woe_name]
    x_test = np.array(x_test)
    y_test = np.array(df_test_woe.target.loc[list_1])
else:
    x_test = df_test_woe[var_woe_name]
    x_test = np.array(x_test)
    y_test = np.array(df_test_woe.target)
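For intuition: SMOTE synthesizes each new bad sample by interpolating between a real bad sample and one of its k nearest bad-sample neighbors, x_new = x_i + λ·(x_nn − x_i) with λ drawn uniformly from [0, 1]. Below is a self-contained sketch of that interpolation step; it is illustrative only, not imblearn's actual implementation, and the helper name is hypothetical:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one_sample(x_minority, i, k=5, seed=0):
    # pick a random neighbor among the k nearest minority points and interpolate
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x_minority)
    _, idx = nn.kneighbors(x_minority[i:i + 1])  # idx[0][0] is point i itself
    neighbor = x_minority[rng.choice(idx[0][1:])]
    lam = rng.random()
    return x_minority[i] + lam * (neighbor - x_minority[i])

# usage sketch:
# x_bad = df_temp.loc[df_temp.target == 1, var_woe_name].to_numpy()
# x_new = smote_one_sample(x_bad, i=0)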
5.2 Model Construction
## hyperparameters to search over
lr_param = {'C': [0.01, 0.1, 0.2, 0.5, 1, 1.5, 2],
            'class_weight': [{1: 1, 0: 1}, {1: 2, 0: 1}, {1: 3, 0: 1}, {1: 5, 0: 1}]}
## initialize the grid search
lr_gsearch = GridSearchCV(
    estimator=LogisticRegression(random_state=0, fit_intercept=True, penalty='l2', solver='saga'),
    param_grid=lr_param, cv=3, scoring='f1', n_jobs=-1, verbose=2)
## run the hyperparameter search
lr_gsearch.fit(x_train, y_train)
print('logistic model best_score_ is {0}, and best_params_ is {1}'.format(lr_gsearch.best_score_,
                                                                          lr_gsearch.best_params_))
## initialize the logistic model with the best parameters found
LR_model = LogisticRegression(C=lr_gsearch.best_params_['C'], penalty='l2', solver='saga',
                              class_weight=lr_gsearch.best_params_['class_weight'])
5.3 Model Evaluation
5.3.1 Confusion Matrix, Recall, and Precision
LR_model_fit = LR_model.fit(x_train, y_train)
## evaluate on the test set
y_pred = LR_model_fit.predict(x_test)
## confusion matrix, recall, and precision
cnf_matrix = confusion_matrix(y_test, y_pred)
recall_value = recall_score(y_test, y_pred)
precision_value = precision_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print("Confusion matrix:")
print(cnf_matrix)
print('recall: {0}, precision: {1}'.format(recall_value, precision_value))
5.3.2 Computing the Gini Coefficient (AR), the KS Statistic, and the AUC
## predicted probabilities on the test set
y_score_test = LR_model_fit.predict_proba(x_test)[:, 1]
## compute AR (Gini), KS, and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_score_test)
roc_auc = auc(fpr, tpr)
ks = max(tpr - fpr)
ar = 2 * roc_auc - 1  # Gini/AR = 2*AUC - 1
print('Gini (AR): {0}, KS: {1}, AUC: {2}'.format(ar, ks, roc_auc))
5.3.3 Plotting the KS Curve
# KS curve
plt.figure(figsize=(10, 6))
fontsize_1 = 12
plt.plot(np.linspace(0, 1, len(tpr)), tpr, '--', color='black', label='tpr')
plt.plot(np.linspace(0, 1, len(tpr)), fpr, ':', color='black', label='fpr')
plt.plot(np.linspace(0, 1, len(tpr)), tpr - fpr, '-', color='g', label='KS')
plt.grid()
plt.xticks(fontsize=fontsize_1)
plt.yticks(fontsize=fontsize_1)
plt.xlabel('probability bucket', fontsize=fontsize_1)
plt.ylabel('cumulative share (%)', fontsize=fontsize_1)
plt.legend(loc="upper left", fontsize=fontsize_1)
5.3.4 Plotting the ROC Curve
# ROC curve
plt.figure(figsize=(10, 6))
lw = 2
fontsize_1 = 12
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='black', lw=lw, linestyle='--', label='random guess')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=fontsize_1)
plt.yticks(fontsize=fontsize_1)
plt.xlabel('FPR', fontsize=fontsize_1)
plt.ylabel('TPR', fontsize=fontsize_1)
plt.title('ROC', fontsize=fontsize_1)
plt.legend(loc="lower right", fontsize=fontsize_1)
6. Scorecard Generation
6.1 Extracting the Logistic Model's Weights and Intercept
# save the model parameters for score calculation
var_woe_name.append('intercept')
## extract the weights
weight_value = list(LR_model_fit.coef_.flatten())
## append the intercept
weight_value.extend(list(LR_model_fit.intercept_))
dict_params = dict(zip(var_woe_name, weight_value))
# inspect predicted probabilities on the training and test sets
y_score_train = LR_model_fit.predict_proba(x_train)[:, 1]
y_score_test = LR_model_fit.predict_proba(x_test)[:, 1]
pd.DataFrame(dict_params, index=["weight_value"]).T
6.2 Generating the Scorecard
def score_params_cal(base_point, odds, PDO):
    ## derive the scaling parameters A and B from a reference score, reference odds, and PDO
    B = PDO / np.log(2)
    A = base_point + B * np.log(odds)
    return A, B
def myfunc(x):
    return str(x[0]) + '_' + str(x[1])
## generate the scorecard
def create_score(dict_woe_map, dict_params, dict_cont_bin, dict_disc_bin):
    ## assume odds of 1:60 map to a reference score of 600 with PDO = 20;
    ## this yields scaling parameters B = 28.85 and A = 481.86
    params_A, params_B = score_params_cal(base_point=600, odds=1/60, PDO=20)
    # base score
    base_points = round(params_A - params_B * dict_params['intercept'])
    df_score = pd.DataFrame()
    dict_bin_score = {}
    for k in dict_params.keys():
        if k != 'intercept':
            df_temp = pd.DataFrame([dict_woe_map[k.split(sep='_woe')[0]]]).T
            df_temp.reset_index(inplace=True)
            df_temp.columns = ['bin', 'woe_val']
            ## score for each bin
            df_temp['score'] = round(-params_B * df_temp.woe_val * dict_params[k])
            dict_bin_score[k.split(sep='_BIN')[0]] = dict(zip(df_temp['bin'], df_temp['score']))
            ## continuous variables
            if k.split(sep='_BIN')[0] in dict_cont_bin.keys():
                df_1 = dict_cont_bin[k.split(sep='_BIN')[0]]
                df_1['var_name'] = df_1[['bin_low', 'bin_up']].apply(myfunc, axis=1)
                df_1 = df_1[['total', 'var_name']]
                df_temp = pd.merge(df_temp, df_1, on='bin')
                df_temp['var_name_raw'] = k.split(sep='_BIN')[0]
                df_score = pd.concat([df_score, df_temp], axis=0)
            ## discrete variables
            elif k.split(sep='_BIN')[0] in dict_disc_bin.keys():
                df_temp = pd.merge(df_temp, dict_disc_bin[k.split(sep='_BIN')[0]], on='bin')
                df_temp['var_name_raw'] = k.split(sep='_BIN')[0]
                df_score = pd.concat([df_score, df_temp], axis=0)
    df_score['score_base'] = base_points
    return df_score, dict_bin_score, params_A, params_B, base_points
df_score, dict_bin_score, params_A, params_B, score_base = create_score(dict_woe_map, dict_params, dict_cont_bin, dict_disc_bin)
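The A and B quoted in the comment follow from the standard score scaling Score = A − B·ln(odds): a PDO of 20 means the score rises by 20 points every time the odds of being bad halve, so B = PDO/ln 2 ≈ 28.85, and anchoring 600 points at odds of 1:60 gives A = 600 + B·ln(1/60) ≈ 481.86. A quick check:
import numpy as np

PDO, base_point, odds = 20, 600, 1 / 60
B = PDO / np.log(2)                # 28.8539...
A = base_point + B * np.log(odds)  # 481.8621...
print(round(A, 2), round(B, 2))    # 481.86 28.85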
## compute each sample's score
def cal_score(df_1, dict_bin_score, dict_cont_bin, dict_disc_bin, base_points):
    ## first map the raw data into bins, then map bins to scores via dict_bin_score;
    ## the final score is the base score plus each variable's score
    df_1.reset_index(drop=True, inplace=True)
    df_all_score = pd.DataFrame()
    ## continuous variables
    for i in dict_cont_bin.keys():
        if i in dict_bin_score.keys():
            df_all_score = pd.concat([df_all_score, varbin_meth.cont_var_bin_map(df_1[i], dict_cont_bin[i]).map(dict_bin_score[i])], axis=1)
    ## discrete variables
    for i in dict_disc_bin.keys():
        if i in dict_bin_score.keys():
            df_all_score = pd.concat([df_all_score, varbin_meth.disc_var_bin_map(df_1[i], dict_disc_bin[i]).map(dict_bin_score[i])], axis=1)
    df_all_score.columns = [x.split(sep='_BIN')[0] for x in list(df_all_score.columns)]
    df_all_score['base_score'] = base_points
    df_all_score['score'] = df_all_score.apply(sum, axis=1)
    df_all_score['target'] = df_1.target
    return df_all_score
## score all samples
df_all = pd.concat([data_train, data_test], axis=0)
df_all_score = cal_score(df_all, dict_bin_score, dict_cont_bin, dict_disc_bin, score_base)
df_all_score.score.max()
df_all_score.score.min()
6.3 Computing Metrics by Score Band
The maximum score is 981 and the minimum is 335; here the score range is set to [300, 900], grouped in bands of 50 points. The grouping code and results follow:
## simple score-band statistics
df_all_score.loc[df_all_score.score > 900, 'score'] = 900  # cap scores at 900
good_total = sum(df_all_score.target == 0)
bad_total = sum(df_all_score.target == 1)
score_bin = np.arange(300, 950, 50)
bin_rate = []
bad_rate = []
ks = []
good_num = []
bad_num = []
score_bin_list = []
for i in range(len(score_bin) - 1):
    ## select the samples in this score band
    if score_bin[i + 1] == 900:
        index_1 = (df_all_score.score >= score_bin[i]) & (df_all_score.score <= score_bin[i + 1])
    else:
        index_1 = (df_all_score.score >= score_bin[i]) & (df_all_score.score < score_bin[i + 1])
    df_temp = df_all_score.loc[index_1, ['target', 'score']]
    # score band label
    score_bin_list.append("{}_{}".format(score_bin[i], score_bin[i + 1]))
    ## metrics for this band
    good_num.append(sum(df_temp.target == 0))
    bad_num.append(sum(df_temp.target == 1))
    ## share of samples in this band
    bin_rate.append("{:.2f}%".format(df_temp.shape[0] / df_all_score.shape[0] * 100))
    ## bad rate within this band
    bad_rate.append("{:.2f}%".format(df_temp.target.sum() / df_temp.shape[0] * 100))
    ## KS using this band's upper bound as the cutoff score
    ks.append(sum(bad_num[0:i + 1]) / bad_total - sum(good_num[0:i + 1]) / good_total)
df_result = pd.DataFrame({'score_bin': score_bin_list, 'good_num': good_num, 'bad_num': bad_num,
                          'bin_rate': bin_rate, 'bad_rate': bad_rate, 'ks': ks})
df_result
Reference:
1. 《Python金融大数据风控建模实战:基于机器学习》 (Python Financial Big-Data Risk-Control Modeling in Practice: Based on Machine Learning)