Loan Default Data Mining (Credit Scorecard Model)

Author: 路人乙yh | Published 2020-07-07 20:20

    This article uses the 2017 Q2 slice of the public data from the Lending Club website. The data cover loan applicants' information, such as age, gender, marital status, education, loan amount, and assets (the independent variables), together with loan repayment status (the dependent variable). (The 2017 data were chosen to make comparison with other people's results easy.)
    The task is to predict from an applicant's past behavior and attributes whether they will become delinquent. The workflow: handle missing values, WOE-encode the raw variables, filter variables in turn by IV value, correlation coefficient, and significance, use SMOTE to address the class imbalance, fit a logistic regression for the binary classification (will the applicant default or not), and finally compute a score for each sample (for ease of business use, similar to a Zhima Credit score).
    Final results: auc = 0.953, ks = 0.802, accuracy_score = 0.938.
    Full code

    1. Data Download and Loading

    Baidu Netdisk: [netdisk link] password: let1
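    A minimal loading sketch (the file name below is hypothetical, and skiprows=1 assumes the banner line Lending Club puts at the top of its quarterly CSVs):

    import pandas as pd
    # hypothetical file name for the 2017 Q2 extract
    df = pd.read_csv('LoanStats_2017Q2.csv', skiprows=1, low_memory=False)
    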
    First, take a quick look at the data:

    In[1]: df.head()
    Out[1]: 
       loan_amnt  funded_amnt  ...  total_bc_limit total_il_high_credit_limit
    0     7500.0       7500.0  ...         35000.0                    92511.0
    1    20000.0      20000.0  ...         22900.0                    42517.0
    2    12000.0      12000.0  ...          9200.0                    30780.0
    3     6025.0       6025.0  ...         17600.0                        0.0
    4     4000.0       4000.0  ...          5000.0                    15523.0
    

    Inspect the DataFrame's info:

    [In]: df.info()
    [Out]: <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 105453 entries, 0 to 105452
    Columns: 145 entries, id to settlement_term
    dtypes: float64(107), object(38)
    memory usage: 116.7+ MB
    

    The data has 105,453 rows and 145 columns; 107 columns are float64 (numeric variables) and 38 are object type.

    2. Data Preprocessing

    2.1 Mapping the Dependent Variable

    In this dataset the dependent variable is the loan_status column, but a look at its values shows they are not binary:

    In[46]:  df.loan_status.value_counts() # 7 categories
    Out[46]: 
    Current               77347
    Fully Paid            19652
    Charged Off            4519
    Late (31-120 days)     2089
    In Grace Period        1083
    Late (16-30 days)       598
    Default                 163
    Name: loan_status, dtype: int64
    

    Business definitions:

    Fully Paid: fully repaid;  Current: payments up to date;
    Charged Off: written off as bad debt;  Late (31-120 days): 31-120 days past due;  Late (16-30 days): 16-30 days past due;
    In Grace Period: past due but within the grace period;  Default: more than 90 days past due
    

    Only Fully Paid and Current are non-delinquent; map the 7 values to {0, 1}:

    d = {'Current':0,
         'Fully Paid':0,
         'Charged Off':1,
         'Late (31-120 days)':1,
         'Late (16-30 days)':1,
         'In Grace Period':1,
        'Default':1}
    df.loan_status = df.loan_status.map(d)
    df = df[df['loan_status'].notnull()]
    

    Check the loan_status column again:

    In: df['loan_status'].value_counts(normalize=True)
    Out: 
    0    0.919849
    1    0.080151
    Name: loan_status, dtype: float64
    

    The mapping is done. Note that non-delinquent (0) and delinquent (1) are imbalanced; two remedies are available later in modeling: 1. use class weights in the loss function, or 2. use SMOTE to oversample the minority class (or repeatedly undersample the majority class and bag several classifiers). A minimal sketch of option 1 follows.
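    Option 1 is built into sklearn's LogisticRegression; a minimal sketch (this project ultimately goes the SMOTE route in section 5.3):

    from sklearn.linear_model import LogisticRegression
    # 'balanced' reweights each class inversely to its frequency in the loss
    lr_weighted = LogisticRegression(class_weight='balanced')
    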

    2.2 Handling Missing Values

    Our target is loan_status; this column has 2 nulls, and those two rows are removed (as above):
    df.loan_status.isnull().sum() # out = 2
    df = df[df.loan_status.notnull()]
    Columns missing more than 50% of their values rarely carry useful information; drop them outright:

    miss_large_col = \
        [k for k,v in dict(df.isnull().sum()/df.shape[0]).items() if v>=0.5]
    df = df.drop(miss_large_col,axis=1)
    

    miss_large_col contains 42 columns; after dropping them the data shape is (105451, 103). Check the current missing rates:

    In[53]: (df.isnull().sum() / df.shape[0]).sort_values(ascending=False)
    Out[53]: 
    mths_since_last_delinq        0.484765
    next_pymnt_d                  0.229215
    il_util                       0.126884
    mths_since_recent_inq         0.113313
    emp_title                     0.064314
    emp_length                    0.063508
    num_tl_120dpd_2m              0.050052
    mo_sin_old_il_acct            0.025149
    mths_since_rcnt_il            0.025149
    bc_util                       0.011294
    percent_bc_gt_75              0.010868
    bc_open_to_buy                0.010839
    mths_since_recent_bc          0.010270
    last_pymnt_d                  0.001157
    revol_util                    0.000711
    dti                           0.000711
    all_util                      0.000123
    avg_cur_bal                   0.000019
    out_prncp                     0.000000
    total_acc                     0.000000
    initial_list_status           0.000000
    

    18 columns still contain missing values. mths_since_last_delinq is below the 0.5 cut-off but still 48% missing, so drop it too:
    df = df.drop(['mths_since_last_delinq'], axis=1)
    Next, loop over the columns and find those where a single value accounts for 95% or more of the samples:

    In[58]: 
    tmp_list = []
    for x in df.drop(['loan_status'],axis=1).columns:
        if df[x].value_counts(normalize=True).iloc[0] >=0.95:
            tmp_list.append((x, df[x].value_counts(normalize=True).iloc[0]))
    tmp_list
    Out[58]: 
    [('pymnt_plan', 0.9995637784373785),
     ('total_rec_late_fee', 0.9741111985661587),
     ('recoveries', 0.9834235806203829),
     ('collection_recovery_fee', 0.9834425467752795),
     ('collections_12_mths_ex_med', 0.9786156603540981),
     ('policy_code', 1.0),
     ('acc_now_delinq', 0.9948127566357834),
     ('chargeoff_within_12_mths', 0.9915979933808119),
     ('delinq_amnt', 0.9955714028316469),
     ('num_tl_120dpd_2m', 0.9990516406616553),
     ('num_tl_30dpd', 0.9965860921186144),
     ('tax_liens', 0.9542346682345355),
     ('hardship_flag', 0.9993835999658609),
     ('disbursement_method', 0.9998198215284825),
     ('debt_settlement_flag', 0.9971740429204086)]
    

    If an attribute takes the same value for (almost) all samples, it cannot help predict delinquency, so drop every column whose most frequent value covers 95% or more of the rows:

    not_col=[]
    for x in df.drop(['loan_status'],axis=1).columns:
        if df[x].value_counts(normalize=True).iloc[0] >=0.95:
            not_col.append(x)
    df = df.drop(not_col,axis=1)
    print(df.shape[1]) # out = 88
    

    88 columns now remain; check their dtypes:
    df.dtypes.sort_values()
    Of these 88 columns, loan_status is int64; 'sub_grade', 'grade', 'initial_list_status', 'int_rate', 'term', 'emp_title', 'application_type', 'emp_length', 'issue_d', 'last_credit_pull_d', 'verification_status', 'purpose', 'title', 'zip_code', 'addr_state', 'next_pymnt_d', 'last_pymnt_d', 'revol_util', 'home_ownership', 'earliest_cr_line' (20 columns in all) are object type, and the remaining 67 are float64.
    Let's look at these 20 object columns more closely:

    In[68]: 
    object_col = list(df.select_dtypes(include=['O']).columns)
    df.loc[:,object_col].describe().T
    
    Out[68]: 
                          count unique                 top   freq
    term                 105451      2           36 months  77105
    int_rate             105451     65              16.02%   4956
    grade                105451      7                   C  36880
    sub_grade            105451     35                  C1   8088
    emp_title             98669  38551             Teacher   1999
    emp_length            98754     11           10+ years  35438
    home_ownership       105451      5            MORTGAGE  52502
    verification_status  105451      3     Source Verified  42033
    issue_d              105451      3            Jun-2017  38087
    purpose              105451     13  debt_consolidation  58557
    title                105451     12  Debt consolidation  58564
    zip_code             105451    851               112xx   1100
    addr_state           105451     49                  CA  13751
    earliest_cr_line     105451    627            Sep-2004    892
    revol_util           105376   1076                  0%    468
    initial_list_status  105451      2                   w  79488
    last_pymnt_d         105329     16            Jun-2018  54794
    next_pymnt_d          81280      2            Jul-2018  56176
    last_credit_pull_d   105451     17            Jun-2018  84157
    application_type     105451      2          Individual  98638
    

    emp_title and zip_code are categorical variables with more than 100 unique values, so drop them directly. int_rate is a percentage that was read as text because of the % sign, so convert it to a number. sub_grade is a finer-grained version of the grade credit rating and overlaps with it, so drop it for now (sub_grade might work better than grade; that would need an experiment). emp_length's unique values can be mapped to numeric years. addr_state is unrelated to repayment ability, so drop it. earliest_cr_line, last_pymnt_d, next_pymnt_d and last_credit_pull_d are dates and can be differenced against the current date to get time intervals (see the sketch after the code below). revol_util can be converted to a numeric variable. The operations:

    df = df.drop(['emp_title', 'zip_code', 'sub_grade', 'addr_state'], axis=1)
    df['revol_util'] = df['revol_util']\
        .map(lambda x: float(x.split('%')[0])/100 if not pd.isnull(x) else x)
    df['int_rate'] = df['int_rate']\
        .map(lambda x: float(x.split('%')[0])/100 if not pd.isnull(x) else x)
    
    df['emp_length'].unique()
    d = {'10+ years':10, '< 1 year':0, '7 years':7,'2 years':2, '1 year':1,
           '3 years':3, '9 years':9, '8 years':8, '5 years':5, '6 years':6, '4 years':4}
    df['emp_length'] = df['emp_length'].map(d)
    

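    A minimal sketch of the date-to-interval conversion described above, assuming a fixed observation date (the date here is hypothetical):

    # convert 'Mon-YYYY' strings into approximate months elapsed before the reference date
    ref = pd.to_datetime('2018-07-01')  # hypothetical observation date
    for col in ['earliest_cr_line', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
        dt = pd.to_datetime(df[col], format='%b-%Y')
        df[col] = (ref - dt).dt.days / 30  # NaT rows stay NaN and are imputed later
    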
    Check the object columns again after these transformations:

    object_col = list(df.select_dtypes(include=['O']).columns)
    object_col
    df.loc[:,object_col].describe().T
    # inspect each remaining column in turn
    for ob in object_col:
       print(ob, dict(df[ob].value_counts(normalize=True)))
    Then look at each column's grouped bar chart against loan_status.
    

    home_ownership is distributed roughly as {'MORTGAGE': 0.50, 'RENT': 0.39, 'OWN': 0.11, 'ANY': 4.7e-05, 'NONE': 1.9e-05}. 'ANY' and 'NONE' are far too rare, so replace them with the most frequent value, MORTGAGE:
    df.loc[df.home_ownership.isin(['ANY', 'NONE']), 'home_ownership'] = 'MORTGAGE'

    for i in object_col:
        pvt=pd.pivot_table(df[['loan_status',i]],index=i,columns="loan_status",aggfunc=len) 
        pvt.plot(kind="bar")
    

    Only a few of the charts are shown here. The bar chart of term against loan_status below shows that most borrowers choose 36 months, and that those borrowers also default less often.


    (Figure: grouped bar chart of term by loan_status)
    (Figure: grouped bar chart of grade by loan_status)

    For grade, the effect of each rating on the default rate is not directly visible from the chart. To quantify whether such a variable affects loan_status we use WOE encoding, a staple of credit scorecard modeling (formulas below; a detailed write-up is planned [TODO]).
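    For reference, the quantities computed by the binning code in section 3 are, for bin i:

    WOE_i = ln( (bad_i / bad_total) / (good_i / good_total) )
    IV = Σ_i (bad_i / bad_total − good_i / good_total) × WOE_i

    where bad_i and good_i are the delinquent and non-delinquent counts in the bin. A bin with zero good or zero bad samples makes WOE infinite, which is exactly the problem hit with purpose/wedding in section 3.1.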

    2.3 Imputing Missing Values

    Next, fill the missing values. The strategy: for numeric variables, if the missing rate exceeds 0.05, fill with -999 so the gap becomes a feature value of its own; otherwise fill with the median. The categorical variables have no missing values in this project; if they did, a new category or the most frequent value could be used.

    rate = dict(df.isnull().sum()/df.shape[0])
    rate
    # 1. numeric columns: if the missing rate is >= 0.05, replace NaN with -999
    # 2. categorical columns: none missing in this project
    cate_col = list(df.select_dtypes(include=['O']).columns) # 4 columns
    num_col = [x for x in df.columns if x not in cate_col and x!='loan_status'] # 57 columns
    
    d1 = [k for k,v in rate.items() if k in num_col and v>=0.05]
    for i in d1:
        df[i] = df[i].fillna(-999)
    d2 = [x for x in num_col if x not in d1]
    for i in d2:
        df[i] = df[i].fillna(df[i].median())
       
    df.loc[:,cate_col].isnull().sum() # no missing values in categorical columns
    

    3. WOE Encoding

    3.1 WOE Encoding for Categorical Variables

    import numpy as np
    import pandas as pd

    # per-category counts, bad rate, WOE and IV for one categorical column
    def binning_cate(df,col,target):
        total = df[target].count()
        bad = df[target].sum()
        good = total-bad
        group = df.groupby([col],as_index=True)
        bin_df = pd.DataFrame()
        bin_df['total'] = group[target].count()
        bin_df['totalrate'] = bin_df['total']/total
        bin_df['bad'] = group[target].sum()
        bin_df['badrate'] = bin_df['bad']/bin_df['total']
        bin_df['good'] = bin_df['total'] - bin_df['bad']
        bin_df['goodrate'] = bin_df['good']/bin_df['total']
        bin_df['badattr'] = bin_df['bad']/bad
        bin_df['goodattr'] = (bin_df['total']-bin_df['bad'])/good
        bin_df['woe'] = np.log(bin_df['badattr']/bin_df['goodattr'])
        bin_df['bin_iv'] = (bin_df['badattr']-bin_df['goodattr'])*bin_df['woe']
        bin_df['iv'] = bin_df['bin_iv'].sum()
        return bin_df
       
    cate_bin_df_list = []
    for col in cate_col:
        bin_df = binning_cate(df, col, 'loan_status')
        cate_bin_df_list.append(bin_df)
    
    # store each categorical variable's name and IV
    cate_iv_df = pd.DataFrame({'col':cate_col, 'iv':[x['iv'].iloc[0] for x in cate_bin_df_list]}).sort_values('iv',ascending=False).reset_index(drop=True)
    cate_iv_df
    

    The result:

    Out[168]: 
                       col        iv
    0              purpose       inf
    1                grade  0.476388
    2  verification_status  0.083826
    3  initial_list_status  0.022144
    4                title  0.018638
    5       home_ownership  0.017939
    6                 term  0.016072
    7              issue_d  0.005004
    8     application_type  0.000880
    

    The IV of purpose comes out as positive infinity, which is clearly unreasonable. This happens when some category has too few samples. Looking at the distribution below, wedding has only a single sample; delete it with df = df.loc[df.purpose != 'wedding']:

    df['purpose'].value_counts()
    Out[169]: 
    debt_consolidation    58557
    credit_card           21261
    home_improvement       9222
    other                  7140
    major_purchase         2616
    medical                1648
    car                    1334
    vacation               1170
    small_business         1034
    moving                  945
    house                   453
    renewable_energy         70
    wedding                   1
    Name: purpose, dtype: int64
    

    3.2 WOE Encoding for Numeric Variables

    To WOE-encode a numeric variable such as last_pymnt_d, fit a single-variable decision tree of that column against loan_status and use the tree's split points as the bin boundaries:

    In[181]: # bin numeric variables with a single-variable decision tree
    from sklearn.tree import DecisionTreeClassifier, _tree

    def tree_split(df,col,target,max_bin,min_binpct,nan_value):
        missing_rate = df[df[col]==nan_value].shape[0]/df.shape[0]
        if missing_rate < 0.05:
            x = np.array(df[col]).reshape(-1,1)
            y = np.array(df[target])
            tree = DecisionTreeClassifier(max_leaf_nodes=max_bin,min_samples_leaf=min_binpct)
            tree.fit(x,y)
            threshold = tree.tree_.threshold
            threshold = threshold[threshold!=_tree.TREE_UNDEFINED]
            split_list = sorted(threshold.tolist())
        else:
            x = np.array(df[df[col]!=nan_value][col]).reshape(-1,1)
            y = np.array(df[df[col]!=nan_value][target])
            tree = DecisionTreeClassifier(max_leaf_nodes=max_bin-1,min_samples_leaf=min_binpct)
            tree.fit(x,y)
            threshold = tree.tree_.threshold
            threshold = threshold[threshold!=_tree.TREE_UNDEFINED]
            split_list = sorted(threshold.tolist())
            split_list.insert(0,nan_value)
        return split_list
    
    # bin a numeric feature with the given cut points and compute WOE and IV
    def binning_num(df,col,target,cut):
        
        total = df[target].count()
        bad = df[target].sum()
        good = total-bad
        
        bucket = pd.cut(df[col],cut)
        group = df.groupby(bucket)
        bin_df = pd.DataFrame()
    
        bin_df['total'] = group[target].count()
        bin_df['totalrate'] = bin_df['total']/total
        bin_df['bad'] = group[target].sum()
        bin_df['badrate'] = bin_df['bad']/bin_df['total']
        bin_df['good'] = bin_df['total'] - bin_df['bad']
        bin_df['goodrate'] = bin_df['good']/bin_df['total']
        bin_df['badattr'] = bin_df['bad']/bad
        bin_df['goodattr'] = (bin_df['total']-bin_df['bad'])/good
        bin_df['woe'] = np.log(bin_df['badattr']/bin_df['goodattr'])
        bin_df['bin_iv'] = (bin_df['badattr']-bin_df['goodattr'])*bin_df['woe']
        bin_df['iv'] = bin_df['bin_iv'].sum()
        
        return bin_df
    
    num_dict={}
    for col in num_col:
        split_list = tree_split(df,col,'loan_status',5,0.05,-999)
        split_list.insert(0,float('-inf'))
        split_list.append(float('inf'))
        bin_df = binning_num(df,col,'loan_status',split_list)
        num_dict.setdefault(col,{})
        num_dict[col]['bin_df']=bin_df
        num_dict[col]['cut'] = split_list
    
    num_iv_df = pd.DataFrame({'col':num_col,'iv':[num_dict[x]['bin_df']['iv'].iloc[0] for x in num_col]})\
                                  .sort_values('iv',ascending=False).reset_index(drop=True)
    num_iv_df.head()
    Out[181]: 
                   col        iv
    0     last_pymnt_d  2.059883
    1  total_rec_prncp  1.171917
    2  last_pymnt_amnt  0.687479
    3        out_prncp  0.567522
    4    out_prncp_inv  0.567459
    

    4. Variable Selection

    4.1 Filtering by IV

    Based on business experience the threshold is set at 0.03; keeping variables with IV above 0.03 leaves 32 numeric variables and 2 categorical variables.

    # IV threshold 0.03 from business experience: 32 numeric and 2 categorical fields pass
    iv_select_num_col = list(num_iv_df[num_iv_df.iv>0.03]['col'])
    select_num_dict = {k:v for k,v in num_dict.items() if k in iv_select_num_col}
    len(iv_select_num_col)
    
    iv_select_cate_col = list(cate_iv_df[cate_iv_df.iv>0.03]['col'])
    len(iv_select_cate_col)
    
    iv_select_df = pd.concat([num_iv_df[num_iv_df.iv>0.03],cate_iv_df[cate_iv_df.iv>0.03]],axis=0).\
                                             sort_values('iv',ascending=False).reset_index(drop=True)
    df2 = df.loc[:,iv_select_num_col+iv_select_cate_col+['loan_status']]
    df2.shape
    

    4.2 Converting Raw Variables to WOE Variables

    Replace each selected raw variable with its WOE value:

    for col in iv_select_num_col:
        woe_list = list(select_num_dict[col]['bin_df']['woe'])
        cut = select_num_dict[col]['cut']
        df2[col+'_woe'] = pd.cut(df2[col], bins=cut, labels=woe_list)
    for col in iv_select_cate_col:
        woe_dict = dict([x for x in cate_bin_df_list if x.index.name==col][0]['woe'])
        df2[col+'_woe'] = df2[col].map(woe_dict)
    df2.head()
    
    df2_woe = df2.loc[:, [x for x in df2.columns if x.find('woe')>0]+['loan_status']]
    df2_woe.head()
    for col in df2_woe.columns:
        df2_woe[col] = df2_woe[col].astype('float64')
    

    There are now 35 columns: 34 independent variables and 1 dependent variable.

    4.3 Forward Stepwise Filtering by Correlation

    Start from the first variable and add one variable at a time, dropping any new variable whose absolute correlation with an already-kept variable is 0.65 or higher (the code below uses 0.65 as the threshold):

    # remove multicollinearity using pairwise correlations
    def forward_corr_delete(data,col_list):
        corr_list=[]
        corr_list.append(col_list[0])
        delete_col=[]
        for col in col_list[1:]:
            corr_list.append(col)
            corr = data.loc[:,corr_list].corr()
            corr_tup = [(k,v) for k,v in zip(corr[col].index,corr[col].values)]
            corr_value = [v for k,v in corr_tup if k!=col]
            if len([x for x in corr_value if abs(x)>=0.65])>0:
                delete_col.append(col)
                corr_list.remove(col) # exclude the dropped variable from later comparisons
        select_corr_col=[x for x in col_list if x not in delete_col]
        return select_corr_col
    
    corr_col = [x+'_woe' for x in iv_select_df.col]
    select_corr_col = forward_corr_delete(df2_woe,corr_col)
    len(select_corr_col)
    
    df2_woe2 = df2_woe.loc[:,select_corr_col+['loan_status']]
    df2_woe2.head()
    

    This filter leaves 17 variables.

    4.4 Removing Multicollinearity by Variance Inflation Factor (VIF)

    No multicollinearity is found in this step.

    # remove collinearity using variance inflation factors
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def vif_delete(df,list_corr):
        col_list = list_corr.copy()
        vifs_matrix = np.matrix(df[col_list])
        vifs_list = [variance_inflation_factor(vifs_matrix,i)for i in range(vifs_matrix.shape[1])]
        vif_high = [x for x,y in zip(col_list,vifs_list) if y>10]
        if len(vif_high)>0:
            for col in reversed(vif_high):
                col_list.remove(col)
                vif_matrix=np.matrix(df[col_list])
                vifs = [variance_inflation_factor(vif_matrix,i)for i in range(vif_matrix.shape[1])]
                if len([x for x in vifs if x>10])==0:
                    break
        return col_list
    
    vif_select_col = vif_delete(df2_woe2,select_corr_col)
    len(vif_select_col)
    

    4.5 Filtering by Significance

    Use statsmodels to test significance via p-values; the inq_fi_woe variable is removed.

    # significance filtering by p-value
    import statsmodels.api as sm

    def forward_pvalue_delete(x,y):
        col_list = x.columns.tolist()
        pvalues_col=[]
        for col in col_list:
            pvalues_col.append(col)
            x_const = sm.add_constant(x.loc[:,pvalues_col])
            sm_lr = sm.Logit(y,x_const)
            sm_lr = sm_lr.fit()
            pvalue = sm_lr.pvalues[col]
            if pvalue>=0.5: # author's threshold; note the conventional significance cut-off is 0.05
                pvalues_col.remove(col)
        return pvalues_col
    
    # split the data into features X and labels y
    x = df2_woe2.drop(['loan_status'],axis=1)
    y = df2_woe2['loan_status']
    # run the significance filter
    pvalues_col = forward_pvalue_delete(x,y)
    
    df2_woe3 = df2_woe2.loc[:, pvalues_col+['loan_status']]
    

    5. Modeling

    Use sklearn's logistic regression as the classifier.

    5.1 A Simple Model with Default Hyperparameters

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    x2 = df2_woe3.drop(['loan_status'],axis=1)
    y2 = df2_woe3['loan_status']
    x_train,x_test,y_train,y_test = train_test_split(x2,y2,test_size=0.2,random_state=2020)
    
    lr_model = LogisticRegression().fit(x_train,y_train)
    

    Evaluate the default-parameter model on several metrics: auc, ks, sensitivity, specificity, precision.

    from sklearn import metrics
    from sklearn.metrics import roc_curve
    import matplotlib.pyplot as plt

    # plot the ROC curve (roc_curve returns fpr, tpr, thresholds in that order)
    def plot_roc(y_label,y_pred):
        fpr,tpr,threshold = metrics.roc_curve(y_label,y_pred)
        AUC = metrics.roc_auc_score(y_label,y_pred)
        fig = plt.figure(figsize=(6,4))
        ax = fig.add_subplot(1,1,1)
        ax.plot(fpr,tpr,color='blue',label='AUC=%.3f'%AUC)
        ax.plot([0,1],[0,1],'r--')
        ax.set_xlim(0,1)
        ax.set_ylim(0,1)
        ax.set_title('ROC')
        ax.legend(loc='best')
        plt.show()
    # plot the KS curve
    def plot_model_ks(y_label,y_pred):
        pred_list = list(y_pred)
        label_list = list(y_label)
        total_bad = sum(label_list)
        total_good = len(label_list)-total_bad
        items = sorted(zip(pred_list,label_list),key=lambda x :x[0])
        step = (max(pred_list)-min(pred_list))/200
        
        pred_bin = []
        good_rate = []
        bad_rate = []
        ks_list = []
        for i in range(1,201):
            idx = min(pred_list)+i*step
            pred_bin.append(idx)
            label_bin = [x[1] for x in items if x[0]<idx]
            bad_num = sum(label_bin)
            good_num = len(label_bin)-bad_num
            goodrate = good_num/total_good
            badrate =  bad_num/total_bad
            ks = abs(goodrate-badrate)
            good_rate.append(goodrate)
            bad_rate.append(badrate)
            ks_list.append(ks)
        fig = plt.figure(figsize=(6,4))
        ax = fig.add_subplot(1,1,1)
        ax.plot(pred_bin,good_rate,color='green',label='good_rate')
        ax.plot(pred_bin,bad_rate,color='red',label='bad_rate')
        ax.plot(pred_bin,ks_list,color='blue',label='good-bad')
        ax.set_title('KS:{:.3f}'.format(max(ks_list)))
        ax.legend(loc='best')
        plt.show()
    
    y_pred = lr_model.predict_proba(x_test)[:,1]
    plot_roc(y_test,y_pred)
    plot_model_ks(y_test,y_pred)
    fpr,tpr,thre=roc_curve(y_test, y_pred)
    ks=max(tpr-fpr)
    

    With default parameters, auc = 0.950 and ks = 0.798; the ROC and KS curves are below.


    (Figure: ROC curve)
    (Figure: KS curve)

    5.2 Choosing Hyperparameters with Grid Search and Cross-Validation

    In[157]:
    # grid search with cross-validation
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    # parameter grid
    param_test1={"C":[0.01,0.1,1.0,10.0,20.0,30.0,100.0,200.0,300.0,1000.0], # inverse regularization strength
                "penalty":["l1","l2"], # regularization type; note "l1" needs solver='liblinear' or 'saga' in recent sklearn
                "max_iter":[100,200,300,400,500]} # maximum iterations for convergence
    gsearch1=GridSearchCV(LogisticRegression(),param_grid=param_test1,cv=10)
    gsearch1.fit(x_train,y_train)  # fit the grid
    gsearch1.best_params_, gsearch1.best_score_   # best parameter combination and its score
    Out[157]:
    ({'C': 10.0, 'max_iter': 100, 'penalty': 'l2'}, 0.9728544333807492)
    

    The best parameters are C=10.0, max_iter=100, penalty='l2' (the regularization term).
    A classifier trained with these optimal parameters shows no meaningful gain in auc or ks, suggesting that once logistic regression is fixed as the model for this project, changing a hyperparameter has little effect.

    5.3 Addressing Class Imbalance with SMOTE

    In the current data the delinquent class is only about 8% of samples, somewhat imbalanced. Use SMOTE to oversample the minority class into a balanced training set and check whether the metrics improve. Note: SMOTE may only be applied to the training set.

    In[237]: y.value_counts(normalize=True)
    Out[237]: 
    0.0    0.919848
    1.0    0.080152
    
    # use SMOTE to rebalance the classes
    from imblearn.over_sampling import SMOTE
    # oversample the minority class (training set only)
    smo = SMOTE(random_state=42)
    x_train2, y_train2 = smo.fit_resample(x_train, y_train)  # older imblearn versions call this fit_sample
    print('After balancing with SMOTE:')
    n_sample = y_train2.shape[0]
    n_pos_sample = y_train2[y_train2 == 0].shape[0]
    n_neg_sample = y_train2[y_train2 == 1].shape[0]
    print('samples: {}; class 0 (good): {:.2%}; class 1 (bad): {:.2%}'.format(n_sample,
                                                       n_pos_sample / n_sample,
                                                       n_neg_sample / n_sample))
    
    lr_model_smo = LogisticRegression().fit(x_train2,y_train2)
    y_pred_smo = lr_model_smo.predict_proba(x_test)[:,1]
    plot_roc(y_test,y_pred_smo)
    plot_model_ks(y_test, y_pred_smo)
    

    With SMOTE, auc = 0.953 and ks = 0.802, so constructing a balanced training set pays off in this project.
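    The accuracy_score quoted in the introduction can be obtained by thresholding the predicted probabilities; a minimal sketch, assuming the conventional 0.5 cut-off (the threshold actually used is not stated):

    from sklearn.metrics import accuracy_score
    y_pred_label = (y_pred_smo >= 0.5).astype(int)  # hypothetical 0.5 threshold
    accuracy_score(y_test, y_pred_label)
    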

    6. Scoring Each Sample

    On the current dataset each sample consists of features plus a predicted delinquency probability. For better interpretability on the business side, convert the probability into a credit score (similar to a Zhima Credit score). The scaling formulas are given below.
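    With B = PDO / ln(2) and A = score0 + B × ln(odds0), the standard scorecard scaling used by the code below gives

    score = A − B × ln(odds) = A − B × (intercept + Σ_i coef_i × woe_i)

    so the base score is A − B × intercept, and each feature contributes −B × coef_i × woe_i points. Here score0 = 400 at reference odds 999:1 with PDO = 20.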

    # compute the scaling constants A, B and the base score
    def cal_scale(score,odds,PDO,model):
        B = PDO/np.log(2)
        A = score+B*np.log(odds)
        base_score = A-B*model.intercept_[0]
        return A,B,base_score
    A,B,base_score = cal_scale(400,999/1,20,lr_model)
    # coefficient of each WOE column, used for the per-feature scores below
    coe_dict = dict(zip(x_test.columns, lr_model.coef_[0]))
    
    x_test_score = x_test.copy()
    for col in x_test_score.columns:
        col_coe = coe_dict[col]
        x_test_score[col.replace('woe','score')]=x_test_score[col].map(lambda x:round(x*-B*col_coe))
    x_test_score['score'] = round(base_score)
    for col in [x for x in x_test_score.columns if x.find('_score')>=0]:
        x_test_score['score']+=x_test_score[col]
    x_test_score['label']=list(y_test)
    
    import seaborn as sns
    sns.kdeplot(x_test_score[x_test_score['label']==1].score,shade=True,label='bad')
    sns.kdeplot(x_test_score[x_test_score['label']==0].score,shade=True,label='good')
    
    (Figure: score distributions of good vs. bad samples)

    The figure shows that the good and bad samples are well separated, but neither class follows a clean normal distribution, so the model still has limitations.

    7. Other Models and Model Ensembling (TODO)

    Logistic regression is widely used in this business for its parallelizability, fast training, and strong interpretability, but predicting delinquency is a classic machine-learning problem, so it is natural to try other models as well.

    7.1 LightGBM
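    A minimal LightGBM baseline sketch on the same WOE features (x_train, y_train, x_test, y_test from section 5; parameters are illustrative, not tuned):

    import lightgbm as lgb
    from sklearn import metrics

    lgb_model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, random_state=2020)
    lgb_model.fit(x_train, y_train)
    metrics.roc_auc_score(y_test, lgb_model.predict_proba(x_test)[:, 1])
    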

    7.2 DNN

    7.3 Model Ensembling

    Reference: https://zhuanlan.zhihu.com/p/152128764 (that article reports an AUC of only about 0.67).
