Financial Risk Control Loan Default Prediction Challenge: Task 4

Author: 怕热的波波 | Published 2020-09-25 00:39

    1. Importing the data (omitted)

    2. Inspecting the data (omitted)

    3. Feature engineering (omitted)

    4. Modeling and parameter tuning

    4.1 Model principles

    Logistic regression model (already covered)

    • Fast to train, good interpretability, low resource usage
    • Requires handling of missing values and outliers; cannot model non-linear relationships; struggles with multicollinearity and imbalanced data; accuracy is limited (a minimal baseline sketch follows this list)
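
    For reference, a logistic-regression baseline takes only a few lines of scikit-learn. The sketch below is minimal and uses synthetic data as a stand-in for the competition features (the make_classification setup is an assumption, not part of the original pipeline):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # synthetic, imbalanced stand-in for the competition features
    X, y = make_classification(n_samples=10000, n_features=20, weights=[0.8, 0.2], random_state=2020)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=2020)

    # logistic regression is sensitive to feature scale, so standardize first
    scaler = StandardScaler().fit(X_tr)
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    print('baseline AUC:', roc_auc_score(y_val, clf.predict_proba(scaler.transform(X_val))[:, 1]))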

    Decision tree model

    • Simple and intuitive, strong interpretability, minimal data preprocessing required
    • Prone to overfitting and weak generalization; the greedy splitting algorithm can settle on a local optimum (the sketch after this list illustrates the overfitting behavior)
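
    To make the overfitting point concrete, here is a small hedged sketch (again on synthetic data, an assumption for illustration): an unconstrained tree nearly memorizes the training set, while a depth-limited one generalizes better.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, n_features=20, flip_y=0.1, random_state=2020)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=2020)

    for depth in [None, 5]:
        tree = DecisionTreeClassifier(max_depth=depth, random_state=2020).fit(X_tr, y_tr)
        # the unconstrained tree scores near 1.0 on train but clearly lower on validation
        print('max_depth={}: train={:.3f}, valid={:.3f}'.format(
            depth, tree.score(X_tr, y_tr), tree.score(X_val, y_val)))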

    Boosting-based algorithms: GBDT, XGBoost, LightGBM, CatBoost

    4.2 Model evaluation methodology
    • The error on the training set is called the training error (also known as the empirical error); the error on the test set is called the test error
    • Overfitting arises when idiosyncrasies of the training samples are treated as general properties of all potential samples. This is why the available data is split into a training set and a test set, with the test set used to estimate how well the model discriminates on unseen samples
    • The splits should follow the same distribution as the data as a whole and be mutually exclusive (see the stratified-split sketch below)
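
    As a hedged illustration of distribution-preserving splits, the sketch below shows how StratifiedKFold keeps the label ratio consistent in every fold (the toy label vector is an assumption):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # toy labels with a 4:1 imbalance, mimicking a default/non-default target
    y = np.array([0] * 800 + [1] * 200)
    X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
    for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y)):
        # every validation fold preserves the overall positive rate of 0.2
        print('fold {}: positive rate = {:.2f}'.format(fold, y[val_idx].mean()))
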
    4.3 Modeling

    Import the required packages

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    import os
    import datetime
    %matplotlib inline
    warnings.filterwarnings('ignore')
    sns.set()
    # seaborn offers five plotting styles: darkgrid, whitegrid, dark, white, ticks; darkgrid is the default
    sns.set_style("whitegrid")
    # there are four preset contexts, from smallest to largest: paper, notebook, talk, poster; notebook is the default
    sns.set_context('talk')
    # fix the minus sign '-' rendering as a box in saved figures
    plt.rcParams['axes.unicode_minus'] = False
    plt.rcParams['font.sans-serif'] = ['SimHei']
    sns.set(font='SimHei')
    
    from tqdm import tqdm
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_selection import SelectKBest
    # chi-squared test (for feature selection)
    from sklearn.feature_selection import chi2
    from sklearn.preprocessing import MinMaxScaler
    import xgboost as xgb
    import lightgbm as lgb
    from catboost import CatBoostRegressor
    from sklearn.model_selection import StratifiedKFold, KFold
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
    

    Feature processing

    • Handle numerical and categorical variables separately
    • Fill in missing values
    • Process datetime fields
    • Convert ordinal categorical variables to numeric
    • Transform categorical variables
    • Handle outliers
    • Bin continuous numeric features
    • Encode high-cardinality categorical features
    • Assemble the data for fitting
    numerical_fea = list(train_data.select_dtypes(exclude=['object']).columns)
    category_fea = list(filter(lambda x: x not in numerical_fea, list(train_data.columns)))
    label= 'isDefault'
    numerical_fea.remove(label)
    
    # fill in missing values
    # numerical features: fill with the training-set median, which is robust to extreme values
    train_data[numerical_fea] = train_data[numerical_fea].fillna(train_data[numerical_fea].median())
    # fill the test set with training-set statistics so the two stay consistent
    test_data[numerical_fea] = test_data[numerical_fea].fillna(train_data[numerical_fea].median())
    # categorical features: fill with the mode (mode() returns a DataFrame, so take its first row)
    train_data[category_fea] = train_data[category_fea].fillna(train_data[category_fea].mode().iloc[0])
    test_data[category_fea] = test_data[category_fea].fillna(train_data[category_fea].mode().iloc[0])
    
    # process datetime fields
    for data in [train_data, test_data]:
        data['issueDate'] = pd.to_datetime(data['issueDate'], format="%Y-%m-%d")
        startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
        # derive a new feature: days elapsed since the earliest issue date
        data['issueDateDT'] = (data['issueDate'] - startdate).dt.days
    
    # convert the ordinal categorical variable employmentLength to integers
    def employmentLength_to_int(s):
        if pd.isnull(s):
            return s
        else:
            return np.int8(s.split()[0])
    for data in [train_data, test_data]:
        data['employmentLength'].replace(to_replace='10+ years', value='10 years', inplace=True)
        data['employmentLength'].replace('< 1 year', '0 year', inplace=True)
        data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
    for data in [train_data, test_data]:
        # keep only the year (last four characters) of earliesCreditLine
        data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))
    
    # transform categorical variables
    for data in [train_data, test_data]:
        data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
    # note: rebinding `data` inside a loop would not modify train_data/test_data,
    # so apply get_dummies to each frame explicitly; subGrade is excluded here
    # because it is label-encoded further below
    train_data = pd.get_dummies(train_data, columns=['homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)
    test_data = pd.get_dummies(test_data, columns=['homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)
    
    # flag outliers with the 3-sigma rule
    def find_outliers_by_3sigma(data, fea):
        data_std = np.std(data[fea])
        data_mean = np.mean(data[fea])
        outliers_cut_off = data_std * 3
        lower_rule = data_mean - outliers_cut_off
        upper_rule = data_mean + outliers_cut_off
        data[fea+'_outliers'] = data[fea].apply(lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
        return data
    train_data = train_data.copy()
    for fea in numerical_fea:
        train_data = find_outliers_by_3sigma(train_data, fea)
    
    # drop rows flagged as outliers (training set only)
    for fea in numerical_fea:
        train_data = train_data[train_data[fea+'_outliers']=='normal']
        train_data = train_data.reset_index(drop=True)
    
    # bin loanAmnt three ways: fixed width, log scale, and deciles
    for data in [train_data, test_data]:
        data['loanAmnt_bin1'] = np.floor_divide(data['loanAmnt'], 1000)  # 1000-wide buckets
        data['loanAmnt_bin2'] = np.floor(np.log10(data['loanAmnt']))  # order of magnitude
        data['loanAmnt_bin3'] = pd.qcut(data['loanAmnt'], 10, labels=False)  # decile bins
    
    # label-encode high-cardinality categorical features
    for col in tqdm(['employmentTitle', 'postCode', 'title', 'subGrade']):
        le = LabelEncoder()
        le.fit(list(train_data[col].astype(str).values) + list(test_data[col].astype(str).values))
        train_data[col] = le.transform(list(train_data[col].astype(str).values))
        test_data[col] = le.transform(list(test_data[col].astype(str).values))
    print('Label Encoding done')
    
    # drop columns that are no longer needed
    for data in [train_data, test_data]:
        data.drop(['issueDate','id'], axis=1,inplace=True)
    
    # preparation done; assemble the final feature matrices
    features = [f for f in train_data.columns if f not in ['id','issueDate','isDefault'] and '_outliers' not in f]
    x_train = train_data[features]
    x_test = test_data[features]
    y_train = train_data['isDefault']
    

    Define the models

    def cv_model(clf, train_x, train_y, test_x, clf_name):
        folds = 5
        seed = 2020
        kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    
        train = np.zeros(train_x.shape[0])
        test = np.zeros(test_x.shape[0])
    
        cv_scores = []
    
        for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
            print('************************************ {} ************************************'.format(str(i+1)))
            trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
    
            if clf_name == "lgb":
                train_matrix = clf.Dataset(trn_x, label=trn_y)
                valid_matrix = clf.Dataset(val_x, label=val_y)
    
                params = {
                    'boosting_type': 'gbdt',
                    'objective': 'binary',
                    'metric': 'auc',
                    'min_child_weight': 5,
                    'num_leaves': 2 ** 5,
                    'lambda_l2': 10,
                    'feature_fraction': 0.8,
                    'bagging_fraction': 0.8,
                    'bagging_freq': 4,
                    'learning_rate': 0.1,
                    'seed': 2020,
                    'nthread': 28,
                    'verbose': -1,
                }
    
                model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200,early_stopping_rounds=200)
                val_pred = model.predict(val_x, num_iteration=model.best_iteration)
                test_pred = model.predict(test_x, num_iteration=model.best_iteration)
                
                # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
                    
            if clf_name == "xgb":
                train_matrix = clf.DMatrix(trn_x , label=trn_y)
                valid_matrix = clf.DMatrix(val_x , label=val_y)
                
                params = {'booster': 'gbtree',
                          'objective': 'binary:logistic',
                          'eval_metric': 'auc',
                          'gamma': 1,
                          'min_child_weight': 1.5,
                          'max_depth': 5,
                          'lambda': 10,
                          'subsample': 0.7,
                          'colsample_bytree': 0.7,
                          'colsample_bylevel': 0.7,
                          'eta': 0.04,
                          'tree_method': 'exact',
                          'seed': 2020,
                          'nthread': 36,
                          "silent": True,
                          }
                
                watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
                
                model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
                val_pred  = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
                # xgboost's Booster.predict expects a DMatrix, so wrap the raw test frame
                test_pred = model.predict(clf.DMatrix(test_x), ntree_limit=model.best_ntree_limit)
                     
            if clf_name == "cat":
                params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                          'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
                
                model = clf(iterations=20000, **params)
                model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                          cat_features=[], use_best_model=True, verbose=500)
                
                val_pred  = model.predict(val_x)
                test_pred = model.predict(test_x)
                
            train[valid_index] = val_pred
            # accumulate the fold-averaged test predictions (+=, not =)
            test += test_pred / kf.n_splits
            cv_scores.append(roc_auc_score(val_y, val_pred))
            
            print(cv_scores)
            
        print("%s_scotrainre_list:" % clf_name, cv_scores)
        print("%s_score_mean:" % clf_name, np.mean(cv_scores))
        print("%s_score_std:" % clf_name, np.std(cv_scores))
        return train, test
    
    def lgb_model(x_train, y_train, x_test):
        lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
        return lgb_train, lgb_test
    
    def xgb_model(x_train, y_train, x_test):
        xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
        return xgb_train, xgb_test
    
    def cat_model(x_train, y_train, x_test):
        cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
        return cat_train, cat_test
    

    Fit the models

    lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
    cat_train, cat_test = cat_model(x_train, y_train, x_test)
    xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
    
    # scores (mean cross-validated AUC)
    #lgb_score_mean: 0.732
    #cat_score_mean: 0.732
    #xgb_score_mean: 0.736
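
    If a submission file is needed, a sketch like the following would work. Note that the id column was dropped from test_data earlier, so it would have to be set aside before that drop; sub_ids below stands for that assumed copy.

    # hedged sketch: write the LightGBM test predictions to a submission file
    # (sub_ids is an assumed copy of test_data['id'] saved before the drop above)
    submission = pd.DataFrame({'id': sub_ids, 'isDefault': lgb_test})
    submission.to_csv('lgb_submission.csv', index=False)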
    

    Comparing runtimes across the three models, lgb is the fastest and xgb is very slow, as expected. So even though xgb scores slightly better, the parameter-tuning work will proceed with lgb or cat.
