【阿旭机器学习实战】【24】Credit Card Customer Churn Prediction in Practice

Author: 阿旭123 | Published 2022-11-27 08:30

    The 【阿旭机器学习实战】 series introduces a range of machine-learning algorithms and models together with hands-on case studies. Likes and follows are welcome; let's learn together.

    This article works with an anonymized real credit-card dataset from a foreign bank and builds a model to predict whether a customer has churned, covering feature processing as well as building and evaluating a classification model.


    Problem Description

    Based on an anonymized real dataset from a foreign bank, build a model to predict whether a given customer has already churned.

    1. Read the Data and Separate Features from Labels

    import pandas as pd
    import numpy as np
    
    # Read the training and test sets
    train_data = pd.read_csv('./Churn-Modelling.csv')
    test_data = pd.read_csv('./Churn-Modelling-Test-Data.csv')
    
    x_train = train_data.iloc[:,:-1]
    y_train = train_data.iloc[:,-1].astype(int)
    x_test = test_data.iloc[:,:-1]
    y_test = test_data.iloc[:,-1].astype(int)
    
    x_train.head()
    
    RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
    0 1 15634602 Hargrave 619 France Female 42 2 0.00 1 1 1 101348.88
    1 2 15647311 Hill 608 Spain Female 41 1 83807.86 1 0 1 112542.58
    2 3 15619304 Onio 502 France Female 42 8 159660.80 3 1 0 113931.57
    3 4 15701354 Boni 699 France Female 39 1 0.00 2 0 0 93826.63
    4 5 15737888 Mitchell 850 Spain Female 43 2 125510.82 1 1 1 79084.10

    Column descriptions:
    RowNumber: row number
    CustomerId: customer ID
    Surname: customer surname
    CreditScore: credit score
    Geography: the customer's country/region
    Gender: the customer's gender
    Age: age
    Tenure: number of years the customer has been with the bank
    Balance: account balance
    NumOfProducts: number of bank products the customer uses
    HasCrCard: whether the customer holds a credit card with this bank
    IsActiveMember: whether the customer is an active member
    EstimatedSalary: estimated salary
    Exited: whether the customer has churned; this is our label (see the quick class-distribution check right after this list)
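
    Churn datasets are often imbalanced, so it is worth checking the label distribution before modeling. The following is a small sketch (assuming y_train was loaded as above); its output is not reproduced here.

    # Count how many customers churned (1) vs. stayed (0) in the training set
    print(y_train.value_counts())
    print(y_train.value_counts(normalize=True))  # same counts as proportions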

    2. Feature Engineering

    2.1 Drop Useless Features

    # Drop the first three columns (RowNumber, CustomerId, Surname); they carry no predictive information
    x_train = x_train.drop(labels=x_train.columns[[0,1,2]], axis=1)
    x_test = x_test.drop(labels=x_test.columns[[0,1,2]], axis=1)
    
    x_train.head()
    
    CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
    0 619 France Female 42 2 0.00 1 1 1 101348.88
    1 608 Spain Female 41 1 83807.86 1 0 1 112542.58
    2 502 France Female 42 8 159660.80 3 1 0 113931.57
    3 699 France Female 39 1 0.00 2 0 0 93826.63
    4 850 Spain Female 43 2 125510.82 1 1 1 79084.10
    y_train[:5]
    
    0    1
    1    0
    2    1
    3    0
    4    0
    Name: Exited, dtype: int32
    

    2.2 Encode the String Features

    # Geography and Gender are non-numeric columns; use LabelEncoder to convert them to integer codes
    from sklearn.preprocessing import LabelEncoder
    Lb1 = LabelEncoder()
    x_train.iloc[:,1] = Lb1.fit_transform(x_train.iloc[:,1])
    x_test.iloc[:,1] = Lb1.transform(x_test.iloc[:,1])
    Lb2 = LabelEncoder()
    x_train.iloc[:,2] = Lb2.fit_transform(x_train.iloc[:,2])
    x_test.iloc[:,2] = Lb2.transform(x_test.iloc[:,2])
    
    x_train[:5]
    
    CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
    0 619 0 0 42 2 0.00 1 1 1 101348.88
    1 608 2 0 41 1 83807.86 1 0 1 112542.58
    2 502 0 0 42 8 159660.80 3 1 0 113931.57
    3 699 0 0 39 1 0.00 2 0 0 93826.63
    4 850 2 0 43 2 125510.82 1 1 1 79084.10
    x_train.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10000 entries, 0 to 9999
    Data columns (total 10 columns):
    CreditScore        10000 non-null int64
    Geography          10000 non-null int64
    Gender             10000 non-null int64
    Age                10000 non-null int64
    Tenure             10000 non-null int64
    Balance            10000 non-null float64
    NumOfProducts      10000 non-null int64
    HasCrCard          10000 non-null int64
    IsActiveMember     10000 non-null int64
    EstimatedSalary    10000 non-null float64
    dtypes: float64(2), int64(8)
    memory usage: 781.3 KB
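
    Note that LabelEncoder imposes an arbitrary integer ordering on Geography (e.g. France < Germany < Spain), which a linear model can misread as a real ranking. As an optional alternative, here is a minimal one-hot-encoding sketch using pd.get_dummies; the names x_train_oh and x_test_oh are illustrative only, and the rest of this article keeps the label-encoded features.

    # Optional alternative to LabelEncoder: one-hot encode the string columns so that
    # no ordering is implied between countries. Applied to the original (unencoded) columns.
    x_train_oh = pd.get_dummies(train_data.iloc[:, 3:-1], columns=['Geography', 'Gender'], drop_first=True)
    x_test_oh = pd.get_dummies(test_data.iloc[:, 3:-1], columns=['Geography', 'Gender'], drop_first=True)
    # Align the test columns with the training columns in case a category is missing in the test set
    x_test_oh = x_test_oh.reindex(columns=x_train_oh.columns, fill_value=0)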
    

    2.3 Standardize the Features

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)
    
    x_train[:5]
    
    array([[-0.32622142, -0.90188624, -1.09598752,  0.29351742, -1.04175968,
            -1.22584767, -0.91158349,  0.64609167,  0.97024255,  0.02188649],
           [-0.44003595,  1.51506738, -1.09598752,  0.19816383, -1.38753759,
             0.11735002, -0.91158349, -1.54776799,  0.97024255,  0.21653375],
           [-1.53679418, -0.90188624, -1.09598752,  0.29351742,  1.03290776,
             1.33305335,  2.52705662,  0.64609167, -1.03067011,  0.2406869 ],
           [ 0.50152063, -0.90188624, -1.09598752,  0.00745665, -1.38753759,
            -1.22584767,  0.80773656, -1.54776799, -1.03067011, -0.10891792],
           [ 2.06388377,  1.51506738, -1.09598752,  0.38887101, -1.04175968,
             0.7857279 , -0.91158349,  0.64609167,  0.97024255, -0.36527578]])
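
    StandardScaler rescales each column to z = (x - mean) / std, using statistics computed on the training set only and reusing them for the test set. A quick way to inspect those statistics (a small check using the sc object fitted above):

    print(sc.mean_)   # per-column means learned from the training set
    print(sc.scale_)  # per-column standard deviations learned from the training set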
    

    3. Modeling, Prediction and Evaluation

    # Build the model with logistic regression
    from sklearn.linear_model import LogisticRegression
    
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    lr_y_predict = lr.predict(x_test)
    
    # Use the score method built into LogisticRegression to get the accuracy on the test and training sets
    print('LogisticRegression test accuracy:', lr.score(x_test, y_test))
    print('LogisticRegression training accuracy:', lr.score(x_train, y_train))
    
    LogisticRegression test accuracy: 0.761
    LogisticRegression training accuracy: 0.809
    
    from sklearn.metrics import classification_report
    # Use classification_report to get precision, recall and F1 for the LogisticRegression model
    # (class 0 is a retained customer, class 1 is a churned customer)
    print(classification_report(y_test, lr_y_predict, target_names=['Retained', 'Exited']))
    
                 precision    recall  f1-score   support
    
       Retained       0.77      0.97      0.86       740
         Exited       0.68      0.15      0.25       260
    
    avg / total       0.74      0.76      0.70      1000
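
    A recall of only 0.15 on the churned class means most churners are missed. A confusion matrix makes this explicit; the following is a small sketch using sklearn.metrics.confusion_matrix (its output is not shown here):

    from sklearn.metrics import confusion_matrix
    # Rows are the true classes (0 = retained, 1 = exited), columns are the predicted classes
    print(confusion_matrix(y_test, lr_y_predict))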
    

    The results show that the model reaches only 76% accuracy on the test set, and recall on the churned class is low, so there is still clear room for improvement; one possible direction is sketched below.
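
    As a possible next step (a sketch that is not part of the original walkthrough), re-weighting the minority class or switching to a tree-based model often helps on this kind of imbalanced churn data:

    # Illustrative follow-ups, not tuned:
    # 1) re-weight the churned class inside logistic regression
    lr_balanced = LogisticRegression(class_weight='balanced')
    lr_balanced.fit(x_train, y_train)
    print('Balanced LogisticRegression test accuracy:', lr_balanced.score(x_test, y_test))
    
    # 2) a non-linear model such as a random forest
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(x_train, y_train)
    print('RandomForest test accuracy:', rf.score(x_test, y_test))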

    If this content helped you, a like and a follow would be much appreciated!

    More practical content is on the way…
