数据集来源 :kaggleLending Club Loan Data
数据集包含了2007-2015年Lending Club平台的贷款数据,所以数据量非常庞大
大概有90W行*74列
本来想用EXCEL进行分析,但是看到连打开都需要1分钟之后我放弃了这样的想法
本次主要使用pandas进行数据清理
思路为:
- 读取数据, 分别查看object 和 float64 的缺失比例
- 缺失值比例达到25%以上的列直接剔除,低于25%的列视其意义进行空值填充
- 处理后含有缺失值的行剔除
#导入相关库
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#读取数据
loanData = pd.read_csv('../input/loan.csv')
loanData.shape
查看数据集原始大小:
out:(887379, 74)
loan_status实际就也是我们的建模目标,据此查看贷款情况:
loanData.loan_status.value_counts()
Current 601779
Fully Paid 207723
Charged Off 45248
Late (31-120 days) 11591
Issued 8460
In Grace Period 6253
Late (16-30 days) 2357
Does not meet the credit policy. Status:Fully Paid 1988
Default 1219
Does not meet the credit policy. Status:Charged Off 761
Name: loan_status, dtype: int64
明显可以看到正负样本数量差距悬殊,估计 9:1,对于后面建模而言需要考虑这个问题
#查看Object类型数据的缺失情况
loanData.select_dtypes(include=['O']).describe()\
.T.assign(missing = loanData.apply(lambda x : (1 - x.count() /len(x)))).sort_values(by = 'missing',ascending = False)
1_output217.png
#查看float64类型数据的缺失情况
loanData.select_dtypes(include=['float64']).describe()\
.T.assign(missing = loanData.apply(lambda x : (1 - x.count() /len(x)))).sort_values(by = 'missing',ascending = False)
2_outpu_218.png
#缺失值处理1,这里按照25%的阈值处理,缺失超过25%的列直接删除
loanData.dropna(
axis=1, thresh=int(0.75 * len(loanData)),
inplace=True)
#查看处理效果
loanData.describe().T.assign(
missing_pct=loanData.apply(
lambda x: (1 - x.count() / len(x)))).sort_values(
by='missing_pct', ascending=False)
3_output_220.png
#缺失值处理2
#针对数据集缺失值列的意义进行填充或其他出来
#total_rev_hi_lim:总周转高信用/信用额度。以0填充
#tot_cur_bal:所有帐户的当前总余额。 以0填充
#tot_coll_amt:欠款总额。以0填充
#revol_util:循环利用率,或借款人相对于所有可用循环信贷使用的信贷额度。以均值填充
#collections_12_mths_ex_med:除医疗账单外12个月的欠款数量。以0填充
#inq_last_6mths:过去6个月的征信查询数目(不包括汽车及按揭查询)。以0填充
#acc_now_delinq: 借款人现在欠款的账户数量。以0填充
#pub_rec: 贬损公共记录的数量。以0填充
#open_acc;借款人信用档案中的未结信用额度。以均值填充
#total_acc:借款人信用档案中当前信用额度的总数。以均值填充
#delinq_2yrs:过去两年借款人信用档案中逾期30天以上的拖欠次数。以0填充
#annual_inc:借款人在注册期间自行报告的年收入。以均值填充。
cols_fill_with_zero = [
'total_rev_hi_lim', 'tot_cur_bal', 'tot_coll_amt',
'collections_12_mths_ex_med', 'inq_last_6mths', 'acc_now_delinq',
'pub_rec', 'delinq_2yrs'
]
cols_fill_with_mean = ['revol_util', 'open_acc', 'total_acc', 'annual_inc']
for col in cols_fill_with_zero:
loanData[col] = loanData[col].fillna(0)
for col in cols_fill_with_mean:
mean_of_col = loanData[col].mean()
loanData[col] = loanData[col].fillna(mean_of_col)
loanData.dropna(axis=0, inplace=True) #剔除有空值的行
#loanData.isnull().sum().sort_values(ascending = False)
loanData.shape
查看处理后的数据集大小
out (819022, 52)
看起来效果还可以
#查看正负样本情况
loanData.loan_status.value_counts()
Current 558269
Fully Paid 197119
Charged Off 41288
Late (31-120 days) 10683
In Grace Period 5778
Late (16-30 days) 2155
Does not meet the credit policy. Status:Fully Paid 1862
Default 1131
Does not meet the credit policy. Status:Charged Off 697
Issued 40
Name: loan_status, dtype: int64
可以看到正负样本差距依然悬殊
数据清理到这里就结束了
网友评论