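1. The impact of imbalanced samples
From the sample data we know that there are roughly 284,000 normal transactions (284,315 out of 284,807) and only 492 fraudulent ones. The gap between the two classes is very large, and it degrades the model's performance.
Here is an example: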
Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Main routine: choose C by K-fold cross-validation on the training set
def printing_Kfold_scores(x_train_data, y_train_data):
    # The first argument is the number of folds the training set is split into.
    # A single split can give an unrepresentative score; several folds allow cross-validation.
    fold = KFold(5, shuffle=False)

    # Candidate values for C, the inverse regularization strength (smaller C = stronger penalty).
    # Regularization penalizes large weights, which keeps the model stable and helps avoid overfitting.
    # Overfitting: the model does well on the training set but poorly on the test set.
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # K-fold cross-validation: each split yields two index arrays,
    # training set = indices[0], validation set = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('Regularization parameter C: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        # KFold.split generates the indices that divide the data into training and validation parts
        for iteration, indices in enumerate(fold.split(y_train_data), start=1):
            # Penalty term:
            # 'l2' adds 1/2 * sum(w^2) to the loss
            # 'l1' adds sum(|w|) to the loss; the default 'lbfgs' solver does not support it
            # (use solver='liblinear' or 'saga' for 'l1')
            lr = LogisticRegression(C=c_param, penalty='l2')

            # Fit on the training part of the fold: both X and y are indexed with indices[0]
            lr.fit(x_train_data[indices[0], :], y_train_data[indices[0], :].ravel())

            # Print the coefficients
            #print(lr.coef_)

            # Predict on the validation part of the fold (indices[1])
            y_pred_undersample = lr.predict(x_train_data[indices[1], :])

            # recall_score takes the true labels first, then the predictions
            recall_acc = recall_score(y_train_data[indices[1], :].ravel(), y_pred_undersample)

            # Keep each fold's score so that the mean can be computed afterwards
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # After all folds are done, store the mean recall for this C
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall: ', np.mean(recall_accs))
        print('')

    # Pick the C_parameter whose 'Mean recall score' is the largest
    r_index = results_table['Mean recall score'].astype(float).idxmax()
    best_c = results_table['C_parameter'].loc[r_index]

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best C parameter = ', best_c)
    print('*********************************************************************************')
    return best_c

data = pd.read_csv("E:/file/creditcard.csv")

# Standardize the amount to zero mean and unit variance.
# Features on a much larger scale can dominate the model; left unscaled,
# Amount would carry far more weight than V1-V28.
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# Drop the features that are not used for now
data = data.drop(['Time','Amount'],axis=1)
X = data.values[:, data.columns != 'Class']
y = data.values[:, data.columns == 'Class']

# Split into training and test sets.
# The test share is 0.3; adjust it to suit the actual situation.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

# Run the cross-validation on the training set
best_c = printing_Kfold_scores(X_train,y_train)

Test log:
E:\python\数据分析_new4\数据分析\Scripts\python.exe E:/python/数据分析/机器学习/回归/logic7.py
-------------------------------------------
Regularization parameter C: 0.01
-------------------------------------------
Iteration 1 : recall score = 0.5373134328358209
Iteration 2 : recall score = 0.6164383561643836
Iteration 3 : recall score = 0.6666666666666666
Iteration 4 : recall score = 0.6
Iteration 5 : recall score = 0.5
Mean recall: 0.5840836911333742
-------------------------------------------
Regularization parameter C: 0.1
-------------------------------------------
Iteration 1 : recall score = 0.5522388059701493
Iteration 2 : recall score = 0.6164383561643836
Iteration 3 : recall score = 0.7166666666666667
Iteration 4 : recall score = 0.6153846153846154
Iteration 5 : recall score = 0.5625
Mean recall: 0.612645688837163
-------------------------------------------
Regularization parameter C: 1
-------------------------------------------
Iteration 1 : recall score = 0.5522388059701493
Iteration 2 : recall score = 0.6164383561643836
Iteration 3 : recall score = 0.7333333333333333
Iteration 4 : recall score = 0.6153846153846154
Iteration 5 : recall score = 0.575
Mean recall: 0.6184790221704963
-------------------------------------------
Regularization parameter C: 10
-------------------------------------------
Iteration 1 : recall score = 0.5522388059701493
Iteration 2 : recall score = 0.6164383561643836
Iteration 3 : recall score = 0.7333333333333333
Iteration 4 : recall score = 0.6153846153846154
Iteration 5 : recall score = 0.575
Mean recall: 0.6184790221704963
-------------------------------------------
Regularization parameter C: 100
-------------------------------------------
Iteration 1 : recall score = 0.5522388059701493
Iteration 2 : recall score = 0.6164383561643836
Iteration 3 : recall score = 0.7333333333333333
Iteration 4 : recall score = 0.6153846153846154
Iteration 5 : recall score = 0.575
Mean recall: 0.6184790221704963
*********************************************************************************
Best C parameter = 1.0
*********************************************************************************
Process finished with exit code 0
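Conclusion:
As we can see, because the normal samples vastly outnumber the fraudulent ones, even the best model reaches a mean recall of only about 0.618, far below what we are aiming for.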
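To see why recall rather than accuracy is the number to watch here, it may help to work through a hypothetical classifier that labels every transaction as normal. The sketch below uses only the two counts quoted in this article (284,807 transactions, 492 of them fraudulent); the "always normal" classifier itself is an assumption made up for illustration.

```python
# Counts from the dataset described in this article.
total = 284807
fraud = 492
normal = total - fraud

# Hypothetical "always normal" classifier: it never flags a transaction as fraud.
true_positives = 0         # fraud correctly flagged
false_negatives = fraud    # fraud that was missed
true_negatives = normal    # normal transactions correctly passed

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

Such a classifier would be right on more than 99.8% of the transactions while catching no fraud at all, which is why the imbalance has to be dealt with before the scores mean much.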
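2. Methods for handling imbalanced samples
2.1 Weighting
- Class weights (class weight)
  The weight is attached to the class: a class with many samples gets a lower weight, and a class with few samples gets a higher weight.
- Sample weights (sample weight)
  The weight is attached to each sample: every sample of a class with many samples gets a lower weight, and every sample of a class with few samples gets a higher weight.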
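The two weighting schemes above can be sketched in plain Python. The formula used here, n_samples / (n_classes * class_count), is the heuristic scikit-learn applies for `class_weight='balanced'`; the helper names and the toy labels are assumptions made up for this illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def sample_weights(labels, class_weights):
    """Expand the class weights into one weight per sample."""
    return [class_weights[label] for label in labels]

# Toy labels (made up): 8 normal (0) and 2 fraud (1) samples.
labels = [0] * 8 + [1] * 2
class_w = balanced_class_weights(labels)
sample_w = sample_weights(labels, class_w)

print(class_w)
print(sample_w[:2], sample_w[-2:])
```

In scikit-learn the same idea is usually expressed by passing `class_weight` to the estimator (for example `LogisticRegression(class_weight='balanced')`) or by passing `sample_weight` to `fit`, so a hand-written helper like this is rarely needed in practice.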
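2.2 Sampling
- Upsampling (oversampling)
  Oversample the minority class until it is roughly the same size as the majority class.
- Downsampling (undersampling)
  Subsample the majority class until it is roughly the same size as the minority class.
- Synthetic samples (SMOTE)
  Meant to reduce the distortion of the sample distribution that plain over- or undersampling causes; new minority samples are synthesized instead of being copied.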
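The difference between plain oversampling and SMOTE-style synthesis can be sketched as follows. This is a simplified illustration in plain Python, not the imbalanced-learn implementation: the function names, the toy points, and the choice of k=2 neighbours are assumptions. The core idea it shows is that a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random

def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority samples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra

def smote_like(minority, n_new, rng, k=2):
    """Create n_new synthetic points between a sample and one of its k nearest neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the point itself excluded)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

rng = random.Random(0)
# Toy minority samples with two features each (made up).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]

oversampled = random_oversample(minority, 10, rng)
synthetic = smote_like(minority, 6, rng)
print(len(oversampled), len(minority) + len(synthetic))
```

Plain oversampling only repeats existing points, whereas the interpolated points are new values that stay inside the region already covered by the minority class.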
3. Examples
3.1 Undersampling
Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load and preprocess the dataset
data = pd.read_csv("E:/file/creditcard.csv")
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Number of fraudulent transactions and their index labels
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Index labels of the normal transactions
normal_indices = data[data.Class == 0].index

# Randomly draw as many normal samples as there are fraud samples (without replacement)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Combine the index labels of the fraud samples and the sampled normal samples
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Select the undersampled rows; these are index labels, so use .loc rather than .iloc
under_sample_data = data.loc[under_sample_indices,:]
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Print the class proportions after undersampling
print('Share of normal samples:', len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print('Share of fraud samples:', len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print('Total samples after undersampling:', len(under_sample_data))

# Split the full dataset: X holds the features, y the labels, test_size is the test share,
# and random_state is the random seed, which makes the split identical on every run
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
print('Samples in the original training set:', len(X_train))
print('Samples in the original test set:', len(X_test))
print('Total original samples:', len(X_train) + len(X_test))

# Split the undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size = 0.3, random_state = 0)
print("")
print('Samples in the undersampled training set:', len(X_train_undersample))
print('Samples in the undersampled test set:', len(X_test_undersample))
print('Total undersampled samples:', len(X_train_undersample) + len(X_test_undersample))

Test log:
Share of normal samples: 0.5
Share of fraud samples: 0.5
Total samples after undersampling: 984
Samples in the original training set: 199364
Samples in the original test set: 85443
Total original samples: 284807
Samples in the undersampled training set: 688
Samples in the undersampled test set: 296
Total undersampled samples: 984
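The random draw in the listing above is not seeded, so a different set of normal transactions is picked on every run (only the counts stay the same). If a reproducible subset is wanted, the index selection could be sketched like this in plain Python; the function name, the `seed` parameter, and the toy labels are assumptions for illustration, and label 1 is assumed to be the minority class.

```python
import random

def undersample_indices(labels, seed=0):
    """Return the indices of all fraud samples plus an equal-sized random draw of normal ones."""
    rng = random.Random(seed)
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    normal_idx = [i for i, y in enumerate(labels) if y == 0]
    # Sample without replacement, like np.random.choice(..., replace=False)
    sampled_normal = rng.sample(normal_idx, len(fraud_idx))
    return fraud_idx + sampled_normal

# Toy labels (made up): 95 normal and 5 fraud transactions.
labels = [0] * 95 + [1] * 5
first = undersample_indices(labels, seed=0)
second = undersample_indices(labels, seed=0)

print(len(first), first == second)
```

With NumPy the same effect can be had by creating a seeded generator with `np.random.default_rng(seed)` and calling its `choice` method instead of the global `np.random.choice`.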
3.2 SMOTE
Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Load and preprocess the dataset
data = pd.read_csv("E:/file/creditcard.csv")
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
credit_cards = data.drop(['Time','Amount'],axis=1)

# Separate features and labels by column name.
# (After the drop, 'Class' is no longer the last column, so deleting the last column
# by position would leave the label inside the features.)
features = credit_cards.drop('Class', axis=1)
labels = credit_cards['Class']

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# Oversample the training portion only
oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)

# Print the class proportions after SMOTE
print('Share of normal samples:', (os_labels == 0).sum() / len(os_labels))
print('Share of fraud samples:', (os_labels == 1).sum() / len(os_labels))
print('Total samples after SMOTE:', len(os_features))

# Split the full dataset: test_size is the test share, and random_state is the random seed,
# which makes the split identical on every run
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.3, random_state = 0)
print('Samples in the original training set:', len(X_train))
print('Samples in the original test set:', len(X_test))
print('Total original samples:', len(X_train) + len(X_test))

# Split the SMOTE-resampled dataset.
# Both parts of this split contain synthetic samples; features_test / labels_test
# remain the untouched hold-out data.
X_train_smote_sample, X_test_smote_sample, y_train_smote_sample, y_test_smote_sample = train_test_split(
    os_features, os_labels, test_size = 0.3, random_state = 0)
print("")
print('Samples in the SMOTE training set:', len(X_train_smote_sample))
print('Samples in the SMOTE test set:', len(X_test_smote_sample))
print('Total SMOTE samples:', len(X_train_smote_sample) + len(X_test_smote_sample))

Test log:
Share of normal samples: 0.5
Share of fraud samples: 0.5
Total samples after SMOTE: 454908
Samples in the original training set: 199364
Samples in the original test set: 85443
Total original samples: 284807
Samples in the SMOTE training set: 318435
Samples in the SMOTE test set: 136473
Total SMOTE samples: 454908
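The sample counts in the logs of 3.1 and 3.2 can be cross-checked with a little arithmetic. The sketch below assumes that `train_test_split` with a fractional `test_size` rounds the test part up and gives the remainder to the training part, and that SMOTE with default settings grows the minority class until it matches the majority class (which agrees with the 0.5 / 0.5 shares in the log). The helper `split_sizes` is hypothetical, written only for this check.

```python
import math

def split_sizes(n_samples, test_size):
    """Train/test sizes for a fractional test_size: test part rounded up, rest goes to train."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 284807        # total transactions, from the logs
smote_total = 454908    # total samples after SMOTE, from the 3.2 log

print(split_sizes(n_total, 0.3))      # original 70/30 split
print(split_sizes(984, 0.3))          # undersampled data (3.1)
print(split_sizes(smote_total, 0.3))  # SMOTE-resampled data (3.2)

# SMOTE was fitted on the 80% training portion only.
train_80, _ = split_sizes(n_total, 0.2)
majority_in_train = smote_total // 2           # both classes are equal in size after SMOTE
fraud_in_train = train_80 - majority_in_train  # real fraud samples SMOTE started from
print(train_80, majority_in_train, fraud_in_train)
```

Under these assumptions, the 227,454 fraud samples in the resampled data were grown from only a few hundred real ones, so most of the "fraud" rows in the SMOTE training and test splits are synthetic.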