Transaction Data Anomaly Prediction

Author: bf3780a4db09 | Published 2019-02-18 16:27

Data structure:

(image: preview of the first few rows of the dataset)
Columns:
Time: duration of the transaction
V1 to V28: 28 anonymized features; feature extraction has already been done
Amount: the transaction amount, which varies widely and contains extreme values
Class: whether the transaction is anomalous; 1 means anomalous, 0 means normal (a binary classification target)
Note: the first 28 features appear to have been standardized already, so to avoid scale effects the transaction amount should be standardized as well.
Check the overall distribution of the two classes:
import pandas as pd
import matplotlib.pyplot as plt

# data is the transactions DataFrame (e.g. loaded with pd.read_csv beforehand)
count_classes = data['Class'].value_counts()
count_classes.plot.bar()
plt.xlabel('Class')
plt.ylabel('Frequency')

Output:

(image: bar chart of class frequencies)

There are very few anomalous samples, while normal samples make up the vast majority (around 280,000), so the class distribution is extremely imbalanced.
Dealing with the imbalanced class distribution
Goal: reduce the gap between the number of class-0 and class-1 samples
Method 1: undersampling. From the class with more samples (0), draw roughly as many samples as there are class-1 samples so that both classes are equally small, then rebuild the dataset from these 0 and 1 samples.
Method 2: oversampling. Generate additional samples for the class with fewer samples (1) using a sample-generation strategy, so that both classes are equally large.
Before that, the Amount feature needs to be standardized and the uninformative Time feature dropped.

from sklearn.preprocessing import StandardScaler
# reshape(-1, 1) fixes the number of columns to 1 and lets the number of rows be inferred automatically
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# drop() removes rows by default; axis=1 is needed to remove columns
data = data.drop(['Time', 'Amount'], axis=1)

Undersampling

import numpy as np

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']
number_records_fraud = sum(data['Class'] == 1)             # number of samples in the minority class
fraud_indices = np.array(data[data['Class'] == 1].index)   # indices of the anomalous samples
normal_indices = data[data['Class'] == 0].index             # indices of the normal samples
# randomly draw as many majority-class samples as there are minority-class samples
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
# replace controls sampling with replacement: False means no index can be drawn twice,
# True means drawn indices are put back and may repeat
random_normal_indices = np.array(random_normal_indices)
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])  # concatenate into one 1-D index array
under_sample_data = data.loc[under_sample_indices, :]  # the rebalanced dataset
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
print(len(under_sample_data[under_sample_data['Class'] == 0]) / len(under_sample_data),
      len(under_sample_data[under_sample_data['Class'] == 1]) / len(under_sample_data),
      len(under_sample_data))
# the two classes now have the same number of samples, and the total amount of data is much smaller

The total is now 984 samples, which is small compared with the original dataset.
Use cross-validation (to obtain a stable parameter) to find the logistic regression regularization parameter best_c with the highest recall score.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, recall_score

def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(n_splits=5, shuffle=False)  # cross-validation: split the training data into 5 folds

    c_param_range = [0.01, 0.1, 1, 10, 100]  # regularization strengths, to study their effect on the result
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range
    # each split yields two index arrays: the training indices and the validation indices
    j = 0
    for c_param in c_param_range:
        print('-----------------------')
        print('C parameter: {}'.format(c_param))
        recall_accs = []
        for iteration, (train_idx, val_idx) in enumerate(fold.split(x_train_data), start=1):  # cross-validation
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')  # logistic regression with an L1 penalty
            # within the training data: fit on 4 of the 5 folds, then validate on the remaining fold
            lr.fit(x_train_data.iloc[train_idx, :], y_train_data.astype('int').iloc[train_idx, :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[val_idx, :].values)
            recall_acc = recall_score(y_train_data.astype('int').iloc[val_idx, :].values.ravel(), y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration: ', iteration, 'train size: ', len(train_idx), 'validation size: ', len(val_idx), 'recall score= ', recall_acc)
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('Mean recall score: ', np.mean(recall_accs))
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax(), 'C_parameter']
    return best_c

Note: results_table['Mean recall score'].idxmax() raised "TypeError: reduction operation 'argmax' not allowed for this dtype".
The cause is that the 'Mean recall score' column has dtype object, so idxmax() cannot operate on it; the column has to be converted from object to float first.
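A minimal sketch of that fix in isolation (assuming results_table has already been filled in as above):

# cast the object-dtype column to float so that idxmax() can reduce it
results_table['Mean recall score'] = results_table['Mean recall score'].astype('float64')
best_c = results_table.loc[results_table['Mean recall score'].idxmax(), 'C_parameter']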
Split both the original data and the undersampled data into training and test sets.

from sklearn.model_selection import train_test_split
# full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)  # random_state works like a seed, so the same split is produced every time
print(len(X_train), len(X_test), len(X_train) + len(X_test))
# undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample, test_size=0.3, random_state=0)
print(len(X_train_undersample), len(X_test_undersample), len(X_train_undersample) + len(X_test_undersample))

The search gives best_c = 0.01. Fit a model on the undersampled training data and use it to predict both the undersampled test set and the test set of the original data.

lr_undersample = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr_undersample.fit(X_train_undersample, y_train_undersample.values.ravel())
# predict on the undersampled test set
y_pred_undersample = lr_undersample.predict(X_test_undersample)
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
TN = cnf_matrix[0,0]
FP = cnf_matrix[0,1]
FN = cnf_matrix[1,0]
TP = cnf_matrix[1,1]
recall_score_1 = TP/(TP+FN)
precision = TP/(TP+FP)
accuracy = (TN+TP)/(TN+TP+FP+FN)
print('recall score is {}, precision is {}, accuracy is {}.'.format(recall_score_1,precision,accuracy))
# predict on the test set of the original data
y_pred_allsample = lr_undersample.predict(X_test)
cnf_matrix_all = confusion_matrix(y_test,y_pred_allsample)
TN = cnf_matrix_all[0,0]
FP = cnf_matrix_all[0,1]
FN = cnf_matrix_all[1,0]
TP = cnf_matrix_all[1,1]
recall_score_all = TP/(TP+FN)
precision_all = TP/(TP+FP)
accuracy_all = (TN+TP)/(TN+TP+FP+FN)
print('recall score is {}, precision is {}, accuracy is {}.'.format(recall_score_all,precision_all,accuracy_all))
print('TN is {}, FP is {}, FN is {}, TP is {}.'.format(TN,FP,FN,TP))

Output:
Results on the undersampled test set

(image: confusion matrix and metrics on the undersampled test set)
All three metrics look good. Note: these numbers come from the small undersampled test set; the original dataset has far more samples than this test set.
Results on the test set of the original data
(image: confusion matrix and metrics on the full test set)
Recall and accuracy are fine, but precision is very low because the FP count is large: many normal transactions are misjudged as anomalous, so the model fitted on the undersampled data does not generalize particularly well.
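As an aside (not in the original post), the confusion matrix behind these numbers can also be drawn directly from cnf_matrix_all computed above, assuming scikit-learn 0.22 or newer:

from sklearn.metrics import ConfusionMatrixDisplay

# visualize the confusion matrix computed above for the full test set
ConfusionMatrixDisplay(confusion_matrix=cnf_matrix_all, display_labels=[0, 1]).plot()
plt.show()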
Oversampling: the SMOTE sample-generation strategy
The SMOTE algorithm:
Step 1: for each sample x in the minority class, compute its Euclidean distance (the square root of the sum of squared differences) to all other minority-class samples, and sort the distances in ascending order;
Step 2: set a sampling ratio according to the class imbalance to obtain the oversampling multiplier N, and for each sample x select the N nearest minority-class samples;
Step 3: for each selected neighbor x', generate a new sample as x_new = x + rand(0, 1) × (x' − x). (A sketch of applying SMOTE to the training split is shown below.)
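The write-up later refers to the approach of splitting first and then running SMOTE on the training portion only as the original method, but its code is not shown in this section. Here is a minimal sketch of how it might look; the variable names (X_train_os, best_c_os, lr_os) are mine, imblearn's SMOTE is assumed, and fit_resample replaces fit_sample used by older imblearn versions:

from imblearn.over_sampling import SMOTE

# oversample only the training portion of the original split, leaving the test set untouched
oversampler = SMOTE(random_state=0)
X_train_os, y_train_os = oversampler.fit_resample(X_train, y_train.values.ravel())

# search for the best C on the oversampled training data, then fit and evaluate on the untouched test set
best_c_os = printing_Kfold_scores(pd.DataFrame(X_train_os), pd.DataFrame(y_train_os))
lr_os = LogisticRegression(C=best_c_os, penalty='l1', solver='liblinear')
lr_os.fit(X_train_os, y_train_os)
print(confusion_matrix(y_test, lr_os.predict(X_test)))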

With the three metrics roughly comparable, the undersampling model produced 9183 false positives versus 2100 for the oversampling model, so the number of misclassified normal transactions drops substantially.
Overall, the method that keeps more data is preferable: the more data, the better the model tends to perform.
The overall workflow looks roughly like this:


(image: flowchart of the overall workflow)
One question: why does the undersampling approach sample first and then split into training and test sets, while the oversampling approach splits first and applies SMOTE only to the training set, instead of generating samples first, rebuilding the dataset, and then splitting? (The usual argument is that resampling should only see training data; if synthetic samples derived from training points end up in the test set, the evaluation becomes optimistic.)
I tried sampling first and then splitting (which gives a larger test set):
from imblearn.over_sampling import SMOTE

oversampler = SMOTE(random_state=0)
# fit_resample is called fit_sample in older versions of imblearn
features, labels = oversampler.fit_resample(X, y.values.ravel())
print(sum(labels == 1), sum(labels == 0))  # both classes now have 284315 samples
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=0)
best_c = printing_Kfold_scores(pd.DataFrame(features_train), pd.DataFrame(labels_train))
lr_oversample = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr_oversample.fit(features_train, labels_train)
y_pred_oversample = lr_oversample.predict(X_test)
cnf_oversample = confusion_matrix(y_test, y_pred_oversample)
TN = cnf_oversample[0,0]
FP = cnf_oversample[0,1]
FN = cnf_oversample[1,0]
TP = cnf_oversample[1,1]
recall_score_all = TP/(TP+FN)
precision_all = TP/(TP+FP)
accuracy_all = (TN+TP)/(TN+TP+FP+FN)
print('recall score is {}, precision is {}, accuracy is {}.'.format(recall_score_all, precision_all, accuracy_all))
print('TN is {}, FP is {}, FN is {}, TP is {}.'.format(TN, FP, FN, TP))
print('-----------------------------------------------------------')

Output:

(image: confusion-matrix metrics for the model trained on the oversampled data)

The results differ little from the original approach and are slightly better, so this ordering appears to work as well.
After all this, let's see what recall score comes out if the original (imbalanced) data is used directly.

best_c_all = printing_Kfold_scores(X_train,y_train)
best_c_all

On the original data the best regularization parameter is 10, but the corresponding recall score is only about 0.62, far worse than the other two approaches.
Finally, look at how the classification threshold of logistic regression affects the predictions. The default threshold is 0.5; now try each value in 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and compare the results (using the undersampled data). First compute the sigmoid output (the predicted probability), then compare it with the threshold: if the probability is greater than the threshold the prediction is 1, otherwise 0.

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_prob = lr.predict_proba(X_test_undersample.values)  # returns the probabilities 1-g(z) and g(z) for each sample
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for i in thresholds:
    # boolean mask: True (taken as 1) where the probability exceeds the threshold, False (0) otherwise
    y_test_predictions_high_recall = y_pred_undersample_prob[:, 1] > i
    cnf_matrix = confusion_matrix(y_test_undersample.values, y_test_predictions_high_recall)
    TN = cnf_matrix[0,0]
    FP = cnf_matrix[0,1]
    FN = cnf_matrix[1,0]
    TP = cnf_matrix[1,1]
    recall_score_all = TP/(TP+FN)
    precision_all = TP/(TP+FP)
    accuracy_all = (TN+TP)/(TN+TP+FP+FN)
    print('-----------------------------------------------------------')
    print('threshold is: ',i)
    print('recall score is {}, precision is {}, accuracy is {}.'.format(recall_score_all,precision_all,accuracy_all))
    print('TN is {}, FP is {}, FN is {}, TP is {}.'.format(TN,FP,FN,TP))
    print('-----------------------------------------------------------')

Output:

(images: confusion-matrix metrics for each threshold)

When the threshold is small there are many false positives, and almost every sample is predicted as the target class. From a threshold of 0.4 the FP count starts to fall, and raising the threshold further keeps reducing the false alarms; at 0.6 and above, FN (the bottom-left cell of the confusion matrix) starts to grow and recall drops. To keep all three metrics reasonable, a threshold around 0.5 or 0.6 works best, so the default value is indeed already quite effective.
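As a closing aside (not part of the original write-up), scikit-learn can sweep every candidate threshold at once with precision_recall_curve; a minimal sketch on the undersampled test set, reusing y_pred_undersample_prob from above:

from sklearn.metrics import precision_recall_curve

# precision and recall for every candidate threshold on the positive-class probability
precision, recall, pr_thresholds = precision_recall_curve(y_test_undersample.values.ravel(), y_pred_undersample_prob[:, 1])
plt.plot(pr_thresholds, precision[:-1], label='precision')
plt.plot(pr_thresholds, recall[:-1], label='recall')
plt.xlabel('threshold')
plt.legend()
plt.show()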
