Using cardholders' credit card transaction behavior data, we build a fraud-detection (risk control) model. When a new transaction occurs, the model judges whether it is normal behavior by the cardholder or someone fraudulently using the card.
Because the dataset has been transformed with PCA, the sensitive details of the original records are hidden while most of the information in the original data is preserved.
Deep neural networks are hard to interpret, and since the features are PCA components the model can easily overfit. One remedy is to add the weight values to the loss function as a penalty, which discourages large weights and reduces overfitting.
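Concretely, the regularized objective used later in this post adds the squared L2 norm of each weight matrix W_k to the cross-entropy loss, scaled by a small coefficient (0.001 in the code below); note that tf.nn.l2_loss already includes the factor 1/2:

L_total = L_cross_entropy + lambda * sum_k ||W_k||^2 / 2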
Inspecting the data
The dataset can be downloaded from the Kaggle website (login required).
The dataset contains 284,807 transactions made by European cardholders over two days in September 2013, of which 492 are frauds (0.172%). PCA maps the features to the numeric attributes V1, V2, ..., V28; only the transaction time (Time) and amount (Amount) are not PCA-transformed. The target Class is binary: 1 for a fraudulent transaction and 0 for a normal one.
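The column overview and summary statistics below were produced with pandas; a minimal inspection sketch (assuming the downloaded file is saved as creditcard.csv in the working directory):

import pandas as pd

# Load the Kaggle credit card fraud dataset and print basic summaries.
df = pd.read_csv("creditcard.csv")
df.info()               # column types and non-null counts (first block below)
print(df.describe().T)  # per-column statistics, transposed (second block below)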
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time 284807 non-null float64
V1 284807 non-null float64
V2 284807 non-null float64
V3 284807 non-null float64
V4 284807 non-null float64
V5 284807 non-null float64
V6 284807 non-null float64
V7 284807 non-null float64
V8 284807 non-null float64
V9 284807 non-null float64
V10 284807 non-null float64
V11 284807 non-null float64
V12 284807 non-null float64
V13 284807 non-null float64
V14 284807 non-null float64
V15 284807 non-null float64
V16 284807 non-null float64
V17 284807 non-null float64
V18 284807 non-null float64
V19 284807 non-null float64
V20 284807 non-null float64
V21 284807 non-null float64
V22 284807 non-null float64
V23 284807 non-null float64
V24 284807 non-null float64
V25 284807 non-null float64
V26 284807 non-null float64
V27 284807 non-null float64
V28 284807 non-null float64
Amount 284807 non-null float64
Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
        count      mean           std           min           25%           50%           75%           max
Time    284807.0   9.481386e+04   47488.145955   0.000000     54201.500000  84692.000000  139320.500000  172792.000000
V1      284807.0   3.919560e-15   1.958696      -56.407510    -0.920373     0.018109      1.315642      2.454930
V2      284807.0   5.688174e-16   1.651309      -72.715728    -0.598550     0.065486      0.803724      22.057729
V3      284807.0  -8.769071e-15   1.516255      -48.325589    -0.890365     0.179846      1.027196      9.382558
V4      284807.0   2.782312e-15   1.415869      -5.683171     -0.848640     -0.019847     0.743341      16.875344
V5      284807.0  -1.552563e-15   1.380247      -113.743307   -0.691597     -0.054336     0.611926      34.801666
V6      284807.0   2.010663e-15   1.332271      -26.160506    -0.768296     -0.274187     0.398565      73.301626
V7      284807.0  -1.694249e-15   1.237094      -43.557242    -0.554076     0.040103      0.570436      120.589494
V8      284807.0  -1.927028e-16   1.194353      -73.216718    -0.208630     0.022358      0.327346      20.007208
V9      284807.0  -3.137024e-15   1.098632      -13.434066    -0.643098     -0.051429     0.597139      15.594995
V10     284807.0   1.768627e-15   1.088850      -24.588262    -0.535426     -0.092917     0.453923      23.745136
V11     284807.0   9.170318e-16   1.020713      -4.797473     -0.762494     -0.032757     0.739593      12.018913
V12     284807.0  -1.810658e-15   0.999201      -18.683715    -0.405571     0.140033      0.618238      7.848392
V13     284807.0   1.693438e-15   0.995274      -5.791881     -0.648539     -0.013568     0.662505      7.126883
V14     284807.0   1.479045e-15   0.958596      -19.214325    -0.425574     0.050601      0.493150      10.526766
V15     284807.0   3.482336e-15   0.915316      -4.498945     -0.582884     0.048072      0.648821      8.877742
V16     284807.0   1.392007e-15   0.876253      -14.129855    -0.468037     0.066413      0.523296      17.315112
V17     284807.0  -7.528491e-16   0.849337      -25.162799    -0.483748     -0.065676     0.399675      9.253526
V18     284807.0   4.328772e-16   0.838176      -9.498746     -0.498850     -0.003636     0.500807      5.041069
V19     284807.0   9.049732e-16   0.814041      -7.213527     -0.456299     0.003735      0.458949      5.591971
V20     284807.0   5.085503e-16   0.770925      -54.497720    -0.211721     -0.062481     0.133041      39.420904
V21     284807.0   1.537294e-16   0.734524      -34.830382    -0.228395     -0.029450     0.186377      27.202839
V22     284807.0   7.959909e-16   0.725702      -10.933144    -0.542350     0.006782      0.528554      10.503090
V23     284807.0   5.367590e-16   0.624460      -44.807735    -0.161846     -0.011193     0.147642      22.528412
V24     284807.0   4.458112e-15   0.605647      -2.836627     -0.354586     0.040976      0.439527      4.584549
V25     284807.0   1.453003e-15   0.521278      -10.295397    -0.317145     0.016594      0.350716      7.519589
V26     284807.0   1.699104e-15   0.482227      -2.604551     -0.326984     -0.052139     0.240952      3.517346
V27     284807.0  -3.660161e-16   0.403632      -22.565679    -0.070840     0.001342      0.091045      31.612198
V28     284807.0  -1.206049e-16   0.330083      -15.430084    -0.052960     0.011244      0.078280      33.847808
Amount  284807.0   8.834962e+01   250.120109    0.000000      5.600000      22.000000     77.165000     25691.160000
Class   284807.0   1.727486e-03   0.041527      0.000000      0.000000      0.000000      0.000000      1.000000
Since the features were mostly produced by PCA, little additional preprocessing is needed.
Modeling: a deep learning model
One key property of this dataset is that positive (fraud) labels are very rare, so positive and negative samples should be balanced when forming each training batch.
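A quick check of the Class column confirms the imbalance (a minimal sketch, reusing the DataFrame df loaded in the inspection snippet above):

# 0 = normal, 1 = fraud; roughly 284315 vs. 492 (about 0.17% positives).
print(df['Class'].value_counts())

The full training script follows.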
import numpy as np
import pandas as pd

np.set_printoptions(suppress=True)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Load the dataset and drop the Time column; the remaining 29 features are V1-V28 and Amount.
data_df = pd.read_csv("/Users/wangsen/ai/13/homework/creditcard.csv")
# data_df.info()
# print(data_df.describe().T)
# print(data_df.Time.head(100))
data_df = data_df.drop("Time", axis=1)

# Split by class: Class == 0 is a normal transaction (negative), Class == 1 is fraud (positive).
neg_df = data_df[data_df.Class == 0]
pos_df = data_df[data_df.Class == 1]
# print(neg_df.head())
# print(pos_df.head())
neg_data = neg_df.drop('Class', axis=1).values
pos_data = pos_df.drop('Class', axis=1).values
print("neg_data shape:", neg_data.shape)
print("pos_data shape:", pos_data.shape)

import tensorflow as tf

# Fully connected network: 29 inputs -> 16 -> 256 -> 256 -> 256 -> 256 -> 2 logits.
X = tf.placeholder(dtype=tf.float32, shape=[None, 29])
label = tf.placeholder(dtype=tf.float32, shape=[None, 2])
net = tf.layers.dense(X, 16, tf.nn.relu)
net = tf.layers.dense(net, 256, tf.nn.relu)
net = tf.layers.dense(net, 256, tf.nn.relu)
net = tf.layers.dense(net, 256, tf.nn.relu)
net = tf.layers.dense(net, 256, tf.nn.relu)
y = tf.layers.dense(net, 2, None)
# y = tf.nn.softmax(y)

# softmax_cross_entropy applies softmax internally, so y is kept as raw logits.
loss = tf.losses.softmax_cross_entropy(label, y)
# loss = tf.reduce_mean(tf.square(label - y))
# loss = tf.reduce_sum(-label * tf.log(y))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(label, 1)), tf.float32))
train_step = tf.train.AdamOptimizer(0.0001).minimize(loss)
# train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

neg_high = neg_data.shape[0]
pos_high = pos_data.shape[0]

# Every training batch is balanced: 32 negatives followed by 32 positives (one-hot labels).
input_y = np.zeros([64, 2])
input_y[:32, 0] = 1
input_y[32:, 1] = 1

# Balanced test set: 450 negative samples and 450 positive samples.
test_x = np.concatenate([neg_data[10000:10000 + 450], pos_data[0:450]])
test_y = np.zeros([900, 2])
test_y[:450, 0] = 1
test_y[450:, 1] = 1

sess = tf.Session()
sess.run(tf.global_variables_initializer())

l = []  # loss history
s = []  # accuracy history
for itr in range(10000):
    # Resample a balanced batch: 32 random negatives and 32 random positives.
    neg_ind = np.random.randint(0, neg_high, 32)
    pos_ind = np.random.randint(0, pos_high, 32)
    input_x = np.concatenate([neg_data[neg_ind], pos_data[pos_ind]])
    _, loss_var = sess.run((train_step, loss), feed_dict={X: input_x, label: input_y})
    if itr % 100 == 0:
        accuracy_var = sess.run(accuracy, feed_dict={X: test_x, label: test_y})
        print("iter:%d accuracy:%f loss:%f" % (itr, accuracy_var, loss_var))
        s.append(accuracy_var)
        l.append(loss_var)

import matplotlib.pyplot as plt
plt.plot(l, color="red")    # loss curve
plt.plot(s, color="green")  # accuracy curve
plt.show()
'''
neg_data shape: (284315, 29)
pos_data shape: (492, 29)
'''
iter:9900 accuracy:0.988889 loss:0.062907
Loss (red) and accuracy (green) curves during training
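Once trained, the model can score a new transaction, as described at the start of this post. A minimal sketch (here the first test row stands in for a real incoming transaction; in practice it would be the 29 PCA/Amount features of the new event):

# Hypothetical inference on one new transaction (29 features: V1-V28 and Amount).
new_sample = test_x[0:1]
pred = sess.run(tf.argmax(y, 1), feed_dict={X: new_sample})
print("fraud" if pred[0] == 1 else "normal")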
To prevent overfitting, add an L2 penalty on the weights:
# Collect the L2 norm of every weight matrix; variables named "kernel" are the dense-layer
# weights, biases are left unpenalized. Note tf.nn.l2_loss(W) = sum(W**2) / 2.
loss_w = [tf.nn.l2_loss(var) for var in tf.trainable_variables() if "kernel" in var.name]
print("variables:", tf.trainable_variables())
weights_norm = tf.reduce_sum(loss_w)
# New objective: cross-entropy plus a small L2 penalty on the weights.
loss = tf.losses.softmax_cross_entropy(label, y) + 0.001 * weights_norm
variables: [<tf.Variable 'dense/kernel:0' shape=(29, 16) dtype=float32_ref>, <tf.Variable 'dense/bias:0' shape=(16,) dtype=float32_ref>, <tf.Variable 'dense_1/kernel:0' shape=(16, 256) dtype=float32_ref>, <tf.Variable 'dense_1/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_2/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_2/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_3/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_3/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_4/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_4/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_5/kernel:0' shape=(256, 2) dtype=float32_ref>, <tf.Variable 'dense_5/bias:0' shape=(2,) dtype=float32_ref>]
iter:4900 accuracy:0.972222 loss:0.291221 weight:197.521942
With the L2 penalty term added
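The log line above also reports the total weight norm, so the training-loop print was presumably extended to evaluate weights_norm as well; a minimal sketch of that logging branch (assuming the regularized loss replaces the original one before train_step is built, and the rest of the loop is unchanged):

    if itr % 100 == 0:
        # Also fetch the accumulated L2 norm of the weights for monitoring.
        accuracy_var, weights_var = sess.run((accuracy, weights_norm), feed_dict={X: test_x, label: test_y})
        print("iter:%d accuracy:%f loss:%f weight:%f" % (itr, accuracy_var, loss_var, weights_var))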