Machine Learning Starter Datasets - 6. Credit Card Fraud Prediction

Author: ac619467fef3 | Published 2019-02-12 23:36

    Using records of cardholders' credit card transactions, we build a fraud risk-control model. Given a new transaction, the model judges whether it is normal cardholder behavior or someone fraudulently using the card.
    The features have been transformed with PCA, which hides the sensitive details of the original records while preserving most of the information they carried.
    Deep neural networks are hard to interpret, and since the inputs are already PCA components the model overfits easily. Adding the weight magnitudes to the loss function penalizes large weights and reduces overfitting (implemented in the L2 section below).

    Examining the Data

    The dataset can be downloaded from the Kaggle website (login required).

    The dataset contains 284,807 transactions made by European cardholders over two days in September 2013, of which 492 are fraudulent (0.172% of all transactions). PCA maps the features into the numeric attributes V1, V2, ..., V28; only Time and Amount were left untransformed. The target is the binary variable Class: 1 marks a fraudulent transaction and 0 a normal one.
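    For a quick sanity check of this imbalance once the CSV is downloaded, a minimal sketch (the file path is an assumption):

    import pandas as pd

    # Load the Kaggle dataset and count normal (0) vs. fraudulent (1) rows.
    data_df = pd.read_csv("creditcard.csv")
    print(data_df["Class"].value_counts())  # expect 284315 vs. 492
    print("fraud ratio: %.3f%%" % (100 * data_df["Class"].mean()))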

    RangeIndex: 284807 entries, 0 to 284806
    Data columns (total 31 columns):
    Time      284807 non-null float64
    V1        284807 non-null float64
    V2        284807 non-null float64
    V3        284807 non-null float64
    V4        284807 non-null float64
    V5        284807 non-null float64
    V6        284807 non-null float64
    V7        284807 non-null float64
    V8        284807 non-null float64
    V9        284807 non-null float64
    V10       284807 non-null float64
    V11       284807 non-null float64
    V12       284807 non-null float64
    V13       284807 non-null float64
    V14       284807 non-null float64
    V15       284807 non-null float64
    V16       284807 non-null float64
    V17       284807 non-null float64
    V18       284807 non-null float64
    V19       284807 non-null float64
    V20       284807 non-null float64
    V21       284807 non-null float64
    V22       284807 non-null float64
    V23       284807 non-null float64
    V24       284807 non-null float64
    V25       284807 non-null float64
    V26       284807 non-null float64
    V27       284807 non-null float64
    V28       284807 non-null float64
    Amount    284807 non-null float64
    Class     284807 non-null int64
    dtypes: float64(30), int64(1)
    memory usage: 67.4 MB
               count          mean           std         min           25%  \
    Time    284807.0  9.481386e+04  47488.145955    0.000000  54201.500000
    V1      284807.0  3.919560e-15      1.958696  -56.407510     -0.920373
    V2      284807.0  5.688174e-16      1.651309  -72.715728     -0.598550
    V3      284807.0 -8.769071e-15      1.516255  -48.325589     -0.890365
    V4      284807.0  2.782312e-15      1.415869   -5.683171     -0.848640
    V5      284807.0 -1.552563e-15      1.380247 -113.743307     -0.691597
    V6      284807.0  2.010663e-15      1.332271  -26.160506     -0.768296
    V7      284807.0 -1.694249e-15      1.237094  -43.557242     -0.554076
    V8      284807.0 -1.927028e-16      1.194353  -73.216718     -0.208630
    V9      284807.0 -3.137024e-15      1.098632  -13.434066     -0.643098
    V10     284807.0  1.768627e-15      1.088850  -24.588262     -0.535426
    V11     284807.0  9.170318e-16      1.020713   -4.797473     -0.762494
    V12     284807.0 -1.810658e-15      0.999201  -18.683715     -0.405571
    V13     284807.0  1.693438e-15      0.995274   -5.791881     -0.648539
    V14     284807.0  1.479045e-15      0.958596  -19.214325     -0.425574
    V15     284807.0  3.482336e-15      0.915316   -4.498945     -0.582884
    V16     284807.0  1.392007e-15      0.876253  -14.129855     -0.468037
    V17     284807.0 -7.528491e-16      0.849337  -25.162799     -0.483748
    V18     284807.0  4.328772e-16      0.838176   -9.498746     -0.498850
    V19     284807.0  9.049732e-16      0.814041   -7.213527     -0.456299
    V20     284807.0  5.085503e-16      0.770925  -54.497720     -0.211721
    V21     284807.0  1.537294e-16      0.734524  -34.830382     -0.228395
    V22     284807.0  7.959909e-16      0.725702  -10.933144     -0.542350
    V23     284807.0  5.367590e-16      0.624460  -44.807735     -0.161846
    V24     284807.0  4.458112e-15      0.605647   -2.836627     -0.354586
    V25     284807.0  1.453003e-15      0.521278  -10.295397     -0.317145
    V26     284807.0  1.699104e-15      0.482227   -2.604551     -0.326984
    V27     284807.0 -3.660161e-16      0.403632  -22.565679     -0.070840
    V28     284807.0 -1.206049e-16      0.330083  -15.430084     -0.052960
    Amount  284807.0  8.834962e+01    250.120109    0.000000      5.600000
    Class   284807.0  1.727486e-03      0.041527    0.000000      0.000000
    
                     50%            75%            max
    Time    84692.000000  139320.500000  172792.000000
    V1          0.018109       1.315642       2.454930
    V2          0.065486       0.803724      22.057729
    V3          0.179846       1.027196       9.382558
    V4         -0.019847       0.743341      16.875344
    V5         -0.054336       0.611926      34.801666
    V6         -0.274187       0.398565      73.301626
    V7          0.040103       0.570436     120.589494
    V8          0.022358       0.327346      20.007208
    V9         -0.051429       0.597139      15.594995
    V10        -0.092917       0.453923      23.745136
    V11        -0.032757       0.739593      12.018913
    V12         0.140033       0.618238       7.848392
    V13        -0.013568       0.662505       7.126883
    V14         0.050601       0.493150      10.526766
    V15         0.048072       0.648821       8.877742
    V16         0.066413       0.523296      17.315112
    V17        -0.065676       0.399675       9.253526
    V18        -0.003636       0.500807       5.041069
    V19         0.003735       0.458949       5.591971
    V20        -0.062481       0.133041      39.420904
    V21        -0.029450       0.186377      27.202839
    V22         0.006782       0.528554      10.503090
    V23        -0.011193       0.147642      22.528412
    V24         0.040976       0.439527       4.584549
    V25         0.016594       0.350716       7.519589
    V26        -0.052139       0.240952       3.517346
    V27         0.001342       0.091045      31.612198
    V28         0.011244       0.078280      33.847808
    Amount     22.000000      77.165000   25691.160000
    Class       0.000000       0.000000       1.000000
    

    Because most features are PCA components, little preprocessing is needed.
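    One step that can still help is scaling Amount, the one heavily skewed column PCA left untouched. A hedged sketch, applied after loading the CSV as in the listing below (standardizing Amount is an addition of mine, not part of the original pipeline):

    # Standardize Amount to zero mean and unit variance,
    # putting it on a scale comparable to the PCA components.
    data_df["Amount"] = (data_df["Amount"] - data_df["Amount"].mean()) / data_df["Amount"].std()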

    Modeling: A Deep Neural Network

    One notable property of this dataset is that positive (fraud) labels are very rare, so each training batch should contain a balanced mix of positive and negative samples, as the loop below does by drawing 32 of each.

    import numpy as np
    import pandas as pd
    
    np.set_printoptions(suppress=True)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    
    data_df = pd.read_csv("/Users/wangsen/ai/13/homework/creditcard.csv")
    # data_df.info()
    # print(data_df.describe().T)
    
    # Drop the raw timestamp; it is not used as a feature here.
    data_df = data_df.drop("Time", axis=1)
    neg_df = data_df[data_df.Class == 0]  # normal transactions
    pos_df = data_df[data_df.Class == 1]  # fraudulent transactions
    
    neg_data = neg_df.drop('Class', axis=1).values
    pos_data = pos_df.drop('Class', axis=1).values
    
    print("neg_data shape:", neg_data.shape)
    print("pos_data shape:", pos_data.shape)
    
    import tensorflow as tf  # TensorFlow 1.x API
    
    # Placeholders: 29 input features (V1..V28 + Amount) and one-hot labels.
    X = tf.placeholder(dtype=tf.float32, shape=[None, 29])
    label = tf.placeholder(dtype=tf.float32, shape=[None, 2])
    
    # Fully connected network ending in 2 logits (normal vs. fraud).
    net = tf.layers.dense(X, 16, tf.nn.relu)
    net = tf.layers.dense(net, 256, tf.nn.relu)
    net = tf.layers.dense(net, 256, tf.nn.relu)
    net = tf.layers.dense(net, 256, tf.nn.relu)
    net = tf.layers.dense(net, 256, tf.nn.relu)
    y = tf.layers.dense(net, 2, None)
    # softmax_cross_entropy applies softmax itself, so y stays raw logits.
    loss = tf.losses.softmax_cross_entropy(label, y)
    
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(label, 1)), tf.float32))
    
    train_step = tf.train.AdamOptimizer(0.0001).minimize(loss)
    
    neg_high = neg_data.shape[0]  # 284315 normal samples
    pos_high = pos_data.shape[0]  # 492 fraud samples
    
    # One-hot labels for a fixed balanced batch: 32 negatives then 32 positives.
    input_y = np.zeros([64, 2])
    input_y[:32, 0] = 1
    input_y[32:, 1] = 1
    
    # Balanced test set: 450 normal + 450 fraud samples.
    # (Caveat: training batches may still sample these fraud rows.)
    test_x = np.concatenate([neg_data[10000:10000+450], pos_data[0:450]])
    test_y = np.zeros([900, 2])
    test_y[:450, 0] = 1
    test_y[450:, 1] = 1
    
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    l = []  # loss history
    s = []  # accuracy history
    for itr in range(10000):
        # Each batch is balanced: 32 random normal + 32 random fraud samples.
        neg_ind = np.random.randint(0, neg_high, 32)
        pos_ind = np.random.randint(0, pos_high, 32)
        input_x = np.concatenate([neg_data[neg_ind], pos_data[pos_ind]])
        _, loss_var = sess.run((train_step, loss), feed_dict={X: input_x, label: input_y})
        if itr % 100 == 0:
            accuracy_var = sess.run(accuracy, feed_dict={X: test_x, label: test_y})
            print("iter:%d accuracy:%f loss:%f" % (itr, accuracy_var, loss_var))
            s.append(accuracy_var)
            l.append(loss_var)
    import matplotlib.pyplot as plt
    plt.plot(l, color="red")    # loss
    plt.plot(s, color="green")  # accuracy
    plt.show()
    '''
    neg_data shape: (284315, 29)
    pos_data shape: (492, 29)
    '''
    
    iter:9900 accuracy:0.988889 loss:0.062907
    
    [Figure: training loss (red) and test accuracy (green)]
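    Accuracy on a balanced test set hides how the model trades false alarms against missed frauds, so precision and recall are worth computing as well. A sketch reusing sess, y, X, test_x, and test_y from the listing above (this evaluation is an addition, not part of the original post):

    # Predicted class is the argmax over the two logits.
    pred = sess.run(tf.argmax(y, 1), feed_dict={X: test_x})
    true = np.argmax(test_y, axis=1)
    
    tp = np.sum((pred == 1) & (true == 1))  # frauds caught
    fp = np.sum((pred == 1) & (true == 0))  # false alarms
    fn = np.sum((pred == 0) & (true == 1))  # frauds missed
    
    print("precision: %.4f" % (tp / (tp + fp)))
    print("recall:    %.4f" % (tp / (tp + fn)))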

    Preventing Overfitting: Adding an L2 Penalty

    # Sum the L2 norms of all kernel (weight) matrices; biases are excluded.
    loss_w = [tf.nn.l2_loss(var) for var in tf.trainable_variables() if "kernel" in var.name]
    print("variables:", tf.trainable_variables())
    weights_norm = tf.reduce_sum(loss_w)
    # This replaces the earlier loss definition (before building train_step).
    loss = tf.losses.softmax_cross_entropy(label, y) + 0.001 * weights_norm
    
    variables: [<tf.Variable 'dense/kernel:0' shape=(29, 16) dtype=float32_ref>, <tf.Variable 'dense/bias:0' shape=(16,) dtype=float32_ref>, <tf.Variable 'dense_1/kernel:0' shape=(16, 256) dtype=float32_ref>, <tf.Variable 'dense_1/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_2/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_2/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_3/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_3/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_4/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_4/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_5/kernel:0' shape=(256, 2) dtype=float32_ref>, <tf.Variable 'dense_5/bias:0' shape=(2,) dtype=float32_ref>]
    iter:4900 accuracy:0.972222 loss:0.291221 weight:197.521942
    
    [Figure: loss and accuracy with the L2 penalty added]
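    The same penalty can also be registered layer by layer instead of collected by hand; a sketch using TensorFlow 1.x's regularizer hooks (an alternative to the manual sum above, with the same 0.001 scale, not what the post used):

    # Attach an L2 regularizer to each dense layer's kernel.
    reg = tf.contrib.layers.l2_regularizer(scale=0.001)
    net = tf.layers.dense(X, 16, tf.nn.relu, kernel_regularizer=reg)
    net = tf.layers.dense(net, 256, tf.nn.relu, kernel_regularizer=reg)
    y = tf.layers.dense(net, 2, None, kernel_regularizer=reg)
    
    # Sum every registered penalty and add it to the data loss.
    loss = tf.losses.softmax_cross_entropy(label, y) + tf.losses.get_regularization_loss()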
