TensorFlow Basic Text Classification

Author: dalalaa | Published 2018-09-25 10:16

    Import the libraries

    import tensorflow as tf
    from tensorflow import keras
    
    import numpy as np
    
    print(tf.__version__)
    
    1.10.0
    

    Import the data

    Load the dataset. As in earlier posts, we use the workaround typical for users in mainland China: download the file manually first, then load it from a local path.

    imdb = keras.datasets.imdb
    
    (train_data, train_labels), (test_data, test_labels) = imdb.load_data(path = 'H:/tf_project/imdb.npz',num_words=10000)
    

    Data in .npz format can also be loaded directly with np.load(); the result is a dict-like object, which can be turned into a real dictionary with dict().

    .npy files loaded with np.load() come back directly as NumPy arrays.
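
    As a minimal sketch of that (not from the original post; the variable names are illustrative), the archive can be opened directly:

    # allow_pickle=True is needed on newer NumPy releases because the arrays
    # inside imdb.npz store Python lists (object dtype).
    raw = np.load('H:/tf_project/imdb.npz', allow_pickle=True)
    print(list(raw.keys()))     # names of the arrays stored in the archive
    raw_dict = dict(raw)        # turn the lazy NpzFile mapping into a plain dict
    raw.close()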

    print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
    
    Training entries: 25000, labels: 25000
    
    print(train_data[0])
    
    [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
    
    len(train_data[0]), len(train_data[1])
    
    (218, 189)
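
    The two reviews above already differ in length. As a quick hedged check (not in the original post), the spread over the whole training set can be inspected like this:

    # Review lengths vary widely, which is why every review is later padded
    # (or truncated) to a fixed length of 256.
    lengths = [len(x) for x in train_data]
    print(min(lengths), max(lengths), sum(lengths) / len(lengths))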
    

    Convert the integer arrays back into words

    # A dictionary mapping words to an integer index
    # Download the index file manually from:
    # https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
    word_index = imdb.get_word_index(r"H:\tf_project\imdb_word_index.json")
    word_index
    
    {'fawn': 34701,
     'tsukino': 52006,
     'nunnery': 52007,
     'sonja': 16816,
     'vani': 63951,
     'woods': 1408,
     ...}
    
    # Shift every word index up by 3 to make room for the special tokens
    word_index = {k:(v+3) for k,v in word_index.items()} 
    # Add the special tokens
    word_index["<PAD>"] = 0
    word_index["<START>"] = 1
    word_index["<UNK>"] = 2  # unknown
    word_index["<UNUSED>"] = 3
    
    # Swap keys and values so indices map back to words
    reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
    
    def decode_review(text):
        return ' '.join([reverse_word_index.get(i, '?') for i in text])
    

    Decode the training data back into text

    decode_review(train_data[0])
    
    "<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
    

    The reviews, which are arrays of integers, must be converted to tensors before they can be fed into the network. This can be done in either of two ways:

    1. One-hot (multi-hot) encode the reviews, i.e. turn each one into a vector containing only 0s and 1s. For example, the list [3, 5] becomes a 10,000-dimensional vector that is 1 at indices 3 and 5 and 0 everywhere else. This approach is memory-hungry (a small sketch of it follows this list).

    2. Pad the arrays so they all have the same length, then feed them into the network. This tutorial uses the padding approach, shown in the pad_sequences code below.
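
    For reference, a minimal sketch of option 1 (not from the original post; the helper name multi_hot_encode is illustrative):

    def multi_hot_encode(sequences, dimension=10000):
        # Each review becomes a `dimension`-long vector of zeros with a 1.0
        # at every word index that appears in the review.
        results = np.zeros((len(sequences), dimension))
        for i, seq in enumerate(sequences):
            results[i, seq] = 1.0
        return results

    # e.g. multi_hot_encode([[3, 5]])[0] is 1.0 only at positions 3 and 5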

    train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                            value=word_index["<PAD>"],
                                                            padding='post',
                                                            maxlen=256)
    test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                           value=word_index["<PAD>"],
                                                           padding='post',
                                                           maxlen=256)
    
    len(train_data[0]), len(train_data[1])
    
    (256, 256)
    
    print(train_data[0])
    
    [   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941    4
      173   36  256    5   25  100   43  838  112   50  670    2    9   35  480
      284    5  150    4  172  112  167    2  336  385   39    4  172 4536 1111
       17  546   38   13  447    4  192   50   16    6  147 2025   19   14   22
        4 1920 4613  469    4   22   71   87   12   16   43  530   38   76   15
       13 1247    4   22   17  515   17   12   16  626   18    2    5   62  386
       12    8  316    8  106    5    4 2223 5244   16  480   66 3785   33    4
      130   12   16   38  619    5   25  124   51   36  135   48   25 1415   33
        6   22   12  215   28   77   52    5   14  407   16   82    2    8    4
      107  117 5952   15  256    4    2    7 3766    5  723   36   71   43  530
      476   26  400  317   46    7    4    2 1029   13  104   88    4  381   15
      297   98   32 2071   56   26  141    6  194 7486   18    4  226   22   21
      134  476   26  480    5  144   30 5535   18   51   36   28  224   92   25
      104    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
      103   32   15   16 5345   19  178   32    0    0    0    0    0    0    0
        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
        0]
    

    Build the model

    vocab_size = 10000
    
    model = keras.Sequential()
    model.add(keras.layers.Embedding(vocab_size, 16))
    model.add(keras.layers.GlobalAveragePooling1D())
    model.add(keras.layers.Dense(16, activation=tf.nn.relu))
    model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
    
    model.summary()
    
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding (Embedding)        (None, None, 16)          160000    
    _________________________________________________________________
    global_average_pooling1d (Gl (None, 16)                0         
    _________________________________________________________________
    dense (Dense)                (None, 16)                272       
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 17        
    =================================================================
    Total params: 160,289
    Trainable params: 160,289
    Non-trainable params: 0
    _________________________________________________________________
    
    1. The first layer is an Embedding layer: it looks up a 16-dimensional vector for each word index, so its output has shape (batch, sequence, 16).

    2. The second layer is GlobalAveragePooling1D: it averages the embedding vectors over the sequence dimension, giving every review a fixed-length 16-dimensional representation (see the small NumPy sketch after this list).

    3. The third and fourth layers are fully connected (Dense) layers.

    4. The output layer has a single node; the sigmoid activation constrains the result to the 0-1 range.
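
    As a small NumPy sketch (toy shapes and values, not part of the original post) of what GlobalAveragePooling1D computes:

    # Illustrative only: average a (batch=1, seq_len=4, dim=3) "embedded sequence"
    # over the sequence axis, which is exactly what GlobalAveragePooling1D does.
    embedded = np.arange(12, dtype=np.float32).reshape(1, 4, 3)
    pooled = embedded.mean(axis=1)   # shape (1, 3), one fixed-length vector per review
    print(pooled)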

    Loss function and optimizer

    model.compile(optimizer=tf.train.AdamOptimizer(),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
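
    tf.train.AdamOptimizer() is the TF 1.x optimizer object; on newer TensorFlow or standalone Keras (a hedged aside, not what this post ran) the string name is equivalent:

    # Equivalent compile call using the optimizer's string name.
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])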
    

    Create a validation set

    x_val = train_data[:10000]
    partial_x_train = train_data[10000:]
    
    y_val = train_labels[:10000]
    partial_y_train = train_labels[10000:]
    

    Train the model

    Here history is the return value of fit(); it records how the loss and metrics evolved during training.

    history = model.fit(partial_x_train,
                        partial_y_train,
                        epochs=40,
                        batch_size=512,
                        validation_data=(x_val, y_val),
                        verbose=1)
    
    Train on 15000 samples, validate on 10000 samples
    Epoch 1/40
    15000/15000 [==============================] - 4s 249us/step - loss: 0.7391 - acc: 0.5035 - val_loss: 0.7010 - val_acc: 0.4947
    Epoch 2/40
    15000/15000 [==============================] - 1s 52us/step - loss: 0.6931 - acc: 0.5251 - val_loss: 0.6912 - val_acc: 0.5338
    Epoch 3/40
    15000/15000 [==============================] - 1s 51us/step - loss: 0.6903 - acc: 0.5801 - val_loss: 0.6897 - val_acc: 0.5656
    Epoch 4/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.6884 - acc: 0.6543 - val_loss: 0.6879 - val_acc: 0.6747
    Epoch 5/40
    15000/15000 [==============================] - 1s 48us/step - loss: 0.6864 - acc: 0.6421 - val_loss: 0.6860 - val_acc: 0.7004
    Epoch 6/40
    15000/15000 [==============================] - 1s 51us/step - loss: 0.6841 - acc: 0.7283 - val_loss: 0.6837 - val_acc: 0.7259
    Epoch 7/40
    15000/15000 [==============================] - 1s 53us/step - loss: 0.6810 - acc: 0.7203 - val_loss: 0.6805 - val_acc: 0.6978
    Epoch 8/40
    15000/15000 [==============================] - 1s 53us/step - loss: 0.6769 - acc: 0.7057 - val_loss: 0.6759 - val_acc: 0.6885
    Epoch 9/40
    15000/15000 [==============================] - 1s 51us/step - loss: 0.6707 - acc: 0.7150 - val_loss: 0.6695 - val_acc: 0.7142
    Epoch 10/40
    15000/15000 [==============================] - 1s 56us/step - loss: 0.6628 - acc: 0.7443 - val_loss: 0.6610 - val_acc: 0.7356
    Epoch 11/40
    15000/15000 [==============================] - 1s 51us/step - loss: 0.6529 - acc: 0.7487 - val_loss: 0.6503 - val_acc: 0.7497
    Epoch 12/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.6387 - acc: 0.7843 - val_loss: 0.6345 - val_acc: 0.7720
    Epoch 13/40
    15000/15000 [==============================] - 1s 56us/step - loss: 0.6182 - acc: 0.7861 - val_loss: 0.6157 - val_acc: 0.7727
    Epoch 14/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.5933 - acc: 0.7986 - val_loss: 0.5889 - val_acc: 0.7900
    Epoch 15/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.5614 - acc: 0.8103 - val_loss: 0.5584 - val_acc: 0.7956
    Epoch 16/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.5295 - acc: 0.8157 - val_loss: 0.5293 - val_acc: 0.8052
    Epoch 17/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.4963 - acc: 0.8327 - val_loss: 0.5008 - val_acc: 0.8192
    Epoch 18/40
    15000/15000 [==============================] - 1s 52us/step - loss: 0.4647 - acc: 0.8423 - val_loss: 0.4726 - val_acc: 0.8273
    Epoch 19/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.4349 - acc: 0.8519 - val_loss: 0.4471 - val_acc: 0.8363
    Epoch 20/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.4076 - acc: 0.8607 - val_loss: 0.4243 - val_acc: 0.8434
    Epoch 21/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.3829 - acc: 0.8707 - val_loss: 0.4043 - val_acc: 0.8489
    Epoch 22/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.3612 - acc: 0.8773 - val_loss: 0.3872 - val_acc: 0.8547
    Epoch 23/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.3424 - acc: 0.8833 - val_loss: 0.3729 - val_acc: 0.8587
    Epoch 24/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.3256 - acc: 0.8885 - val_loss: 0.3605 - val_acc: 0.8643
    Epoch 25/40
    15000/15000 [==============================] - 1s 48us/step - loss: 0.3111 - acc: 0.8935 - val_loss: 0.3500 - val_acc: 0.8673
    Epoch 26/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.2980 - acc: 0.8960 - val_loss: 0.3415 - val_acc: 0.8698
    Epoch 27/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.2868 - acc: 0.8989 - val_loss: 0.3338 - val_acc: 0.8711
    Epoch 28/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.2758 - acc: 0.9039 - val_loss: 0.3268 - val_acc: 0.8746
    Epoch 29/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.2666 - acc: 0.9058 - val_loss: 0.3218 - val_acc: 0.8751
    Epoch 30/40
    15000/15000 [==============================] - 1s 53us/step - loss: 0.2588 - acc: 0.9079 - val_loss: 0.3164 - val_acc: 0.8768
    Epoch 31/40
    15000/15000 [==============================] - 1s 51us/step - loss: 0.2498 - acc: 0.9125 - val_loss: 0.3124 - val_acc: 0.8769
    Epoch 32/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.2431 - acc: 0.9135 - val_loss: 0.3086 - val_acc: 0.8793
    Epoch 33/40
    15000/15000 [==============================] - 1s 50us/step - loss: 0.2352 - acc: 0.9170 - val_loss: 0.3052 - val_acc: 0.8805
    Epoch 34/40
    15000/15000 [==============================] - 1s 47us/step - loss: 0.2288 - acc: 0.9183 - val_loss: 0.3030 - val_acc: 0.8807
    Epoch 35/40
    15000/15000 [==============================] - 1s 51us/step - loss: 0.2231 - acc: 0.9195 - val_loss: 0.2998 - val_acc: 0.8802
    Epoch 36/40
    15000/15000 [==============================] - 1s 51us/step - loss: 0.2166 - acc: 0.9220 - val_loss: 0.2975 - val_acc: 0.8825
    Epoch 37/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.2111 - acc: 0.9247 - val_loss: 0.2956 - val_acc: 0.8831
    Epoch 38/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.2058 - acc: 0.9259 - val_loss: 0.2940 - val_acc: 0.8834
    Epoch 39/40
    15000/15000 [==============================] - 1s 49us/step - loss: 0.2003 - acc: 0.9294 - val_loss: 0.2922 - val_acc: 0.8846
    Epoch 40/40
    15000/15000 [==============================] - 1s 52us/step - loss: 0.1953 - acc: 0.9307 - val_loss: 0.2908 - val_acc: 0.8848
    

    Evaluate the model

    results = model.evaluate(test_data, test_labels)
    
    print(results)
    
    25000/25000 [==============================] - 2s 86us/step
    [0.3060342230606079, 0.87492000000000003]
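
    evaluate() returns [loss, accuracy]. To look at individual predictions, a minimal sketch (not in the original post):

    # Per-review probabilities from the sigmoid output;
    # values close to 1 mean "predicted positive review".
    predictions = model.predict(test_data[:3])
    print(predictions)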
    

    Visualize the results

    history_dict = history.history
    history_dict.keys()
    
    dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
    

    Loss

    import matplotlib.pyplot as plt
    %matplotlib inline
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    
    epochs = range(1, len(acc) + 1)
    
    # "bo" is for "blue dot"
    plt.plot(epochs, loss, 'bo', label='Training loss')
    # b is for "solid blue line"
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    
    plt.show()
    
    [Figure: Training and validation loss]

    In the plot above, the decline in validation loss slows down after roughly 20 epochs, and the model starts to overfit.
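
    One common way to stop before that point (a sketch assuming the same model and data variables are in scope, not something this post does) is an EarlyStopping callback:

    # Stop training once val_loss has not improved for 2 consecutive epochs,
    # instead of always running the full 40.
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)
    history = model.fit(partial_x_train, partial_y_train,
                        epochs=40, batch_size=512,
                        validation_data=(x_val, y_val),
                        callbacks=[early_stop],
                        verbose=1)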

    Accuracy

    plt.clf()   # clear the figure
    acc_values = history_dict['acc']
    val_acc_values = history_dict['val_acc']
    
    plt.plot(epochs, acc_values, 'bo', label='Training acc')
    plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    
    plt.show()
    
    [Figure: Training and validation accuracy]
