Notes from the China University MOOC course on TensorFlow.
1. Building a tokenizer and word index for tokenization
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
'i love my dog',
'I, love my cat',
'You love my dog!'
]
# Build a tokenizer whose word index keeps the 100 most frequent words for encoding
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences) # walk through the texts to build the encoding: keys are words, values are their indices
# Inspect the word index
word_index = tokenizer.word_index
print(word_index)
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
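By default the Tokenizer lowercases the text and strips punctuation before counting, and indices are assigned by descending word frequency. A minimal sketch (not part of the course) that reproduces the same ordering with collections.Counter; the manual punctuation handling here only approximates Keras' default filters:
```python
from collections import Counter

# Approximate the Tokenizer's defaults: lowercase, drop ',' and '!', split on spaces.
words = []
for s in sentences:
    words.extend(s.lower().replace(',', ' ').replace('!', ' ').split())

# most_common() sorts by frequency; the resulting order matches word_index above:
# love (3), my (3), i (2), dog (2), cat (1), you (1)
print(Counter(words).most_common())
```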
2. Converting texts to sequences with the tokenizer
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = [
'I love my dog',
'I love my cat',
'You love my dog!',
'Do you think my dog is amazing?'
]
# When an out-of-vocabulary word is encountered, substitute the special <OOV> token
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
Sequence conversion:
# convert the texts to sequences of word indices
sequences = tokenizer.texts_to_sequences(sentences)
Pad the sequences into a matrix of equal-length rows:
padded = pad_sequences(sequences,maxlen = 8) #, padding='post', truncating='post'
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)
# Output
Word Index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
Sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
Padded Sequences:
[[ 0 0 0 0 5 3 2 4]
[ 0 0 0 0 5 3 2 7]
[ 0 0 0 0 6 3 2 4]
[ 0 8 6 9 2 4 10 11]]
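The arguments commented out on the pad_sequences call above control where zeros are added and where long sentences are cut; by default both happen at the front ('pre'). A small sketch of the difference, reusing the sequences above (maxlen=5 is chosen only to force truncation):
```python
# 'post' pads and truncates at the end instead of the front.
padded_post = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')
print(padded_post)
# With truncating='post' the last sentence [8, 6, 9, 2, 4, 10, 11] loses its
# tail (10, 11); the default 'pre' would drop its head (8, 6) instead.
```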
3. Feeding sentences that contain out-of-vocabulary words to the tokenizer
# Try with words that the tokenizer wasn't fit to
test_data = [
'i really love my dog',
'my dog loves my manatee'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)
padded = pad_sequences(test_seq,maxlen = 2)
print("\nPadded Test Sequence: ")
print(padded)
# Output of the sequence conversion
Test Sequence = [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
Padded Test Sequence:
[[2 4]
[2 1]]
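For comparison, a tokenizer built without oov_token does not map unknown words to 1; it simply drops them, so the sequences become shorter. A minimal sketch (the variable name tokenizer_no_oov is just illustrative):
```python
# Without oov_token, words the tokenizer never saw are silently skipped.
tokenizer_no_oov = Tokenizer(num_words=100)
tokenizer_no_oov.fit_on_texts(sentences)
print(tokenizer_no_oov.texts_to_sequences(test_data))
# 'really', 'loves' and 'manatee' disappear from the output entirely.
```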
4. Preprocessing the sarcasm dataset
# !wget --no-check-certificate \
# https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
# -O sarcasm.json
import json
# Read the JSON file and build the sentence, label and URL lists
with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)
sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Build the tokenizer
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))
Output:
29657
# Convert to sequences and pad
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(sentences[2])
print(padded[2])
print(padded.shape) # 26,709 sentences, each padded to 40 tokens
Output:
mom starting to fear son's web series closest thing she will have to grandchild
[ 145 838 2 907 1749 2093 582 4719 221 143 39 46
2 10736 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]
(26709, 40)
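Since pad_sequences was called without maxlen, every headline is padded to the length of the longest tokenized headline, which is where the 40 in the shape comes from. A quick check:
```python
# The second dimension of padded equals the longest tokenized headline.
print(max(len(s) for s in sequences))  # expected to print 40
```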
5. An IMDB classifier with a visualizable embedding
# NOTE: PLEASE MAKE SURE YOU ARE RUNNING THIS IN A PYTHON3 ENVIRONMENT
import tensorflow as tf
print(tf.__version__)
# This is needed for the iterator over the data
# But not necessary if you have TF 2.0 installed
#!pip install tensorflow==2.0.0-beta0
2.4.0
# Eager execution can be enabled with tf.enable_eager_execution(), but TensorFlow 2.x does not need this call.
#tf.enable_eager_execution()
# Per https://blog.csdn.net/weixin_42462804/article/details/105558997 this becomes:
#tf.compat.v1.enable_eager_execution() # not needed on TF 2.0 and later
#!pip install -q tensorflow-datasets
To perform sentiment analysis on text, the model must learn the key information in the corpus, much as a convolutional neural network extracts features from images. The core idea of an embedding is to map all related words to vectors that cluster together in a multi-dimensional space.
Example: an embedding for movie review classification.
Given the existing review labels, TensorFlow can learn an embedding that clusters the words characteristic of each type of review.
We will use word embeddings to build an IMDB sentiment classifier whose embedding can be visualized.
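Before building the full classifier, here is a minimal standalone sketch (the sizes are illustrative, not the ones used below) of what an Embedding layer does: it maps each integer word index to a trainable vector, so a batch of padded sequences becomes a tensor of shape (batch, sequence_length, embedding_dim):
```python
import numpy as np
import tensorflow as tf

# Toy embedding: a vocabulary of 100 words, each mapped to a 4-dimensional vector.
demo_embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=4)
sample = np.array([[5, 3, 2, 4, 0, 0]])   # one padded sequence of length 6
print(demo_embedding(sample).shape)        # (1, 6, 4): batch, tokens, embedding dims
```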
Loading the tensorflow_datasets (tfds) library:
Loading tfds raised an error; following https://blog.csdn.net/qq_42192693/article/details/104950706, the fix is:
1. Uninstall jupyter: pip uninstall jupyter
2. Reinstall jupyter: pip install jupyter
3. Install ipywidgets: pip install ipywidgets
4. Enable the extension: jupyter nbextension enable --py widgetsnbextension
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
# Put the sentences and labels into lists
import numpy as np
train_data, test_data = imdb['train'], imdb['test']
training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []
# In Python 3, str(s.numpy()) is needed to turn the byte-string tensor value into a string
for s,l in train_data:
    # extract the value from the tensor with the .numpy() method
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())
for s,l in test_data: # s is the tensor holding the review text
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())
# convert the label lists to NumPy arrays
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
For reference, each raw element s is a scalar string tensor, for example:
<tf.Tensor: shape=(), dtype=string, numpy=b"They just don't make cartoons like they used to. This one had wit, great characters, and the greatest ensemble of voice over artists ever assembled for a daytime cartoon show. This still remains as one of the highest rated daytime cartoon shows, and one of the most honored, winning several Emmy Awards.">
Take a look at testing_sentences:
testing_sentences[0:1]
['b"There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel\'s absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven\'t laughed this hard since I saw THE FULL MONTY. (And, even then, I don\'t think I laughed quite this hard... So to speak.) Tukel\'s talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there\'s none of the over-the-top scenery chewing one might\'ve expected from a film like this). DING-A-LING-LESS is a film whose time has come."']
training_labels_final
array([0, 0, 0, ..., 0, 0, 1], dtype=int64)
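As an alternative to the .numpy() loop above, recent versions of tensorflow_datasets provide tfds.as_numpy, which yields plain NumPy values; a minimal sketch (decoding the byte strings instead of wrapping them with str()):
```python
# Sketch: iterate the dataset as NumPy objects; each s arrives as a bytes object.
training_sentences, training_labels = [], []
for s, l in tfds.as_numpy(train_data):
    training_sentences.append(s.decode('utf-8'))
    training_labels.append(int(l))
```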
Tokenizing the sentences
# Hyperparameters
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<OOV>"
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Tokenize and build the word index
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
# Convert to sequences
sequences = tokenizer.texts_to_sequences(training_sentences)
# Pad/truncate into a matrix of equal-length rows
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type, padding='post')
# Process the test set the same way
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length) # note: for strict consistency with the training data, padding='post' and truncating=trunc_type could be passed here as well
In natural language processing there are two ways to flatten the embedding output before the dense layers.
The first flattens everything directly:
# Build the neural network model
model1 = tf.keras.Sequential([
    # The embedding layer finds, in a high-dimensional space, similar vectors to represent words that express the same sentiment.
    # Because the words share labels, their vectors gradually cluster together,
    # so the network can learn these vectors and the link between vectors and labels;
    # the vectors become the bridge between words and the sentiment they represent.
    # For each sentence the embedding outputs a 2-D array: one row per token, one column per embedding dimension.
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # Flatten and feed into the fully connected layers
    tf.keras.layers.Flatten(), # the first flattening option is a Flatten layer
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model1.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model1.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_4 (Embedding) (None, 120, 16) 160000
_________________________________________________________________
flatten_1 (Flatten) (None, 1920) 0
_________________________________________________________________
dense_7 (Dense) (None, 6) 11526
_________________________________________________________________
dense_8 (Dense) (None, 1) 7
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
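The parameter counts in the summary can be checked by hand: the embedding stores vocab_size x embedding_dim weights, Flatten reshapes the (120, 16) output into a 1920-long vector with no parameters of its own, and each Dense layer has inputs x units weights plus units biases:
```python
# Recomputing model1's parameter counts from the summary above.
embedding_params = 10000 * 16        # 160000
flatten_length   = 120 * 16          # 1920, contributes no parameters
dense1_params    = 1920 * 6 + 6      # 11526
dense2_params    = 6 * 1 + 1         # 7
print(embedding_params + dense1_params + dense2_params)  # 171533
```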
The second uses a global average pooling layer:
model2 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(), # average over the sequence positions of each embedding dimension
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary() # this model is simpler and trains faster; it reminds me of collapsing single-cell transcriptome data into bulk profiles
Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_6 (Embedding) (None, 120, 16) 160000
_________________________________________________________________
global_average_pooling1d_4 ( (None, 16) 0
_________________________________________________________________
dense_11 (Dense) (None, 6) 102
_________________________________________________________________
dense_12 (Dense) (None, 1) 7
=================================================================
Total params: 160,109
Trainable params: 160,109
Non-trainable params: 0
_________________________________________________________________
Training model1
The network trains effectively, but it can also overfit.
num_epochs = 10 # number of training epochs
model1.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
Epoch 1/10
782/782 [==============================] - 10s 11ms/step - loss: 0.6031 - accuracy: 0.6439 - val_loss: 0.3543 - val_accuracy: 0.8456
Epoch 2/10
782/782 [==============================] - 8s 10ms/step - loss: 0.2439 - accuracy: 0.9082 - val_loss: 0.3752 - val_accuracy: 0.8351
Epoch 3/10
782/782 [==============================] - 8s 10ms/step - loss: 0.1029 - accuracy: 0.9742 - val_loss: 0.4578 - val_accuracy: 0.8241
Epoch 4/10
782/782 [==============================] - 8s 10ms/step - loss: 0.0290 - accuracy: 0.9963 - val_loss: 0.5767 - val_accuracy: 0.8080
Epoch 5/10
782/782 [==============================] - 8s 10ms/step - loss: 0.0069 - accuracy: 0.9995 - val_loss: 0.6223 - val_accuracy: 0.8159
Epoch 6/10
782/782 [==============================] - 8s 10ms/step - loss: 0.0025 - accuracy: 0.9999 - val_loss: 0.6873 - val_accuracy: 0.8143
Epoch 7/10
782/782 [==============================] - 8s 10ms/step - loss: 9.8368e-04 - accuracy: 1.0000 - val_loss: 0.7294 - val_accuracy: 0.8160
Epoch 8/10
782/782 [==============================] - 8s 10ms/step - loss: 5.1677e-04 - accuracy: 1.0000 - val_loss: 0.7731 - val_accuracy: 0.8150
Epoch 9/10
782/782 [==============================] - 8s 10ms/step - loss: 2.8761e-04 - accuracy: 1.0000 - val_loss: 0.8148 - val_accuracy: 0.8156
Epoch 10/10
782/782 [==============================] - 8s 10ms/step - loss: 1.7435e-04 - accuracy: 1.0000 - val_loss: 0.8601 - val_accuracy: 0.8143
<tensorflow.python.keras.callbacks.History at 0x21916125c10>
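The val_loss rising from epoch 2 onward while training accuracy approaches 1.0 is the overfitting mentioned above. One common mitigation, not part of the course code, is an EarlyStopping callback; a minimal sketch:
```python
# Sketch: stop training once val_loss stops improving and restore the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=2, restore_best_weights=True)

model1.fit(padded, training_labels_final,
           epochs=num_epochs,
           validation_data=(testing_padded, testing_labels_final),
           callbacks=[early_stop])
```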
Training model2
num_epochs = 10 # number of training epochs; I saw no obvious difference in speed here, but the accuracy seems worse
model2.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
Epoch 1/10
782/782 [==============================] - 9s 10ms/step - loss: 0.6451 - accuracy: 0.6336 - val_loss: 0.4048 - val_accuracy: 0.8382
Epoch 2/10
782/782 [==============================] - 8s 11ms/step - loss: 0.3543 - accuracy: 0.8596 - val_loss: 0.3381 - val_accuracy: 0.8551
Epoch 3/10
782/782 [==============================] - 8s 10ms/step - loss: 0.2807 - accuracy: 0.8864 - val_loss: 0.3338 - val_accuracy: 0.8568
Epoch 4/10
782/782 [==============================] - 8s 11ms/step - loss: 0.2368 - accuracy: 0.9087 - val_loss: 0.3450 - val_accuracy: 0.8542
Epoch 5/10
782/782 [==============================] - 8s 10ms/step - loss: 0.2063 - accuracy: 0.9259 - val_loss: 0.3660 - val_accuracy: 0.8474
Epoch 6/10
782/782 [==============================] - 8s 10ms/step - loss: 0.1850 - accuracy: 0.9345 - val_loss: 0.3897 - val_accuracy: 0.8427
Epoch 7/10
782/782 [==============================] - 8s 10ms/step - loss: 0.1638 - accuracy: 0.9427 - val_loss: 0.4174 - val_accuracy: 0.8374
Epoch 8/10
782/782 [==============================] - 8s 11ms/step - loss: 0.1485 - accuracy: 0.9496 - val_loss: 0.4447 - val_accuracy: 0.8316
Epoch 9/10
782/782 [==============================] - 8s 10ms/step - loss: 0.1354 - accuracy: 0.9566 - val_loss: 0.4815 - val_accuracy: 0.8286
Epoch 10/10
782/782 [==============================] - 9s 11ms/step - loss: 0.1258 - accuracy: 0.9615 - val_loss: 0.5129 - val_accuracy: 0.8252
<tensorflow.python.keras.callbacks.History at 0x2195f3887c0>
Going deeper into the embedding (visualizing the layer)
# First, get the weights of layer 0, the embedding layer
e = model2.layers[0] # embedding layer of the trained model (model1 could be inspected the same way)
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim), i.e. a 10000 x 16 matrix
# the embedding layer holds 10,000 words, each mapped to a 16-dimensional vector
(10000, 16)
weights
array([[-0.0037344 , 0.0441438 , 0.02674736, ..., 0.00774414,
0.0354718 , -0.04018337],
[ 0.00612024, -0.0319906 , -0.01417147, ..., 0.02941039,
0.00112623, -0.01814681],
[-0.0116765 , 0.02541435, 0.04666061, ..., -0.0051497 ,
-0.04277205, -0.04614462],
...,
[ 0.03461715, 0.02974167, 0.01525644, ..., -0.02042359,
-0.0231557 , 0.01969254],
[ 0.03000433, -0.02585316, 0.02725101, ..., -0.01997694,
0.04028039, 0.03279526],
[ 0.0287138 , -0.01199107, -0.04754802, ..., 0.01724907,
0.00563254, -0.00768729]], dtype=float32)
Decoding the sequence output back to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_review(padded[1]))
b'i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the <OOV> and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own <OOV> without any real concern for anything else i cant recommend this film at all ' ? ? ? ? ? ? ?
The original text:
print(training_sentences[1])
b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.'
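Since the embedding is supposed to pull words with similar sentiment together, one quick local sanity check before exporting is to look up a word's nearest neighbours by cosine similarity. A minimal sketch, not from the course, assuming the query word's index is below vocab_size:
```python
import numpy as np

def nearest_words(word, top_n=5):
    # Embedding vector of the query word (assumes word_index[word] < vocab_size).
    v = weights[word_index[word]]
    # Cosine similarity of every vocabulary vector against the query vector.
    sims = weights @ v / (np.linalg.norm(weights, axis=1) * np.linalg.norm(v) + 1e-9)
    best = np.argsort(-sims)[1:top_n + 1]   # skip the query word itself
    return [reverse_word_index.get(i, '?') for i in best]

print(nearest_words('great'))
```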
Write the vocabulary words (from the reversed word_index) and the embedding-layer weights to out_m and out_v respectively:
import io
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n") # one word per line, saved as meta.tsv
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n") # one tab-separated embedding vector per line, saved as vecs.tsv
out_v.close()
out_m.close()
https://projector.tensorflow.org/ — the visualization site (TensorFlow Embedding Projector) where vecs.tsv and meta.tsv can be loaded.
# try:
# from google.colab import files
# except ImportError:
# pass
# else:
# files.download('vecs.tsv')
# files.download('meta.tsv')