Attention Mechanism
![](https://img.haomeiwen.com/i8207483/07bad6130c901050.jpg)
Reference: Dive into Deep Learning (《动手学深度学习》)
Reference: Prof. Hung-yi Lee's Machine Learning course (《李宏毅老师机器学习》)
![](https://img.haomeiwen.com/i8207483/31bac42b0b910e10.jpg)
Related reference materials
![](https://img.haomeiwen.com/i8207483/bf1de1ee54ff560b.jpg)
Implementing a model with an attention mechanism using TensorFlow 2.0
In the encoder–decoder (seq2seq) section, the decoder relies on the same context variable at every time step to obtain information about the encoder's input sequence. When the encoder is a recurrent neural network, the context variable comes from the hidden state of its final time step.
Now, let us revisit the translation example from that section:
- the input is the English sequence "Hi, guys."
- the output is the Spanish sequence "Ey, chavales.".
It is not hard to see that, when generating each word of the output sequence, the decoder may only need to use (attend to) the information from a particular part of the input sequence. In other words, at each decoding time step we can assign a different amount of attention to the representations, or encodings, of the different time steps of the input sequence. This is where the attention mechanism gets its name.
Still taking a recurrent neural network as an example, the attention mechanism obtains the context variable by taking a weighted average over the encoder hidden states of all time steps. At each time step the decoder adjusts these weights, called attention weights, so it can focus on different parts of the input sequence at different time steps and encode them into the context variable of the corresponding time step. In this section we discuss how the attention mechanism works.
In the encoder–decoder (seq2seq) post we distinguished the index $t$ of the input sequence (encoder) from the index $t'$ of the output sequence (decoder). In that post, the decoder's hidden state at time step $t'$ is $\boldsymbol{s}_{t'} = g(\boldsymbol{y}_{t'-1}, \boldsymbol{c}, \boldsymbol{s}_{t'-1})$, where $\boldsymbol{y}_{t'-1}$ is the representation of the output $y_{t'-1}$ from the previous time step $t'-1$, and every time step $t'$ uses the same context variable $\boldsymbol{c}$. With the attention mechanism, however, the decoder uses a different context variable at each time step. Let $\boldsymbol{c}_{t'}$ be the context variable of the decoder at time step $t'$; the decoder's hidden state at that time step can then be rewritten as

$$\boldsymbol{s}_{t'} = g(\boldsymbol{y}_{t'-1}, \boldsymbol{c}_{t'}, \boldsymbol{s}_{t'-1}).$$

The two key questions are how to compute the context variable $\boldsymbol{c}_{t'}$ and how to use it to update the hidden state $\boldsymbol{s}_{t'}$. We describe these two points below.
Computing the context variable
We first describe the first key point: computing the context variable. The figure below illustrates how the attention mechanism computes the context variable for the decoder at time step 2. First, a function $a$ computes the input of the softmax operation from the decoder's hidden state at time step 1 and the encoder's hidden states at every time step. The softmax operation outputs a probability distribution, which is used to take a weighted average over the encoder hidden states of all time steps, giving the context variable.
![](https://img.haomeiwen.com/i8207483/e8ac29073ad66bb5.png)
Specifically, let the encoder's hidden state at time step $t$ be $\boldsymbol{h}_t$, and let the total number of encoder time steps be $T$. Then the context variable of the decoder at time step $t'$ is a weighted average over all encoder hidden states:

$$\boldsymbol{c}_{t'} = \sum_{t=1}^{T} \alpha_{t' t} \boldsymbol{h}_t,$$

where, for a given $t'$, the weights $\alpha_{t' t}$ over $t = 1, \ldots, T$ form a probability distribution. To obtain a probability distribution we can use the softmax operation:

$$\alpha_{t' t} = \frac{\exp(e_{t' t})}{\sum_{k=1}^{T} \exp(e_{t' k})}, \quad t = 1, \ldots, T.$$

We now need to define how to compute the input $e_{t' t}$ of the softmax operation above. Since $e_{t' t}$ depends on both the decoder time step $t'$ and the encoder time step $t$, it is natural to take the decoder's hidden state $\boldsymbol{s}_{t'-1}$ at time step $t'-1$ and the encoder's hidden state $\boldsymbol{h}_t$ at time step $t$ as inputs and compute $e_{t' t}$ through a function $a$:

$$e_{t' t} = a(\boldsymbol{s}_{t'-1}, \boldsymbol{h}_t).$$

There are several choices for the function $a$. If the two input vectors have the same length, a simple choice is their inner product $a(\boldsymbol{s}, \boldsymbol{h}) = \boldsymbol{s}^\top \boldsymbol{h}$. The paper that first proposed the attention mechanism instead concatenates the inputs and passes them through a multilayer perceptron with a single hidden layer:

$$a(\boldsymbol{s}, \boldsymbol{h}) = \boldsymbol{v}^\top \tanh(\boldsymbol{W}_s \boldsymbol{s} + \boldsymbol{W}_h \boldsymbol{h}),$$

where $\boldsymbol{v}$, $\boldsymbol{W}_s$ and $\boldsymbol{W}_h$ are all learnable model parameters.
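To make these formulas concrete, here is a minimal sketch for a single decoder time step. The tensor sizes and variable names below are made up for illustration and are independent of the model built later in this post; it simply computes the scores $e_{t' t}$ with the single-hidden-layer MLP, turns them into attention weights $\alpha_{t' t}$ with softmax, and takes the weighted average to get the context variable $\boldsymbol{c}_{t'}$.
import tensorflow as tf

T, enc_units, dec_units, attn_units = 5, 8, 8, 4   # illustrative sizes

h_enc = tf.random.normal((T, enc_units))     # encoder hidden states h_1, ..., h_T
s_prev = tf.random.normal((1, dec_units))    # decoder hidden state s_{t'-1}

# Learnable parameters W_h, W_s, v of the single-hidden-layer MLP score function
W_h = tf.Variable(tf.random.normal((enc_units, attn_units)))
W_s = tf.Variable(tf.random.normal((dec_units, attn_units)))
v = tf.Variable(tf.random.normal((attn_units, 1)))

# e_{t't} = v^T tanh(W_s s_{t'-1} + W_h h_t), computed for all t at once -> shape (T, 1)
scores = tf.matmul(tf.tanh(tf.matmul(h_enc, W_h) + tf.matmul(s_prev, W_s)), v)

# alpha_{t't}: softmax of the scores over the encoder time steps t
alpha = tf.nn.softmax(scores, axis=0)        # shape (T, 1), sums to 1

# c_{t'} = sum_t alpha_{t't} h_t  (weighted average of the encoder hidden states)
context = tf.reduce_sum(alpha * h_enc, axis=0)   # shape (enc_units,)
print(alpha.numpy().ravel(), context.shape)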
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import unicodedata
import re
import numpy as np
import os
import io
import time
path_to_zip = tf.keras.utils.get_file(
'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
extract=True)
path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
en_spa_file_path = "./spa-eng/spa.txt"
![](https://img.haomeiwen.com/i8207483/0107a913b093a566.jpg)
Code
Data preprocessing
Define the model
- Encoder
- Attention mechanism
- Decoder
# Convert a Unicode string to ASCII.
# The ASCII representation is also more compact than full Unicode.
def unicode_to_ascii(s):
    # Strip accents (e.g. from Spanish) by removing combining characters
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
en_sentence = 'Then what?'
sp_senetence = '¿Entonces qué?'
print(unicode_to_ascii(en_sentence))
print(unicode_to_ascii(sp_senetence))
Then what?
¿Entonces que?
After the conversion, the English sentence is unchanged, while the accent on the Spanish "é" is gone. Next we separate the punctuation from the words.
def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    # Insert a space between a word and the punctuation that follows it
    # e.g.: "he is a boy." => "he is a boy ."
    # Reference: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # Collapse runs of spaces (the character class also catches stray double quotes) into one space
    w = re.sub(r'[" "]+', " ", w)
    # Replace everything except letters and the punctuation (".", "?", "!", ",", "¿") with a space
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    # Strip leading and trailing whitespace
    w = w.rstrip().strip()
    # Add start and end tokens to the sentence
    # so the model knows when to start and stop predicting
    w = '<start> ' + w + ' <end>'
    return w
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_senetence))
<start> then what ? <end>
<start> ¿ entonces que ? <end>
Reading the file
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
    # Each line is a tab-separated English -> Spanish pair
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]
    return zip(*word_pairs)
a = [(1,2),(3,4),(5,6)]
c,d = zip(*a)
print(c,d)
(1, 3, 5) (2, 4, 6)
en, sp = create_dataset(en_spa_file_path, None)
print(en[-1])
print(sp[-1])
<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>
def tokenizer(lang):
    # num_words=None would cap the vocabulary size; filters is a blacklist of characters to drop
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', split=' ')
    # Build the vocabulary from word frequencies
    lang_tokenizer.fit_on_texts(lang)
    # Convert the corpus from text to sequences of word ids
    tensor = lang_tokenizer.texts_to_sequences(lang)
    # Pad the sequences to a uniform length
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                           padding='post')
    return tensor, lang_tokenizer
input_tensor,input_tokenizer = tokenizer(sp[:30000])
output_tensor,output_tokenizer = tokenizer(en[:30000])
Define a helper that returns the maximum sentence length in a tensor.
def max_length(tensor):
    return max(len(t) for t in tensor)
max_length_input = max_length(input_tensor)
max_length_output = max_length(output_tensor)
print(max_length_input,max_length_output)
16 11
Splitting into training and evaluation sets
num_examples = 30000
input_train, input_eval, output_train, output_eval = train_test_split(
input_tensor,output_tensor,test_size=0.2)
len(input_train),len(input_eval),len(output_train),len(output_eval)
(24000, 6000, 24000, 6000)
def convert(example, tokenizer):
    for t in example:
        if t != 0:
            print("%d ----> %s" % (t, tokenizer.index_word[t]))
print ("Input Language; index to word mapping")
convert(input_train[0], input_tokenizer)
print ()
print ("Target Language; index to word mapping")
convert(output_train[0], output_tokenizer)
Input Language; index to word mapping
1 ----> <start>
69 ----> todos
6434 ----> comtemplaron
3 ----> .
2 ----> <end>
Target Language; index to word mapping
1 ----> <start>
28 ----> they
67 ----> all
700 ----> watched
3 ----> .
2 ----> <end>
def make_dataset(input_tensor, output_tensor, batch_size, epochs, shuffle):
    dataset = tf.data.Dataset.from_tensor_slices((input_tensor, output_tensor))
    if shuffle:
        dataset = dataset.shuffle(30000)
    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)
    return dataset
batch_size = 64
epochs = 20
train_dataset = make_dataset(input_train,output_train,
batch_size,epochs,True)
eval_dataset = make_dataset(input_eval,output_eval,batch_size,1,False)
for x, y in train_dataset.take(1):
    print(x.shape)
    print(y.shape)
    print(x)
    print(y)
(64, 16)
(64, 11)
tf.Tensor(
[[ 1 148 13 ... 0 0 0]
[ 1 9 92 ... 0 0 0]
[ 1 78 1104 ... 0 0 0]
...
[ 1 29 259 ... 0 0 0]
[ 1 28 55 ... 0 0 0]
[ 1 9 172 ... 0 0 0]], shape=(64, 16), dtype=int32)
tf.Tensor(
[[ 1 4 133 4689 320 3 2 0 0 0 0]
[ 1 13 104 8 585 3 2 0 0 0 0]
[ 1 157 1224 3 2 0 0 0 0 0 0]
[ 1 13 546 8 217 3 2 0 0 0 0]
[ 1 28 47 15 73 81 3 2 0 0 0]
[ 1 2519 13 980 3 2 0 0 0 0 0]
[ 1 19 8 34 601 3 2 0 0 0 0]
[ 1 30 12 6 29 9 413 7 2 0 0]
[ 1 14 38 64 197 10 3 2 0 0 0]
[ 1 4 135 141 54 41 3 2 0 0 0]
[ 1 5 956 3 2 0 0 0 0 0 0]
[ 1 135 19 3 2 0 0 0 0 0 0]
[ 1 46 17 1050 20 3 2 0 0 0 0]
[ 1 176 12 442 436 7 2 0 0 0 0]
[ 1 569 50 49 5 3 2 0 0 0 0]
[ 1 4 693 10 11 252 3 2 0 0 0]
[ 1 10 11 103 868 3 2 0 0 0 0]
[ 1 14 1039 81 655 3 2 0 0 0 0]
[ 1 195 118 109 3 2 0 0 0 0 0]
[ 1 16 23 74 59 424 3 2 0 0 0]
[ 1 5 105 61 940 3 2 0 0 0 0]
[ 1 10 11 34 517 3 2 0 0 0 0]
[ 1 46 17 122 5 3 2 0 0 0 0]
[ 1 5 1685 3 2 0 0 0 0 0 0]
[ 1 1290 5 26 107 3 2 0 0 0 0]
[ 1 102 13 201 185 3 2 0 0 0 0]
[ 1 5 179 348 3 2 0 0 0 0 0]
[ 1 140 329 24 963 3 2 0 0 0 0]
[ 1 5 895 224 3 2 0 0 0 0 0]
[ 1 75 27 53 7 2 0 0 0 0 0]
[ 1 96 4 873 13 533 7 2 0 0 0]
[ 1 4 47 9 770 3 2 0 0 0 0]
[ 1 4 127 380 9 153 3 2 0 0 0]
[ 1 62 4 1990 7 2 0 0 0 0 0]
[ 1 4 18 1964 3 2 0 0 0 0 0]
[ 1 40 9 159 3 2 0 0 0 0 0]
[ 1 4 77 6 266 3 2 0 0 0 0]
[ 1 4 114 6 1352 3 2 0 0 0 0]
[ 1 14 46 81 36 3 2 0 0 0 0]
[ 1 4 117 140 4699 3 2 0 0 0 0]
[ 1 10 8 78 220 3 2 0 0 0 0]
[ 1 32 11 13 1451 7 2 0 0 0 0]
[ 1 4 311 9 698 3 2 0 0 0 0]
[ 1 4 328 41 203 803 3 2 0 0 0]
[ 1 4 18 34 9 443 159 3 2 0 0]
[ 1 19 169 339 494 3 2 0 0 0 0]
[ 1 30 12 456 31 837 3 2 0 0 0]
[ 1 20 11 2570 3 2 0 0 0 0 0]
[ 1 14 1299 61 471 3 2 0 0 0 0]
[ 1 10 38 64 824 3 2 0 0 0 0]
[ 1 20 11 308 3 2 0 0 0 0 0]
[ 1 4 65 210 175 66 3 2 0 0 0]
[ 1 195 118 545 3 2 0 0 0 0 0]
[ 1 25 4 72 6 131 7 2 0 0 0]
[ 1 4 18 34 342 119 10 3 2 0 0]
[ 1 4 42 10 9 817 1265 3 2 0 0]
[ 1 71 8 45 7 2 0 0 0 0 0]
[ 1 28 92 184 3 2 0 0 0 0 0]
[ 1 4 75 40 91 284 3 2 0 0 0]
[ 1 82 8 31 1087 315 7 2 0 0 0]
[ 1 4 25 103 256 6 3 2 0 0 0]
[ 1 28 23 269 59 41 3 2 0 0 0]
[ 1 20 26 48 248 3 2 0 0 0 0]
[ 1 14 175 14 26 626 3 2 0 0 0]], shape=(64, 11), dtype=int32)
Model definition
embedding_units = 256
units = 1024
input_vocab_size = len(input_tokenizer.word_index) + 1
output_vocab_size = len(output_tokenizer.word_index) + 1
Define the encoder
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, encoding_units, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.encoding_units = encoding_units
        # Embedding layer: word ids -> dense vectors
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.encoding_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    # Create an all-zero initial hidden state
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size, self.encoding_units))
encoder = Encoder(input_vocab_size, embedding_units, units, batch_size)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(x,sample_hidden)
print("shape of sample output: ",sample_output.shape)
print("shape of sample hidden: ",sample_hidden.shape)
shape of sample output: (64, 16, 1024)
shape of sample hidden: (64, 1024)
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        # Three dense layers: W1 and W2 project the encoder outputs and the
        # decoder hidden state; V maps the result to a scalar score
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query is the decoder hidden state, values are the encoder outputs
        # query shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        # The extra time axis lets the addition below broadcast when computing the score
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # score shape == (batch_size, max_length, 1)
        # The last axis is 1 because we apply self.V;
        # before self.V the tensor shape is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # context_vector shape after the sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print("Attention result shape: (batch size, units) {}"
.format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}"
.format(attention_weights.shape))
Attention result shape: (batch size, units) (64, 1024)
Attention weights shape: (batch_size, sequence_length, 1) (64, 16, 1)
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        # Embedding layer for the target language
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # Attention layer
        self.attention = BahdanauAttention(self.dec_units)

    # hidden is the decoder hidden state from the previous step
    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after the embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # Pass the concatenated vector to the GRU
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # x shape == (batch_size, vocab_size)
        x = self.fc(output)
        return x, state, attention_weights
decoder = Decoder(output_vocab_size, embedding_units, units, batch_size)
sample_decoder_output, decoder_hidden, decoder_aw = decoder(tf.random.uniform((64, 1)),
sample_hidden, sample_output)
print ('Decoder output shape: (batch_size, vocab size) {}'
.format(sample_decoder_output.shape))
print('Decoder decoder_hidden shape: (batch_size, units) {}'
      .format(decoder_hidden.shape))
print('Decoder decoder_aw shape: (batch_size, sequence_length, 1) {}'
      .format(decoder_aw.shape))
Decoder output shape: (batch_size, vocab size) (64, 4935)
Decoder decoder_hidden shape: (batch_size, units) (64, 1024)
Decoder decoder_aw shape: (batch_size, sequence_length, 1) (64, 16, 1)
optimizer = tf.keras.optimizers.Adam()
# The targets are word ids, so we use sparse categorical cross-entropy on the logits
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# Define the loss function
def loss_function(real, pred):
    # Mask out the padded positions (id 0) so their loss contribution becomes 0
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
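As a quick, illustrative sanity check of the masking (the tensors below are made up and are not part of the original post), a padded position with target id 0 contributes a loss of exactly 0, while a real token position does not:
real = tf.constant([[3, 0]])                          # one sequence: a real token, then padding
pred = tf.random.uniform((1, 2, output_vocab_size))   # random logits for both positions
print(loss_function(real[:, 0], pred[:, 0, :]))       # real token -> non-zero loss
print(loss_function(real[:, 1], pred[:, 1, :]))       # padding -> loss of 0.0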
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        # Teacher forcing: the decoder input starts with <start> and is then
        # the ground-truth target word from the previous time step
        dec_input = tf.expand_dims([output_tokenizer.word_index['<start>']] * batch_size, 1)
        for t in range(1, targ.shape[1]):
            # Pass the encoder output (enc_output) to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            # Use teacher forcing: feed the true target as the next decoder input
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
encoder=encoder,
decoder=decoder)
EPOCHS = 10
steps_per_epoch = len(input_tensor) // batch_size

for epoch in range(EPOCHS):
    start = time.time()
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(train_dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # Save a model checkpoint every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Epoch 1 Batch 0 Loss 0.0296
Epoch 1 Batch 100 Loss 0.0198
Epoch 1 Batch 200 Loss 0.0589
Epoch 1 Batch 300 Loss 0.0436
Epoch 1 Batch 400 Loss 0.0112
Epoch 1 Loss 0.0416
Time taken for 1 epoch 1458.5940191745758 sec
Epoch 2 Batch 0 Loss 0.0059
Epoch 2 Batch 100 Loss 0.0086
Epoch 2 Batch 200 Loss 0.0308
Epoch 2 Batch 300 Loss 0.0184
Epoch 2 Batch 400 Loss 0.0007
Epoch 2 Loss 0.0125
Time taken for 1 epoch 2318.0777299404144 sec
Epoch 3 Batch 0 Loss 0.0052
Epoch 3 Batch 100 Loss 0.0015
Epoch 3 Batch 200 Loss 0.0005
Epoch 3 Batch 300 Loss 0.0005
Epoch 3 Batch 400 Loss 0.0007
Epoch 3 Loss 0.0015
Time taken for 1 epoch 1415.883584022522 sec
Epoch 4 Batch 0 Loss 0.0004
Epoch 4 Batch 100 Loss 0.0002
Epoch 4 Batch 200 Loss 0.0003
Epoch 4 Batch 300 Loss 0.0004
Epoch 4 Batch 400 Loss 0.0002
Epoch 4 Loss 0.0002
Time taken for 1 epoch 1319.1745591163635 sec
Epoch 5 Batch 0 Loss 0.0001
Epoch 5 Batch 100 Loss 0.0002
Epoch 5 Batch 200 Loss 0.0001
Epoch 5 Batch 300 Loss 0.0002
Epoch 5 Batch 400 Loss 0.0001
Epoch 5 Loss 0.0001
Time taken for 1 epoch 1266.965316772461 sec
Epoch 6 Batch 0 Loss 0.0001
Epoch 6 Batch 100 Loss 0.0001
Epoch 6 Batch 200 Loss 0.0001
Epoch 6 Batch 300 Loss 0.0001
Epoch 6 Batch 400 Loss 0.0001
Epoch 6 Loss 0.0001
Time taken for 1 epoch 1919.5929939746857 sec
Epoch 7 Batch 0 Loss 0.0001
Epoch 7 Batch 100 Loss 0.0001
Epoch 7 Batch 200 Loss 0.0001
Epoch 7 Batch 300 Loss 0.0001
Epoch 7 Batch 400 Loss 0.0001
Epoch 7 Loss 0.0001
Time taken for 1 epoch 1426.8153262138367 sec
Epoch 8 Batch 0 Loss 0.0001
Epoch 8 Batch 100 Loss 0.0001
Epoch 8 Batch 200 Loss 0.0001
Epoch 8 Batch 300 Loss 0.0000
Epoch 8 Batch 400 Loss 0.0001
Epoch 8 Loss 0.0001
Time taken for 1 epoch 1444.5691719055176 sec
Epoch 9 Batch 0 Loss 0.0001
Epoch 9 Batch 100 Loss 0.0001
Epoch 9 Batch 200 Loss 0.0001
Epoch 9 Batch 300 Loss 0.0000
Epoch 9 Batch 400 Loss 0.0000
Epoch 9 Loss 0.0000
Time taken for 1 epoch 1420.9264559745789 sec
Epoch 10 Batch 0 Loss 0.0001
Epoch 10 Batch 100 Loss 0.0000
Epoch 10 Batch 200 Loss 0.0000
Epoch 10 Batch 300 Loss 0.0000
Epoch 10 Batch 400 Loss 0.0000
Epoch 10 Loss 0.0000
Time taken for 1 epoch 2865.8215408325195 sec
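The post stops after training. As a usage illustration only, here is a minimal, hedged greedy-decoding sketch built from the pieces defined above (encoder, decoder, tokenizers, units, max_length_input, max_length_output); the translate helper is an assumption, not part of the original post, and it follows the same pattern as train_step except that the decoder is fed its own previous prediction instead of the ground-truth target.
# Greedy-decoding sketch (illustrative; the helper name `translate` is an assumption)
def translate(sentence):
    sentence = preprocess_sentence(sentence)
    inputs = input_tokenizer.texts_to_sequences([sentence])
    inputs = tf.keras.preprocessing.sequence.pad_sequences(
        inputs, maxlen=max_length_input, padding='post')
    inputs = tf.convert_to_tensor(inputs)
    # Encode with a batch of 1
    hidden = tf.zeros((1, units))
    enc_output, dec_hidden = encoder(inputs, hidden)
    # Start decoding from the <start> token
    dec_input = tf.expand_dims([output_tokenizer.word_index['<start>']], 0)
    result = []
    for _ in range(max_length_output):
        predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
        predicted_id = int(tf.argmax(predictions[0]).numpy())
        word = output_tokenizer.index_word.get(predicted_id, '')
        if word == '<end>':
            break
        result.append(word)
        # Feed the prediction back in as the next decoder input
        dec_input = tf.expand_dims([predicted_id], 0)
    return ' '.join(result)

print(translate('¿Entonces qué?'))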