Attention Mechanism
![](https://img.haomeiwen.com/i8207483/07bad6130c901050.jpg)
Reference: Dive into Deep Learning (《动手学深度学习》)
Reference: Prof. Hung-yi Lee's Machine Learning course (《李宏毅老师机器学习》)
![](https://img.haomeiwen.com/i8207483/31bac42b0b910e10.jpg)
Related reference materials
![](https://img.haomeiwen.com/i8207483/bf1de1ee54ff560b.jpg)
Implementing a model with an attention mechanism using TensorFlow 2.0
In the encoder–decoder (seq2seq) section, the decoder relies on the same context variable at every time step to obtain information about the encoder's input sequence. When the encoder is a recurrent neural network, the context variable comes from the hidden state of its final time step.
Now, let us revisit the translation example from that section:
- the input is the English sequence "Hi, guys."
- the output is the Spanish sequence "Ey, chavales.".
It is not hard to see that, when generating each word of the output sequence, the decoder may only need to use (attend to) the information from a particular part of the input sequence. In other words, at each decoding time step we can assign a different amount of attention to the representations, or encodings, of the different time steps of the input sequence. This is where the attention mechanism gets its name.
Still taking a recurrent neural network as an example, the attention mechanism obtains the context variable by taking a weighted average over the encoder hidden states of all time steps. At each time step the decoder adjusts these weights, called attention weights, so it can focus on different parts of the input sequence at different time steps and encode them into the context variable of the corresponding time step. In this section we discuss how the attention mechanism works.
In the encoder–decoder (seq2seq) post we distinguished the index $t$ of the input sequence (encoder) from the index $t'$ of the output sequence (decoder). In that post, the decoder's hidden state at time step $t'$ is $\boldsymbol{s}_{t'} = g(\boldsymbol{y}_{t'-1}, \boldsymbol{c}, \boldsymbol{s}_{t'-1})$, where $\boldsymbol{y}_{t'-1}$ is the representation of the output $y_{t'-1}$ from the previous time step $t'-1$, and every time step $t'$ uses the same context variable $\boldsymbol{c}$. With the attention mechanism, however, the decoder uses a different context variable at each time step. Let $\boldsymbol{c}_{t'}$ be the context variable of the decoder at time step $t'$; the decoder's hidden state at that time step can then be rewritten as

$$\boldsymbol{s}_{t'} = g(\boldsymbol{y}_{t'-1}, \boldsymbol{c}_{t'}, \boldsymbol{s}_{t'-1}).$$

The two key questions are how to compute the context variable $\boldsymbol{c}_{t'}$ and how to use it to update the hidden state $\boldsymbol{s}_{t'}$. We describe these two points below.
Computing the context variable
We first describe the first key point: computing the context variable. The figure below illustrates how the attention mechanism computes the context variable for the decoder at time step 2. First, a function $a$ computes the input of the softmax operation from the decoder's hidden state at time step 1 and the encoder's hidden states at every time step. The softmax operation outputs a probability distribution, which is used to take a weighted average over the encoder hidden states of all time steps, giving the context variable.
![](https://img.haomeiwen.com/i8207483/e8ac29073ad66bb5.png)
Specifically, let the encoder's hidden state at time step $t$ be $\boldsymbol{h}_t$, and let the total number of encoder time steps be $T$. Then the context variable of the decoder at time step $t'$ is a weighted average over all encoder hidden states:

$$\boldsymbol{c}_{t'} = \sum_{t=1}^{T} \alpha_{t' t} \boldsymbol{h}_t,$$

where, for a given $t'$, the weights $\alpha_{t' t}$ over $t = 1, \ldots, T$ form a probability distribution. To obtain a probability distribution we can use the softmax operation:

$$\alpha_{t' t} = \frac{\exp(e_{t' t})}{\sum_{k=1}^{T} \exp(e_{t' k})}, \quad t = 1, \ldots, T.$$

We now need to define how to compute the input $e_{t' t}$ of the softmax operation above. Since $e_{t' t}$ depends on both the decoder time step $t'$ and the encoder time step $t$, it is natural to take the decoder's hidden state $\boldsymbol{s}_{t'-1}$ at time step $t'-1$ and the encoder's hidden state $\boldsymbol{h}_t$ at time step $t$ as inputs and compute $e_{t' t}$ through a function $a$:

$$e_{t' t} = a(\boldsymbol{s}_{t'-1}, \boldsymbol{h}_t).$$

There are several choices for the function $a$. If the two input vectors have the same length, a simple choice is their inner product $a(\boldsymbol{s}, \boldsymbol{h}) = \boldsymbol{s}^\top \boldsymbol{h}$. The paper that first proposed the attention mechanism instead concatenates the inputs and passes them through a multilayer perceptron with a single hidden layer:

$$a(\boldsymbol{s}, \boldsymbol{h}) = \boldsymbol{v}^\top \tanh(\boldsymbol{W}_s \boldsymbol{s} + \boldsymbol{W}_h \boldsymbol{h}),$$

where $\boldsymbol{v}$, $\boldsymbol{W}_s$ and $\boldsymbol{W}_h$ are all learnable model parameters.
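To make these formulas concrete, here is a minimal sketch for a single decoder time step. The tensor sizes and variable names below are made up for illustration and are independent of the model built later in this post; it simply computes the scores $e_{t' t}$ with the single-hidden-layer MLP, turns them into attention weights $\alpha_{t' t}$ with softmax, and takes the weighted average to get the context variable $\boldsymbol{c}_{t'}$.
import tensorflow as tf

T, enc_units, dec_units, attn_units = 5, 8, 8, 4   # illustrative sizes

h_enc = tf.random.normal((T, enc_units))     # encoder hidden states h_1, ..., h_T
s_prev = tf.random.normal((1, dec_units))    # decoder hidden state s_{t'-1}

# Learnable parameters W_h, W_s, v of the single-hidden-layer MLP score function
W_h = tf.Variable(tf.random.normal((enc_units, attn_units)))
W_s = tf.Variable(tf.random.normal((dec_units, attn_units)))
v = tf.Variable(tf.random.normal((attn_units, 1)))

# e_{t't} = v^T tanh(W_s s_{t'-1} + W_h h_t), computed for all t at once -> shape (T, 1)
scores = tf.matmul(tf.tanh(tf.matmul(h_enc, W_h) + tf.matmul(s_prev, W_s)), v)

# alpha_{t't}: softmax of the scores over the encoder time steps t
alpha = tf.nn.softmax(scores, axis=0)        # shape (T, 1), sums to 1

# c_{t'} = sum_t alpha_{t't} h_t  (weighted average of the encoder hidden states)
context = tf.reduce_sum(alpha * h_enc, axis=0)   # shape (enc_units,)
print(alpha.numpy().ravel(), context.shape)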
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import unicodedata
import re
import numpy as np
import os
import io
import time
path_to_zip = tf.keras.utils.get_file(
'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
extract=True)
path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
en_spa_file_path = "./spa-eng/spa.txt"
![](https://img.haomeiwen.com/i8207483/0107a913b093a566.jpg)
Code
Data preprocessing
Define the model
- Encoder
- Attention mechanism
- Decoder
# Convert a Unicode string to ASCII.
# The ASCII representation is also more compact than full Unicode.
def unicode_to_ascii(s):
    # Strip accents (e.g. from Spanish) by removing combining characters
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
en_sentence = 'Then what?'
sp_senetence = '¿Entonces qué?'
print(unicode_to_ascii(en_sentence))
print(unicode_to_ascii(sp_senetence))
Then what?
¿Entonces que?
After the conversion, the English sentence is unchanged, while the accent on the Spanish "é" is gone. Next we separate the punctuation from the words.
def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    # Insert a space between a word and the punctuation that follows it
    # e.g.: "he is a boy." => "he is a boy ."
    # Reference: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # Collapse runs of spaces (the character class also catches stray double quotes) into one space
    w = re.sub(r'[" "]+', " ", w)
    # Replace everything except letters and the punctuation (".", "?", "!", ",", "¿") with a space
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    # Strip leading and trailing whitespace
    w = w.rstrip().strip()
    # Add start and end tokens to the sentence
    # so the model knows when to start and stop predicting
    w = '<start> ' + w + ' <end>'
    return w
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_senetence))
<start> then what ? <end>
<start> ¿ entonces que ? <end>
Reading the file
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
    # Each line is a tab-separated English -> Spanish pair
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]
    return zip(*word_pairs)
a = [(1,2),(3,4),(5,6)]
c,d = zip(*a)
print(c,d)
(1, 3, 5) (2, 4, 6)
en, sp = create_dataset(en_spa_file_path, None)
print(en[-1])
print(sp[-1])
<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>
def tokenizer(lang):
    # num_words=None would cap the vocabulary size; filters is a blacklist of characters to drop
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', split=' ')
    # Build the vocabulary from word frequencies
    lang_tokenizer.fit_on_texts(lang)
    # Convert the corpus from text to sequences of word ids
    tensor = lang_tokenizer.texts_to_sequences(lang)
    # Pad the sequences to a uniform length
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                           padding='post')
    return tensor, lang_tokenizer
input_tensor,input_tokenizer = tokenizer(sp[:30000])
output_tensor,output_tokenizer = tokenizer(en[:30000])
Define a helper that returns the maximum sentence length in a tensor.
def max_length(tensor):
    return max(len(t) for t in tensor)
max_length_input = max_length(input_tensor)
max_length_output = max_length(output_tensor)
print(max_length_input,max_length_output)
16 11
Splitting into training and evaluation sets
num_examples = 30000
input_train, input_eval, output_train, output_eval = train_test_split(
input_tensor,output_tensor,test_size=0.2)
len(input_train),len(input_eval),len(output_train),len(output_eval)
(24000, 6000, 24000, 6000)
def convert(example, tokenizer):
    for t in example:
        if t != 0:
            print("%d ----> %s" % (t, tokenizer.index_word[t]))
print ("Input Language; index to word mapping")
convert(input_train[0], input_tokenizer)
print ()
print ("Target Language; index to word mapping")
convert(output_train[0], output_tokenizer)
Input Language; index to word mapping
1 ----> <start>
69 ----> todos
6434 ----> comtemplaron
3 ----> .
2 ----> <end>
Target Language; index to word mapping
1 ----> <start>
28 ----> they
67 ----> all
700 ----> watched
3 ----> .
2 ----> <end>
def make_dataset(input_tensor, output_tensor, batch_size, epochs, shuffle):
    dataset = tf.data.Dataset.from_tensor_slices((input_tensor, output_tensor))
    if shuffle:
        dataset = dataset.shuffle(30000)
    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)
    return dataset
batch_size = 64
epochs = 20
train_dataset = make_dataset(input_train,output_train,
batch_size,epochs,True)
eval_dataset = make_dataset(input_eval,output_eval,batch_size,1,False)
for x, y in train_dataset.take(1):
    print(x.shape)
    print(y.shape)
    print(x)
    print(y)
(64, 16)
(64, 11)
tf.Tensor(
[[ 1 148 13 ... 0 0 0]
[ 1 9 92 ... 0 0 0]
[ 1 78 1104 ... 0 0 0]
...
[ 1 29 259 ... 0 0 0]
[ 1 28 55 ... 0 0 0]
[ 1 9 172 ... 0 0 0]], shape=(64, 16), dtype=int32)
tf.Tensor(
[[ 1 4 133 4689 320 3 2 0 0 0 0]
[ 1 13 104 8 585 3 2 0 0 0 0]
[ 1 157 1224 3 2 0 0 0 0 0 0]
[ 1 13 546 8 217 3 2 0 0 0 0]
[ 1 28 47 15 73 81 3 2 0 0 0]
[ 1 2519 13 980 3 2 0 0 0 0 0]
[ 1 19 8 34 601 3 2 0 0 0 0]
[ 1 30 12 6 29 9 413 7 2 0 0]
[ 1 14 38 64 197 10 3 2 0 0 0]
[ 1 4 135 141 54 41 3 2 0 0 0]
[ 1 5 956 3 2 0 0 0 0 0 0]
[ 1 135 19 3 2 0 0 0 0 0 0]
[ 1 46 17 1050 20 3 2 0 0 0 0]
[ 1 176 12 442 436 7 2 0 0 0 0]
[ 1 569 50 49 5 3 2 0 0 0 0]
[ 1 4 693 10 11 252 3 2 0 0 0]
[ 1 10 11 103 868 3 2 0 0 0 0]
[ 1 14 1039 81 655 3 2 0 0 0 0]
[ 1 195 118 109 3 2 0 0 0 0 0]
[ 1 16 23 74 59 424 3 2 0 0 0]
[ 1 5 105 61 940 3 2 0 0 0 0]
[ 1 10 11 34 517 3 2 0 0 0 0]
[ 1 46 17 122 5 3 2 0 0 0 0]
[ 1 5 1685 3 2 0 0 0 0 0 0]
[ 1 1290 5 26 107 3 2 0 0 0 0]
[ 1 102 13 201 185 3 2 0 0 0 0]
[ 1 5 179 348 3 2 0 0 0 0 0]
[ 1 140 329 24 963 3 2 0 0 0 0]
[ 1 5 895 224 3 2 0 0 0 0 0]
[ 1 75 27 53 7 2 0 0 0 0 0]
[ 1 96 4 873 13 533 7 2 0 0 0]
[ 1 4 47 9 770 3 2 0 0 0 0]
[ 1 4 127 380 9 153 3 2 0 0 0]
[ 1 62 4 1990 7 2 0 0 0 0 0]
[ 1 4 18 1964 3 2 0 0 0 0 0]
[ 1 40 9 159 3 2 0 0 0 0 0]
[ 1 4 77 6 266 3 2 0 0 0 0]
[ 1 4 114 6 1352 3 2 0 0 0 0]
[ 1 14 46 81 36 3 2 0 0 0 0]
[ 1 4 117 140 4699 3 2 0 0 0 0]
[ 1 10 8 78 220 3 2 0 0 0 0]
[ 1 32 11 13 1451 7 2 0 0 0 0]
[ 1 4 311 9 698 3 2 0 0 0 0]
[ 1 4 328 41 203 803 3 2 0 0 0]
[ 1 4 18 34 9 443 159 3 2 0 0]
[ 1 19 169 339 494 3 2 0 0 0 0]
[ 1 30 12 456 31 837 3 2 0 0 0]
[ 1 20 11 2570 3 2 0 0 0 0 0]
[ 1 14 1299 61 471 3 2 0 0 0 0]
[ 1 10 38 64 824 3 2 0 0 0 0]
[ 1 20 11 308 3 2 0 0 0 0 0]
[ 1 4 65 210 175 66 3 2 0 0 0]
[ 1 195 118 545 3 2 0 0 0 0 0]
[ 1 25 4 72 6 131 7 2 0 0 0]
[ 1 4 18 34 342 119 10 3 2 0 0]
[ 1 4 42 10 9 817 1265 3 2 0 0]
[ 1 71 8 45 7 2 0 0 0 0 0]
[ 1 28 92 184 3 2 0 0 0 0 0]
[ 1 4 75 40 91 284 3 2 0 0 0]
[ 1 82 8 31 1087 315 7 2 0 0 0]
[ 1 4 25 103 256 6 3 2 0 0 0]
[ 1 28 23 269 59 41 3 2 0 0 0]
[ 1 20 26 48 248 3 2 0 0 0 0]
[ 1 14 175 14 26 626 3 2 0 0 0]], shape=(64, 11), dtype=int32)
Model definition
embedding_units = 256
units = 1024
input_vocab_size = len(input_tokenizer.word_index) + 1
output_vocab_size = len(output_tokenizer.word_index) + 1
Define the encoder
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, encoding_units, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.encoding_units = encoding_units
        # Embedding layer: word ids -> dense vectors
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.encoding_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    # Create an all-zero initial hidden state
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size, self.encoding_units))
encoder = Encoder(input_vocab_size, embedding_units, units, batch_size)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(x,sample_hidden)
print("shape of sample output: ",sample_output.shape)
print("shape of sample hidden: ",sample_hidden.shape)
shape of sample output: (64, 16, 1024)
shape of sample hidden: (64, 1024)
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        # Three dense layers: W1 and W2 project the encoder outputs and the
        # decoder hidden state; V maps the result to a scalar score
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query is the decoder hidden state, values are the encoder outputs
        # query shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        # The extra time axis lets the addition below broadcast when computing the score
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # score shape == (batch_size, max_length, 1)
        # The last axis is 1 because we apply self.V;
        # before self.V the tensor shape is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # context_vector shape after the sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print("Attention result shape: (batch size, units) {}"
.format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}"
.format(attention_weights.shape))
Attention result shape: (batch size, units) (64, 1024)
Attention weights shape: (batch_size, sequence_length, 1) (64, 16, 1)
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        # Embedding layer for the target language
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # Attention layer
        self.attention = BahdanauAttention(self.dec_units)

    # hidden is the decoder hidden state from the previous step
    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after the embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # Pass the concatenated vector to the GRU
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # x shape == (batch_size, vocab_size)
        x = self.fc(output)
        return x, state, attention_weights
decoder = Decoder(output_vocab_size, embedding_units, units, batch_size)
sample_decoder_output, decoder_hidden, decoder_aw = decoder(tf.random.uniform((64, 1)),
sample_hidden, sample_output)
print ('Decoder output shape: (batch_size, vocab size) {}'
.format(sample_decoder_output.shape))
print('Decoder decoder_hidden shape: (batch_size, units) {}'
      .format(decoder_hidden.shape))
print('Decoder decoder_aw shape: (batch_size, sequence_length, 1) {}'
      .format(decoder_aw.shape))
Decoder output shape: (batch_size, vocab size) (64, 4935)
Decoder decoder_hidden shape: (batch_size, units) (64, 1024)
Decoder decoder_aw shape: (batch_size, sequence_length, 1) (64, 16, 1)
optimizer = tf.keras.optimizers.Adam()
# The targets are word ids, so we use sparse categorical cross-entropy on the logits
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# Define the loss function
def loss_function(real, pred):
    # Mask out the padded positions (id 0) so their loss contribution becomes 0
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
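As a quick, illustrative sanity check of the masking (the tensors below are made up and are not part of the original post), a padded position with target id 0 contributes a loss of exactly 0, while a real token position does not:
real = tf.constant([[3, 0]])                          # one sequence: a real token, then padding
pred = tf.random.uniform((1, 2, output_vocab_size))   # random logits for both positions
print(loss_function(real[:, 0], pred[:, 0, :]))       # real token -> non-zero loss
print(loss_function(real[:, 1], pred[:, 1, :]))       # padding -> loss of 0.0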
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        # Teacher forcing: the decoder input starts with <start> and is then
        # the ground-truth target word from the previous time step
        dec_input = tf.expand_dims([output_tokenizer.word_index['<start>']] * batch_size, 1)
        for t in range(1, targ.shape[1]):
            # Pass the encoder output (enc_output) to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            # Use teacher forcing: feed the true target as the next decoder input
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
encoder=encoder,
decoder=decoder)
EPOCHS = 10
steps_per_epoch = len(input_tensor) // batch_size

for epoch in range(EPOCHS):
    start = time.time()
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(train_dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # Save a model checkpoint every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Epoch 1 Batch 0 Loss 0.0296
Epoch 1 Batch 100 Loss 0.0198
Epoch 1 Batch 200 Loss 0.0589
Epoch 1 Batch 300 Loss 0.0436
Epoch 1 Batch 400 Loss 0.0112
Epoch 1 Loss 0.0416
Time taken for 1 epoch 1458.5940191745758 sec
Epoch 2 Batch 0 Loss 0.0059
Epoch 2 Batch 100 Loss 0.0086
Epoch 2 Batch 200 Loss 0.0308
Epoch 2 Batch 300 Loss 0.0184
Epoch 2 Batch 400 Loss 0.0007
Epoch 2 Loss 0.0125
Time taken for 1 epoch 2318.0777299404144 sec
Epoch 3 Batch 0 Loss 0.0052
Epoch 3 Batch 100 Loss 0.0015
Epoch 3 Batch 200 Loss 0.0005
Epoch 3 Batch 300 Loss 0.0005
Epoch 3 Batch 400 Loss 0.0007
Epoch 3 Loss 0.0015
Time taken for 1 epoch 1415.883584022522 sec
Epoch 4 Batch 0 Loss 0.0004
Epoch 4 Batch 100 Loss 0.0002
Epoch 4 Batch 200 Loss 0.0003
Epoch 4 Batch 300 Loss 0.0004
Epoch 4 Batch 400 Loss 0.0002
Epoch 4 Loss 0.0002
Time taken for 1 epoch 1319.1745591163635 sec
Epoch 5 Batch 0 Loss 0.0001
Epoch 5 Batch 100 Loss 0.0002
Epoch 5 Batch 200 Loss 0.0001
Epoch 5 Batch 300 Loss 0.0002
Epoch 5 Batch 400 Loss 0.0001
Epoch 5 Loss 0.0001
Time taken for 1 epoch 1266.965316772461 sec
Epoch 6 Batch 0 Loss 0.0001
Epoch 6 Batch 100 Loss 0.0001
Epoch 6 Batch 200 Loss 0.0001
Epoch 6 Batch 300 Loss 0.0001
Epoch 6 Batch 400 Loss 0.0001
Epoch 6 Loss 0.0001
Time taken for 1 epoch 1919.5929939746857 sec
Epoch 7 Batch 0 Loss 0.0001
Epoch 7 Batch 100 Loss 0.0001
Epoch 7 Batch 200 Loss 0.0001
Epoch 7 Batch 300 Loss 0.0001
Epoch 7 Batch 400 Loss 0.0001
Epoch 7 Loss 0.0001
Time taken for 1 epoch 1426.8153262138367 sec
Epoch 8 Batch 0 Loss 0.0001
Epoch 8 Batch 100 Loss 0.0001
Epoch 8 Batch 200 Loss 0.0001
Epoch 8 Batch 300 Loss 0.0000
Epoch 8 Batch 400 Loss 0.0001
Epoch 8 Loss 0.0001
Time taken for 1 epoch 1444.5691719055176 sec
Epoch 9 Batch 0 Loss 0.0001
Epoch 9 Batch 100 Loss 0.0001
Epoch 9 Batch 200 Loss 0.0001
Epoch 9 Batch 300 Loss 0.0000
Epoch 9 Batch 400 Loss 0.0000
Epoch 9 Loss 0.0000
Time taken for 1 epoch 1420.9264559745789 sec
Epoch 10 Batch 0 Loss 0.0001
Epoch 10 Batch 100 Loss 0.0000
Epoch 10 Batch 200 Loss 0.0000
Epoch 10 Batch 300 Loss 0.0000
Epoch 10 Batch 400 Loss 0.0000
Epoch 10 Loss 0.0000
Time taken for 1 epoch 2865.8215408325195 sec
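The post stops after training. As a usage illustration only, here is a minimal, hedged greedy-decoding sketch built from the pieces defined above (encoder, decoder, tokenizers, units, max_length_input, max_length_output); the translate helper is an assumption, not part of the original post, and it follows the same pattern as train_step except that the decoder is fed its own previous prediction instead of the ground-truth target.
# Greedy-decoding sketch (illustrative; the helper name `translate` is an assumption)
def translate(sentence):
    sentence = preprocess_sentence(sentence)
    inputs = input_tokenizer.texts_to_sequences([sentence])
    inputs = tf.keras.preprocessing.sequence.pad_sequences(
        inputs, maxlen=max_length_input, padding='post')
    inputs = tf.convert_to_tensor(inputs)
    # Encode with a batch of 1
    hidden = tf.zeros((1, units))
    enc_output, dec_hidden = encoder(inputs, hidden)
    # Start decoding from the <start> token
    dec_input = tf.expand_dims([output_tokenizer.word_index['<start>']], 0)
    result = []
    for _ in range(max_length_output):
        predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
        predicted_id = int(tf.argmax(predictions[0]).numpy())
        word = output_tokenizer.index_word.get(predicted_id, '')
        if word == '<end>':
            break
        result.append(word)
        # Feed the prediction back in as the next decoder input
        dec_input = tf.expand_dims([predicted_id], 0)
    return ' '.join(result)

print(translate('¿Entonces qué?'))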