Attention-Based Machine Translation (Part 1)

Author: 北有深巷 | Published: 2018-10-20 14:29

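This article is a translation of Google's open-source example code (the official source).

My abilities are limited, so corrections are welcome.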

Neural Machine Translation with Attention

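This article uses the sequence-to-sequence (seq2seq) approach, built mainly on tf.keras and eager execution. It is an advanced example that assumes some familiarity with seq2seq models.

After training, the model can translate Spanish into English. For example, given the input "¿todavia estan en casa?", it returns the English sentence "are you still at home?"

This is a toy example, but the attention plot it generates is perhaps the more interesting part. The figure below shows which parts of the input sentence the model attends to during translation.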

[Figure: 1.png, attention plot for the input sentence]

This example takes about 10 minutes to run on a single P100 GPU.

from __future__ import absolute_import, division, print_function

# The source says TensorFlow >= 1.10 is enough, but it would not run for me; it seems TensorFlow has to be updated to the latest version, 1.12.
import tensorflow as tf

tf.enable_eager_execution()

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import time

print(tf.__version__)

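Download and preprocess the data

We will use the data provided at the link. Each line of the dataset holds a sentence pair that looks roughly like this: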

May I borrow this book? ¿Puedo tomar prestado este libro?

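The link offers datasets for many languages, but we only use the English-Spanish one. For convenience, the code below downloads the data from Google Cloud, but you can also download it from the link above and simply change the path the data is read from.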

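Next, we need to preprocess the data in the following steps:
1. Add a <start> token before each sentence and an <end> token after it;
2. Remove special characters, keeping only letters and the punctuation marks ? . ! , ¿;
3. Build a word index and a reverse word index (dictionaries mapping word → id and id → word);
4. Pad each sentence so that all sentences have the same length.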
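
Before reading the full code, it may help to see the four steps on two toy sentences. The following is a minimal pure-Python sketch, not the article's implementation: the helper names `clean`, `build_index` and `pad` are hypothetical, the cleaning is slightly simplified, and padding is done with plain lists instead of `pad_sequences`.

```python
import re
import unicodedata


def clean(sentence):
    # Steps 1-2: strip accents, lowercase, put spaces around punctuation,
    # drop every other character, then wrap in <start> / <end> tokens.
    s = ''.join(c for c in unicodedata.normalize('NFD', sentence.lower().strip())
                if unicodedata.category(c) != 'Mn')
    s = re.sub(r"([?.!,¿])", r" \1 ", s)
    s = re.sub(r"[^a-z?.!,¿]+", " ", s).strip()
    return '<start> ' + s + ' <end>'


def build_index(sentences):
    # Step 3: word -> id (0 is reserved for padding) and the reverse id -> word.
    vocab = sorted({w for s in sentences for w in s.split(' ')})
    word2idx = {'<pad>': 0}
    word2idx.update({w: i + 1 for i, w in enumerate(vocab)})
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word


def pad(sequences):
    # Step 4: right-pad every id sequence with 0 up to the longest length.
    longest = max(len(seq) for seq in sequences)
    return [seq + [0] * (longest - len(seq)) for seq in sequences]


toy = ["¿Puedo tomar prestado este libro?", "¿Todavía están en casa?"]
cleaned = [clean(s) for s in toy]
word2idx, idx2word = build_index(cleaned)
padded = pad([[word2idx[w] for w in s.split(' ')] for s in cleaned])

print(cleaned[0])  # <start> ¿ puedo tomar prestado este libro ? <end>
print(padded)      # two id lists of equal length; the shorter one ends in 0
```

The first sentence has 9 tokens and the second has 8, so the second id list receives a single trailing 0, which is the id reserved for `<pad>`.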

# Download the dataset
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://download.tensorflow.org/data/spa-eng.zip', 
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
print(path_to_file)

# Convert Unicode to ASCII
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    
    # Add a space between a word and the punctuation that follows it
    # e.g.: "he is a boy." => "he is a boy ."
    # For the reasoning, see: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    
    # Replace everything with a space except (a-z, A-Z, ".", "?", "!", ",", "¿")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    
    w = w.strip()
    
    # Add a <start> token before each sentence and an <end> token after it,
    # so the model knows when to start and stop predicting
    w = '<start> ' + w + ' <end>'
    return w

# 1. Remove accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
    # Read the file inside a with-block so the handle is closed afterwards
    with open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')
    
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]
    
    return word_pairs

# This class builds a word-to-index mapping (e.g., "dad" -> 5) and the
# reverse index-to-word mapping (e.g., 5 -> "dad")
class LanguageIndex():
    def __init__(self, lang):
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()

        self.create_index()
    
    def create_index(self):
        for phrase in self.lang:
            self.vocab.update(phrase.split(' '))
        self.vocab = sorted(self.vocab)
        self.word2idx['<pad>'] = 0
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1

        for word, index in self.word2idx.items():
            self.idx2word[index] = word

def max_length(tensor):
    return max(len(t) for t in tensor)


def load_dataset(path, num_examples):
    # Create the cleaned (English, Spanish) sentence pairs
    pairs = create_dataset(path, num_examples)

    # Build the word index for each language
    inp_lang = LanguageIndex(sp for en, sp in pairs)
    targ_lang = LanguageIndex(en for en, sp in pairs)
    
    # Vectorize the input and target sentences (map each word to its index)
    
    # Spanish sentences
    input_tensor = [[inp_lang.word2idx[s] for s in sp.split(' ')] for en, sp in pairs]
    
    # English sentences
    target_tensor = [[targ_lang.word2idx[s] for s in en.split(' ')] for en, sp in pairs]
    
    # Compute the maximum length of the input and target tensors;
    # here we set them to the longest sentence in the dataset
    max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)
    
    # Pad every tensor so that all sentences have the same length
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor, 
                                                                 maxlen=max_length_inp,
                                                                 padding='post')
    
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, 
                                                                  maxlen=max_length_tar, 
                                                                  padding='post')
    
    return input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_tar

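Limit the size of the training set (optional)

The full dataset has more than 100,000 sentences, so training on all of it takes a fairly long time. To train faster, we limit the size of the training data: the code below reduces the dataset to 30,000 sentences. (The quality of the model will, of course, suffer.)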

# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_targ = load_dataset(path_to_file, num_examples)

# Split into training and validation sets using an 80/20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Check the sizes of the four sets
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)
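
To make the 80/20 split concrete, here is a small pure-Python sketch of the same idea. It is only an illustration under stated assumptions: `split_pairs` is a hypothetical helper, not scikit-learn's `train_test_split`, and the dummy integers stand in for the 30,000 padded sentence pairs.

```python
import random


def split_pairs(inputs, targets, test_size=0.2, seed=0):
    """Shuffle paired sequences together and hold out a fraction for validation."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)  # one shared shuffle keeps pairs aligned
    n_val = int(round(len(indices) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def take(data, idx):
        return [data[i] for i in idx]

    # Same return order as the call above: train/val inputs, then train/val targets.
    return (take(inputs, train_idx), take(inputs, val_idx),
            take(targets, train_idx), take(targets, val_idx))


# Dummy stand-ins for the 30,000 sentence pairs: pair i is (i, -i).
inputs = list(range(30000))
targets = [-i for i in inputs]
in_train, in_val, tar_train, tar_val = split_pairs(inputs, targets)
print(len(in_train), len(tar_train), len(in_val), len(tar_val))  # 24000 24000 6000 6000
```

With 30,000 examples and `test_size=0.2`, 24,000 pairs go to training and 6,000 to validation, and every input stays matched with its own target because both lists are indexed by the same shuffled positions.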

Build a tf.data dataset (for what tf.data is, see the tf.data documentation)

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word2idx)
vocab_tar_size = len(targ_lang.word2idx)

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
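
The effect of shuffling with a buffer as large as the dataset and then batching with `drop_remainder=True` can be pictured in plain Python. This is a rough sketch, not how tf.data is implemented: `shuffled_batches` is a hypothetical helper, it covers a single pass over the data, and the dummy tuples stand in for the 24,000 training pairs.

```python
import random


def shuffled_batches(pairs, batch_size, seed=0):
    """Yield full batches from a shuffled copy of `pairs`, dropping the remainder."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_batch = len(shuffled) // batch_size  # same arithmetic as N_BATCH above
    for b in range(n_batch):
        yield shuffled[b * batch_size:(b + 1) * batch_size]


BATCH_SIZE = 64
pairs = [(i, -i) for i in range(24000)]  # dummy stand-ins for the training pairs
batches = list(shuffled_batches(pairs, BATCH_SIZE))
print(len(batches), len(batches[0]))  # 375 64

# With 100 examples, only one full batch of 64 fits; the other 36 are dropped.
print(len(list(shuffled_batches(range(100), BATCH_SIZE))))  # 1
```

Dropping the incomplete last batch keeps every batch at exactly `BATCH_SIZE` examples, which is convenient when later code assumes a fixed batch dimension.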

To be continued.
