15 Seq2Seq in Practice: Language Translation (2)

Author: 弟弟们的哥哥 | Published: 2019-10-22 17:53


1. Load the data

# English source data
with open("data/small_vocab_en", "r", encoding="utf-8") as f:
    source_text = f.read()

# French target data
with open("data/small_vocab_fr", "r", encoding="utf-8") as f:
    target_text = f.read()

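2. Inspect the data

# Word counts per sentence in the English corpus
source_sentences = source_text.split('\n')
source_word_counts = [len(sentence.split()) for sentence in source_sentences]
# Word counts per sentence in the French corpus
target_sentences = target_text.split('\n')
target_word_counts = [len(sentence.split()) for sentence in target_sentences]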
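
The word counts above are computed but never displayed. Below is a minimal sketch of how they might be summarized; the helper name `length_stats` and the two-sentence toy corpus are hypothetical and not part of the original post.

```python
def length_stats(text):
    """Return basic sentence-length statistics for a newline-separated corpus."""
    sentences = text.split("\n")
    word_counts = [len(sentence.split()) for sentence in sentences]
    return {
        "sentences": len(sentences),
        "unique_words": len(set(text.lower().split())),
        "avg_words": sum(word_counts) / len(word_counts),
        "max_words": max(word_counts),
    }

# Toy corpus standing in for source_text (illustrative only).
sample_text = "new jersey is sometimes quiet\nthe united states is usually chilly"
stats = length_stats(sample_text)
print(stats)
```

The maximum sentence length is the number worth checking before fixing Tx = 20 and Ty = 25 in section 3.4, since longer sentences get truncated there.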

3. Preprocess the data

3.1 Build the vocabularies

# Build the English vocabulary
source_vocab = list(set(source_text.lower().split()))
# Build the French vocabulary
target_vocab = list(set(target_text.lower().split()))

3.2 Add special tokens

# Special tokens
SOURCE_CODES = ['<PAD>', '<UNK>']
TARGET_CODES = ['<PAD>', '<EOS>', '<UNK>', '<GO>']

3.3 Mapping tables between words and ids

# Mapping tables for the English corpus
source_vocab_to_int = {word: idx for idx, word in enumerate(SOURCE_CODES + source_vocab)}
source_int_to_vocab = {idx: word for idx, word in enumerate(SOURCE_CODES + source_vocab)}

# Mapping tables for the French corpus
target_vocab_to_int = {word: idx for idx, word in enumerate(TARGET_CODES + target_vocab)}
target_int_to_vocab = {idx: word for idx, word in enumerate(TARGET_CODES + target_vocab)}
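
To see the two tables working together, here is a small round-trip sketch on a toy vocabulary. The names `toy_vocab`, `vocab_to_int` and `int_to_vocab` are made up for illustration. The vocabulary is sorted here so that the ids are reproducible; the original code uses `list(set(...))`, whose order can differ from one Python run to the next.

```python
# Toy vocabulary for illustration; the real one comes from set(source_text.lower().split()).
SOURCE_CODES = ['<PAD>', '<UNK>']
toy_vocab = sorted({"new", "jersey", "is", "quiet"})  # sorted so ids are reproducible

vocab_to_int = {word: idx for idx, word in enumerate(SOURCE_CODES + toy_vocab)}
int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}

# "loud" is not in the vocabulary, so it falls back to <UNK>.
ids = [vocab_to_int.get(w, vocab_to_int['<UNK>']) for w in "new jersey is loud".split()]
words = [int_to_vocab[i] for i in ids]
print(ids)    # [4, 3, 2, 1]
print(words)  # ['new', 'jersey', 'is', '<UNK>']
```

If a trained model is to be reloaded later, the mapping probably needs to be saved alongside it (or the vocabulary sorted), otherwise the ids may no longer match.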

3.4 Convert text to int

def text_to_int(sentence, map_dict, max_length, is_target=False):
    """Convert a sentence to a list of ids of length max_length, padding with <PAD>."""
    text_to_idx = []
    # ids of the special tokens
    unk_idx = map_dict.get("<UNK>")
    pad_idx = map_dict.get("<PAD>")
    eos_idx = map_dict.get("<EOS>")

    # Source text: map each word to its id, falling back to <UNK>
    if not is_target:
        for word in sentence.lower().split():
            text_to_idx.append(map_dict.get(word, unk_idx))

    # Target text: same mapping, then append <EOS> at the end
    else:
        for word in sentence.lower().split():
            text_to_idx.append(map_dict.get(word, unk_idx))
        text_to_idx.append(eos_idx)

    # Truncate if the sequence is too long
    if len(text_to_idx) > max_length:
        return text_to_idx[:max_length]
    # Otherwise pad with <PAD> up to max_length
    else:
        text_to_idx = text_to_idx + [pad_idx] * (max_length - len(text_to_idx))
        return text_to_idx
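
One detail of the truncation branch above: for a target sentence longer than max_length, the slice also cuts off the <EOS> that was just appended. The following sketches a variant that reserves the last slot for <EOS>. The function name `text_to_int_keep_eos` and the toy mapping are hypothetical, and this is not the behavior of the original code.

```python
def text_to_int_keep_eos(sentence, map_dict, max_length=20, is_target=False):
    """Like text_to_int, but a truncated target sequence still ends with <EOS>."""
    unk_idx = map_dict.get("<UNK>")
    pad_idx = map_dict.get("<PAD>")
    eos_idx = map_dict.get("<EOS>")

    ids = [map_dict.get(word, unk_idx) for word in sentence.lower().split()]
    if is_target:
        # Leave room for <EOS> before appending it.
        ids = ids[:max_length - 1] + [eos_idx]
    else:
        ids = ids[:max_length]
    # Pad with <PAD> up to max_length (adds nothing if already full).
    return ids + [pad_idx] * (max_length - len(ids))

# Toy mapping following the TARGET_CODES order (illustrative only).
toy_map = {"<PAD>": 0, "<EOS>": 1, "<UNK>": 2, "<GO>": 3, "a": 4, "b": 5, "c": 6}
short = text_to_int_keep_eos("a b", toy_map, max_length=5, is_target=True)
long = text_to_int_keep_eos("a b c a b c", toy_map, max_length=5, is_target=True)
print(short)  # [4, 5, 1, 0, 0]
print(long)   # [4, 5, 6, 4, 1]
```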
import tqdm

# Convert the source sentences, Tx = 20
source_text_to_int = []
for sentence in tqdm.tqdm(source_text.split("\n")):
    source_text_to_int.append(text_to_int(sentence, source_vocab_to_int, 20,
                                          is_target=False))

# Convert the target sentences, Ty = 25
target_text_to_int = []
for sentence in tqdm.tqdm(target_text.split("\n")):
    target_text_to_int.append(text_to_int(sentence, target_vocab_to_int, 25,
                                          is_target=True))

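import numpy as np
# The original post does not show its imports; this Keras import is an assumption.
from keras.utils import to_categorical

X = np.array(source_text_to_int)
Y = np.array(target_text_to_int)

# One-hot encode X and Y
Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(source_vocab_to_int)), X)))
Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(target_vocab_to_int)), Y)))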
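
For a quick check of the resulting shapes, the one-hot step can be reproduced with plain NumPy. This is a sketch on toy data that uses `np.eye` indexing as a stand-in for `to_categorical`; the sizes and the names `X_toy` and `Xoh_toy` are made up.

```python
import numpy as np

vocab_size = 6                      # stand-in for len(source_vocab_to_int)
X_toy = np.array([[4, 3, 2, 0],
                  [5, 1, 0, 0]])    # 2 sentences, Tx = 4

# Row i of the identity matrix is the one-hot vector for id i.
Xoh_toy = np.eye(vocab_size, dtype=np.float32)[X_toy]

print(Xoh_toy.shape)                # (2, 4, 6)
print(Xoh_toy.argmax(axis=-1))      # recovers X_toy
```

The encoded array has shape (number of sentences, sequence length, vocabulary size), so its memory footprint grows with the vocabulary size.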

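4. Build the model

As in the previous post, the encoder embeds the input into dense vectors and feeds them to an LSTM, which learns a fixed-length vector S; S is then passed to the decoder, which generates the new sequence. The model therefore consists of four parts:

Model inputs: model_inputs
Encoder: encoder_layer
Decoder: input processing decoder_layer_inputs / training decoder_layer_train / inference decoder_layer_inference
Seq2seq model
The code can be reused from the previous post.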
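
As a reminder of what the decoder input step typically does, here is a NumPy sketch of the usual teacher forcing preparation: drop the last token of each target sequence and prepend <GO>. Whether decoder_layer_inputs in the previous post does exactly this is an assumption here; the function name `make_decoder_inputs` and the toy ids are hypothetical.

```python
import numpy as np

def make_decoder_inputs(targets, go_idx):
    """Shift target ids right by one step and put <GO> in the first position."""
    go_column = np.full((targets.shape[0], 1), go_idx, dtype=targets.dtype)
    return np.concatenate([go_column, targets[:, :-1]], axis=1)

# ids follow the TARGET_CODES order: <PAD>=0, <EOS>=1, <UNK>=2, <GO>=3
Y_toy = np.array([[4, 5, 1, 0],
                  [6, 7, 8, 1]])
decoder_inputs = make_decoder_inputs(Y_toy, go_idx=3)
print(decoder_inputs)
# [[3 4 5 1]
#  [3 6 7 8]]
```

During training the decoder then sees the correct previous word at each step, while the unshifted targets serve as the labels.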

5. Model prediction and hyperparameter tuning

epochs = 10
batch_size = 128
rnn_size = 128
rnn_num_layers = 1
encoder_embedding_size = 100
decoder_embedding_size = 100
learning_rate = 0.001
# Print results every 50 training steps
display_step = 50

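Training runs for 10 epochs with a single LSTM layer. The encoder and decoder word embeddings are both 100-dimensional, and results are printed every 50 training steps. The corpus is small, only about 130,000 entries, which is not much for a translation model that depends so heavily on data. Because the dataset is limited, it was not split into a training set and a test set.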
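
To make the batch_size and display_step settings concrete, here is a sketch of a mini-batch loop. The generator name `get_batches` and the random toy arrays are assumptions for illustration; the actual training loop comes from the previous post.

```python
import numpy as np

def get_batches(X, Y, batch_size):
    """Yield consecutive (X, Y) mini-batches; the last one may be smaller."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]

batch_size = 128
display_step = 50

# Toy data with the same sequence lengths as above (Tx = 20, Ty = 25).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 10, size=(1000, 20))
Y_toy = rng.integers(0, 10, size=(1000, 25))

batches = list(get_batches(X_toy, Y_toy, batch_size))
for step, (x_batch, y_batch) in enumerate(batches):
    if step % display_step == 0:
        # A real loop would run one training step here and report the loss.
        print(f"step {step}: x {x_batch.shape}, y {y_batch.shape}")
```

With about 130,000 sentences and a batch size of 128, one epoch is roughly a thousand steps, so printing every 50 steps gives about twenty reports per epoch.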