02 - Seq2Seq: Principles and Practice

Author: HsuanvaneCHINA | Published 2019-10-10 20:51

    Table of Contents

    Principles

    • History of machine translation
    • The basic Seq2Seq architecture
    • Applications of Seq2Seq
    • Problems with Seq2Seq
    • The attention mechanism

    Practice

    • Task 1:
      • Data preprocessing
      • Encoder and word embeddings
      • Building the decoder
      • Training the model
    • Task 2:
      • Data preprocessing
      • Using pre-trained word vectors
      • Building the decoder
      • Task summary

    Before diving into Seq2Seq, let's first review the network architectures of the RNN (Figure 1) and the LSTM (Figure 2).

    Figure 1: RNN architecture    Figure 2: LSTM architecture

    Principles

    History of Machine Translation

    Figure 3: The earliest word-by-word translation

    Word-by-word translation clearly does not match how people actually communicate: the output is stiff or semantically wrong. This led to statistical machine translation, whose obvious drawback is that it does not take contextual information into account.


    Figure 4: Statistical machine translation

    And then to today's machine translation based on recurrent networks (RNNs) and word embeddings, as shown in Figure 5.


    Figure 5: Deep-learning-based machine translation

    Once we have the input and have encoded it so the computer can compute with and process it, we still need to decode the result afterwards, as shown in Figure 6.


    Figure 6: Deep-learning-based machine translation

    Suppose a user enters an English text sequence and wants the corresponding Spanish translation.

    • First, Input: receive the user's text sequence.
    • Next, the sequence enters the encoder (Encoder, e.g. an RNN), which encodes it into a vector, say a 3-dimensional one.
    • Then, that 3-dimensional vector is fed into the decoder (Decoder).
    • Finally, the decoded text is produced.

    Looking at the whole picture, the pipeline takes a text sequence (Sequence) from the user, processes it (To) through input and encoding, and finally yields the corresponding text sequence (Sequence) through decoding and output -- which is exactly the Seq2Seq workflow, sketched minimally below.
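    As a purely illustrative sketch (plain Python stand-ins, not the TensorFlow model built later), the flow boils down to two functions: one compresses the input sequence into a fixed-size state, the other expands that state back into an output sequence.

    def encode(tokens):
        """Toy 'encoder': compress the whole input sequence into one fixed-size state."""
        # Here the state is just three summary values; a real encoder would return an RNN hidden state.
        return (len(tokens), sum(len(t) for t in tokens), tokens[-1])

    def decode(state, max_steps=4):
        """Toy 'decoder': emit output tokens one step at a time from the state."""
        n_tokens, n_chars, last = state
        return ['step{}({})'.format(i, last) for i in range(min(max_steps, n_tokens))]

    source = ['are', 'you', 'free', 'tomorrow']
    state = encode(source)   # Sequence -> fixed-size representation (encoding)
    output = decode(state)   # Fixed-size representation -> output sequence (decoding)
    print(state)
    print(output)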

    The Seq2Seq Network Architecture

    The model is split into an Encoder and a Decoder, with an intermediate vector connecting the two parts.

    • The Encoder is an RNN whose hidden layer contains several units, each of which is an LSTM cell. The Encoder outputs a processed vector, which serves as the Decoder's input.
    • The Decoder has a similar structure: each unit takes the previous unit's output as its input, producing one result per step.
    • One drawback when training this kind of model is that parallel corpus data is hard to obtain.

    Take Figure 7 as an example: we receive an email that reads "Are you free tomorrow" and want the model to produce the reply "Yes, what's up?". Tip: START is the start symbol (some papers write it as GO);
    END is the stop symbol that tells the decoder when to stop decoding, also called EOS (end of sentence) in some papers. These symbols have to be added to the training data during preprocessing (a small sketch follows Figure 7).

    Figure 7: A Seq2Seq architecture example
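    A minimal sketch of that preprocessing step (the token names <GO> and <EOS> match the ones used in the practice section below):

    GO, EOS = '<GO>', '<EOS>'

    def add_special_tokens(target_tokens):
        """Prepend <GO> for the decoder input and append <EOS> to the decoder target."""
        decoder_input = [GO] + target_tokens        # what the decoder reads step by step
        decoder_target = target_tokens + [EOS]      # what the decoder should predict
        return decoder_input, decoder_target

    print(add_special_tokens(['yes', ',', "what's", 'up', '?']))
    # (['<GO>', 'yes', ',', "what's", 'up', '?'], ['yes', ',', "what's", 'up', '?', '<EOS>'])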

    Applications of Seq2Seq

    • Machine translation


      Figure 8: Seq2Seq application - machine translation
    • Text summarization


      Figure 9: Seq2Seq application - text summarization
    • Emotional dialogue generation


      Figure 10: Seq2Seq application - emotional dialogue generation
    • Code completion


      Figure 11: Seq2Seq application - code completion

    Problems with Seq2Seq

    • Compression loses information
      As shown in Figure 12, the text must be embedded (mapped to vectors) before training and then passed through the LSTM cells. However well the LSTM gates preserve information, some of it is inevitably lost by the time everything is squeezed into the final node, and that hurts the final predictions.


      Figure 12: Information loss in the LSTM
    • Length limits
      If the input sequence is too long, the trained model will not express it well; in practice the ideal length is roughly 10-20 tokens, as shown in Figure 13.


      Figure 13: Seq2Seq is affected by text length

    The Attention Mechanism

    To address the problems above, an attention mechanism is added to the model; for the details, see the article "02-注意力机制-attention机制(基于循环神经网络RNN)".

    In computer vision, attention is described as focusing on a specific region of an image at "high resolution" while perceiving the surrounding regions at "low resolution". Extensive experiments have shown that applying attention to machine translation, summarization, reading comprehension and similar problems yields significant gains. A toy numerical sketch of the core idea follows.
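    A minimal NumPy sketch of the core idea (dot-product scoring here, not the Bahdanau additive attention used later in Task 2): score every encoder state against the current decoder state, softmax the scores, and take the weighted sum as the context vector.

    import numpy as np

    encoder_states = np.random.randn(5, 8)   # 5 source positions, hidden size 8
    decoder_state = np.random.randn(8)       # current decoder hidden state

    scores = encoder_states @ decoder_state           # one score per source position
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
    context = weights @ encoder_states                # weighted sum of encoder states

    print(weights.round(3), context.shape)            # weights sum to 1, context has shape (8,)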

    There is also a bucketing mechanism. Suppose we have many dialogue pairs whose lengths range from 0 to 100 characters; after training, the outputs will span a similar range. Normally we would pad every sentence to the same length, but that increases the amount of work.
    Bucketing instead groups the sentences by length range first, e.g. bucket1 [10, 10], bucket2 [10-30, 20-30], bucket3 [30-100, 30-100], and then runs the computation per bucket. In other words, if the sequence lengths in your corpus vary a lot, consider adding bucketing. (TensorFlow's seq2seq training utilities bucket by default.) A small sketch of the grouping step follows.
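    A minimal sketch of the grouping step, assuming buckets are given as (max_source_len, max_target_len) pairs in the style of the legacy TensorFlow seq2seq tutorial:

    # Hypothetical bucket boundaries: (max_source_len, max_target_len)
    buckets = [(10, 10), (30, 30), (100, 100)]

    def assign_bucket(source_len, target_len):
        """Return the index of the first bucket that fits the pair, or None if none does."""
        for i, (max_src, max_tgt) in enumerate(buckets):
            if source_len <= max_src and target_len <= max_tgt:
                return i
        return None

    pairs = [(['hi'], ['hello', 'there']), (['a'] * 25, ['b'] * 12), (['a'] * 80, ['b'] * 60)]
    for src, tgt in pairs:
        print(len(src), len(tgt), '-> bucket', assign_bucket(len(src), len(tgt)))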

    Practice

    Task 1:

    Task 1 implements a basic Seq2Seq model: given a word (a sequence of letters) as input, the model returns the "word" with its letters sorted.

    A basic Seq2Seq model consists of three main parts: the Encoder, the intermediate state vector, and the Decoder.

    For example, sorting the letters in dictionary order: hello --> ehllo (the snippet below shows the target behaviour).
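    In plain Python, the mapping the model is expected to learn is simply:

    word = 'hello'
    print(''.join(sorted(word)))   # 'ehllo' -- the behaviour the Seq2Seq model should reproduce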

    Check the TensorFlow version

    from distutils.version import LooseVersion
    import tensorflow as tf
    from tensorflow.python.layers.core import Dense
    
    
    # Check TensorFlow Version
    assert LooseVersion(tf.__version__) >= LooseVersion('1.1'), 'Please use TensorFlow version 1.1 or newer'
    print('TensorFlow Version: {}'.format(tf.__version__))
    

    If any packages are missing, you can download them from this site (the download may be slow): http://www.lfd.uci.edu/~gohlke/pythonlibs/#tensorflow

    1. Load the dataset

    import numpy as np
    import time
    import tensorflow as tf
    
    with open('data/letters_source.txt', 'r', encoding='utf-8') as f:
        source_data = f.read()
    
    with open('data/letters_target.txt', 'r', encoding='utf-8') as f:
        target_data = f.read()
    
    1.1 Preview the data
    print(source_data.split('\n')[:10])
    print(target_data.split('\n')[:10])
    

    The source output is:
    ['bsaqq',
    'npy',
    'lbwuj',
    'bqv',
    'kial',
    'tddam',
    'edxpjpg',
    'nspv',
    'huloz',
    'kmclq']

    The target output is:
    ['abqqs',
    'npy',
    'bjluw',
    'bqv',
    'aikl',
    'addmt',
    'degjppx',
    'npsv',
    'hlouz',
    'cklmq']

    source holds the prepared input sequences that are fed to the model.
    target holds the corresponding expected outputs, i.e. the labels the model should learn to produce (not a separate test set).

    2. Data preprocessing

    The preprocessing here builds lookup tables that map each character to an integer id; the dense, low-dimensional embedding vectors are looked up from these ids later, which is what makes the text trainable.

    def extract_character_vocab(data):
        '''
        Build the character <-> id lookup tables
        '''
        # Special tokens for special operations: <GO> to start decoding, <EOS> to stop, <UNK> for characters that cannot be mapped (common in noisy data), and <PAD> to pad every sequence to the same length (like zero-padding in an RNN).
        special_words = ['<PAD>', '<UNK>', '<GO>',  '<EOS>']  

        set_words = list(set([character for line in data.split('\n') for character in line]))  # collect the unique characters into a list so they can be embedded later
        # The four special tokens are added to the vocabulary here
        int_to_vocab = {idx: word for idx, word in enumerate(special_words + set_words)}  # enumerate gives the id -> character mapping
        vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    
        return int_to_vocab, vocab_to_int
    
    2.1 Call the function to preprocess the data
    # Build the lookup tables
    source_int_to_letter, source_letter_to_int = extract_character_vocab(source_data)
    target_int_to_letter, target_letter_to_int = extract_character_vocab(target_data)
    
    # Convert the letters to integer ids
    source_int = [[source_letter_to_int.get(letter, source_letter_to_int['<UNK>']) 
                   for letter in line] for line in source_data.split('\n')]
    target_int = [[target_letter_to_int.get(letter, target_letter_to_int['<UNK>']) 
                   for letter in line] + [target_letter_to_int['<EOS>']] for line in target_data.split('\n')] 
    
    2.2 Inspect the mapping results
    # Check the conversion results
    print(source_int[:10])
    print(target_int[:10])
    

    Result 1:
    [[17, 9, 12, 11, 11], # bsaqq
    [16, 29, 26],
    [13, 17, 15, 25, 8],
    [17, 11, 4],
    [18, 10, 12, 13],
    [23, 7, 7, 12, 24],
    [27, 7, 6, 29, 8, 29, 5],
    [16, 9, 29, 4],
    [28, 25, 13, 21, 20],
    [18, 24, 22, 13, 11]]
    Result 2:
    [[12, 17, 11, 11, 9, 3], # abqqs; note the trailing 3 is the added <EOS> token
    [16, 29, 26, 3],
    [17, 8, 13, 25, 15, 3],
    [17, 11, 4, 3],
    [12, 10, 18, 13, 3],
    [12, 7, 7, 24, 23, 3],
    [7, 27, 5, 8, 29, 29, 6, 3],
    [16, 29, 9, 4, 3],
    [28, 13, 21, 25, 20, 3],
    [22, 18, 13, 24, 11, 3]]

    3. Build the model

    3.1 Input layer
    def get_inputs():
        '''
        Placeholders for the model inputs
        '''
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')  # placeholders with unspecified shapes, so they adapt to the training data
        targets = tf.placeholder(tf.int32, [None, None], name='targets')
        learning_rate = tf.placeholder(tf.float32, name='learning_rate')  # likewise, a placeholder for the learning rate
        
        # Maximum target sequence length (target_sequence_length and source_sequence_length will later be fed through feed_dict)
        target_sequence_length = tf.placeholder(tf.int32, (None,), name='target_sequence_length')
        max_target_sequence_length = tf.reduce_max(target_sequence_length, name='max_target_len')  # the longest target sequence in the batch, used later for padding
        source_sequence_length = tf.placeholder(tf.int32, (None,), name='source_sequence_length')
        
        return inputs, targets, learning_rate, target_sequence_length, max_target_sequence_length, source_sequence_length
    
    3.2 The Encoder

    On the Encoder side we need two steps:

    • first, embed the input;
    • then feed the embedded vectors into the RNN for processing.
    APIs we will use:

    In the embedding step we use tf.contrib.layers.embed_sequence, which performs the embedding for each batch.

    • tf.contrib.layers.embed_sequence:

    Performs embedding on sequence data: it takes a tensor of shape [batch_size, sequence_length] and returns a tensor of shape [batch_size, sequence_length, embed_dim].

    features = [[1,2,3],[4,5,6]]

    outputs = tf.contrib.layers.embed_sequence(features, vocab_size, embed_dim)

    If embed_dim = 4, the output looks like:

    [
    [[0.1,0.2,0.3,0.1],[0.2,0.5,0.7,0.2],[0.1,0.6,0.1,0.2]],
    [[0.6,0.2,0.8,0.2],[0.5,0.6,0.9,0.2],[0.3,0.9,0.2,0.2]]
    ]

    • tf.contrib.rnn.MultiRNNCell:

    Stacks RNN cells into a multi-layer RNN; it takes a list of RNN cells as its argument.

    rnn_size is the number of hidden units in one RNN cell, and num_layers is the number of stacked cells.

    • tf.nn.dynamic_rnn:

    Builds an RNN that accepts dynamically sized input sequences, and returns the RNN outputs together with the final state tensor.

    The difference between dynamic_rnn and rnn is that dynamic_rnn can accept a different sequence_length for each batch.

    For example, the first batch can be [batch_size, 10] and the second [batch_size, 20], whereas rnn only accepts a fixed sequence_length.

    def get_encoder_layer(input_data, rnn_size, num_layers,
                       source_sequence_length, source_vocab_size, 
                       encoding_embedding_size):
    
        '''
        Build the Encoder layer -- essentially just a simple RNN
        
        Arguments:
        - input_data: the input tensor
        - rnn_size: number of hidden units per RNN cell
        - num_layers: number of stacked RNN cells
        - source_sequence_length: lengths of the source sequences
        - source_vocab_size: size of the source vocabulary (number of distinct tokens)
        - encoding_embedding_size: the embedding size, i.e. the dimensionality after mapping tokens to vectors
        '''
        # Encoder embedding
        encoder_embed_input = tf.contrib.layers.embed_sequence(input_data, source_vocab_size, encoding_embedding_size)
    
        # RNN cell: build a basic LSTM cell with random uniform initialization
        def get_lstm_cell(rnn_size):
            lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size, initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))  
            return lstm_cell
    
        # Stack several LSTM cells to build a multi-layer RNN: one LSTM cell per hidden layer
        cell = tf.contrib.rnn.MultiRNNCell([get_lstm_cell(rnn_size) for _ in range(num_layers)])  
        
        # Build the RNN over dynamic-length inputs; returns the RNN outputs and the final state tensor
        encoder_output, encoder_state = tf.nn.dynamic_rnn(cell, encoder_embed_input, 
                                                          sequence_length=source_sequence_length, dtype=tf.float32)  # cell is the stacked network; pass the embedded inputs and their sequence lengths
        
        return encoder_output, encoder_state
    
    3.3 The Decoder

    Preprocess the target data:
    preprocessing includes adding the start/stop tokens and keeping the tensor dimensions consistent.


    Figure 14: The target data after preprocessing
    def process_decoder_input(data, vocab_to_int, batch_size):
        '''
        Prepend <GO> to each batch and drop the last character
        '''
        # cut off the last character of each target sequence
        ending = tf.strided_slice(data, [0, 0], [batch_size, -1], [1, 1])
        decoder_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)
    
        return decoder_input
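    Concretely, using the ids from section 2.2 (<GO> = 2, <EOS> = 3), this is the per-sequence transformation the function performs:

    target        = [12, 17, 11, 11, 9, 3]   # 'abqqs' + <EOS>, as produced earlier
    decoder_input = [2] + target[:-1]        # drop the last id, prepend <GO>
    print(decoder_input)                     # [2, 12, 17, 11, 11, 9]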
    
    3.4 Embed the target data

    Likewise, the target data also has to be embedded so that it can be fed into the RNN inside the Decoder.

    APIs we will use:
    • tf.contrib.seq2seq.TrainingHelper:

    The helper used on the Decoder side during training.

    It does not feed the output of step t-1 as the input of step t; instead it feeds the ground-truth target values directly to the RNN.

    Its main arguments are inputs and sequence_length. It returns a helper object that can be passed to BasicDecoder.

    • tf.contrib.seq2seq.GreedyEmbeddingHelper:

    It differs from TrainingHelper in that it embeds the output of step t-1 before feeding it to the RNN.

    Figure 15 below shows the training process:

    During training we do not feed each step's predicted output into the next step; instead, the next input comes straight from the ground-truth target data (teacher forcing), which makes the model more accurate.

    Figure 15: The Decoder training process
    def decoding_layer(target_letter_to_int, decoding_embedding_size, num_layers, rnn_size,
                       target_sequence_length, max_target_sequence_length, encoder_state, decoder_input):
        '''
        Build the Decoder layer
        
        Arguments:
        - target_letter_to_int: mapping table for the target data
        - decoding_embedding_size: embedding size
        - num_layers: number of stacked RNN cells
        - rnn_size: number of hidden units per RNN cell
        - target_sequence_length: lengths of the target sequences
        - max_target_sequence_length: maximum target sequence length
        - encoder_state: the encoder's final state vector
        - decoder_input: decoder-side input
        '''
        # 1. Embedding
        target_vocab_size = len(target_letter_to_int)  # size of the target vocabulary
        decoder_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size])) # the embedding matrix for the decoder
        decoder_embed_input = tf.nn.embedding_lookup(decoder_embeddings, decoder_input)  # look up the embeddings for the decoder inputs
    
        # 2. Build the RNN cells for the Decoder
        def get_decoder_cell(rnn_size):
            """Build a basic LSTM cell"""
            decoder_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                               initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            return decoder_cell
        cell = tf.contrib.rnn.MultiRNNCell([get_decoder_cell(rnn_size) for _ in range(num_layers)])  # stack the cells into a multi-layer RNN
         
        # 3. Output fully connected layer: produces the logits over the target vocabulary (softmax is applied inside the loss)
        output_layer = Dense(target_vocab_size,
                             kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    
    
        # 4. Training decoder: the LSTM cells take the ground-truth labels directly as inputs
        with tf.variable_scope("decode"):
            # build the helper object
            training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=decoder_embed_input,
                                                                sequence_length=target_sequence_length,
                                                                time_major=False)
            # build the basic training decoder
            training_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                               training_helper,
                                                               encoder_state,
                                                               output_layer) 
            # run dynamic decoding to obtain the training outputs
            training_decoder_output, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                                           impute_finished=True,
                                                                           maximum_iterations=max_target_sequence_length)
    
        # 5. Predicting decoder: the LSTM cells use the previous step's output as the next input
        # shares parameters with the training decoder
        with tf.variable_scope("decode", reuse=True):  # same scope as step 4; reuse=True means the parameters are shared with the training decoder
            # create a constant tensor and tile it to batch_size
            start_tokens = tf.tile(tf.constant([target_letter_to_int['<GO>']], dtype=tf.int32), [batch_size], 
                                   name='start_tokens')
            predicting_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(decoder_embeddings,
                                                                    start_tokens,
                                                                    target_letter_to_int['<EOS>'])
            predicting_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                            predicting_helper,
                                                            encoder_state,
                                                            output_layer)
            predicting_decoder_output, _ = tf.contrib.seq2seq.dynamic_decode(predicting_decoder,
                                                                impute_finished=True,
                                                                maximum_iterations=max_target_sequence_length)
        
        return training_decoder_output, predicting_decoder_output
    
    3.5 Assemble the Seq2Seq model

    With the Encoder and Decoder built above, we now connect the two parts to form the Seq2Seq model.

    def seq2seq_model(input_data, targets, lr, target_sequence_length, 
                      max_target_sequence_length, source_sequence_length,
                      source_vocab_size, target_vocab_size,
                      encoder_embedding_size, decoder_embedding_size, 
                      rnn_size, num_layers):
        
        # get the encoder's final state output
        _, encoder_state = get_encoder_layer(input_data, 
                                      rnn_size, 
                                      num_layers, 
                                      source_sequence_length,
                                      source_vocab_size, 
                                      encoding_embedding_size)
        
        
        # the preprocessed decoder input
        decoder_input = process_decoder_input(targets, target_letter_to_int, batch_size)
        
        # pass the state vector and the decoder input to the decoder
        training_decoder_output, predicting_decoder_output = decoding_layer(target_letter_to_int, 
                                                                           decoding_embedding_size, 
                                                                           num_layers, 
                                                                           rnn_size,
                                                                           target_sequence_length,
                                                                           max_target_sequence_length,
                                                                           encoder_state, 
                                                                           decoder_input) 
        
        return training_decoder_output, predicting_decoder_output
        
    
    
    Hyperparameter settings
    # Hyperparameters
    # Number of Epochs
    epochs = 60
    # Batch Size
    batch_size = 128
    # RNN Size
    rnn_size = 50
    # Number of Layers
    num_layers = 2
    # Embedding Size
    encoding_embedding_size = 15
    decoding_embedding_size = 15
    # Learning Rate
    learning_rate = 0.001
    
    Build the graph
    # Build the graph
    train_graph = tf.Graph()
    
    with train_graph.as_default():
        
        # get the model inputs
        input_data, targets, lr, target_sequence_length, max_target_sequence_length, source_sequence_length = get_inputs()
        
        training_decoder_output, predicting_decoder_output = seq2seq_model(input_data, 
                                                                          targets, 
                                                                          lr, 
                                                                          target_sequence_length, 
                                                                          max_target_sequence_length, 
                                                                          source_sequence_length,
                                                                          len(source_letter_to_int),
                                                                          len(target_letter_to_int),
                                                                          encoding_embedding_size, 
                                                                          decoding_embedding_size, 
                                                                          rnn_size, 
                                                                          num_layers)    
        
        training_logits = tf.identity(training_decoder_output.rnn_output, 'logits')
        predicting_logits = tf.identity(predicting_decoder_output.sample_id, name='predictions')
        
        masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')  # mask out the <PAD> positions so they do not contribute to the loss
    
        with tf.name_scope("optimization"):
            
            # Loss function
            cost = tf.contrib.seq2seq.sequence_loss(
                training_logits,
                targets,
                masks)
    
            # Optimizer
            optimizer = tf.train.AdamOptimizer(lr)  # optimizer

            # Gradient Clipping: clip each gradient tensor to a fixed [min, max] range to guard against exploding gradients
            gradients = optimizer.compute_gradients(cost)  # compute the gradients
            capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]  # clip each gradient to the range [-5, 5]
            train_op = optimizer.apply_gradients(capped_gradients)
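    To see why the mask matters, here is a small NumPy illustration (not part of the original code) of what tf.sequence_mask produces and how it zeroes out the loss on padded positions:

    import numpy as np

    target_sequence_length = np.array([2, 4])   # true lengths of two target sequences
    max_len = 4

    # Equivalent of tf.sequence_mask(target_sequence_length, max_len, dtype=tf.float32)
    masks = (np.arange(max_len)[None, :] < target_sequence_length[:, None]).astype(np.float32)
    print(masks)
    # [[1. 1. 0. 0.]
    #  [1. 1. 1. 1.]]
    # sequence_loss multiplies the per-position cross-entropy by these weights,
    # so <PAD> positions beyond the true length contribute nothing to the loss.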
    
    

    4. Batching

    def pad_sentence_batch(sentence_batch, pad_int):
        '''
        Pad the sequences in a batch so that every row has the same sequence_length
        
        Arguments:
        - sentence_batch
        - pad_int: the index of the <PAD> token
        '''
        max_sentence = max([len(sentence) for sentence in sentence_batch])
        return [sentence + [pad_int] * (max_sentence - len(sentence)) for sentence in sentence_batch]
    
    def get_batches(targets, sources, batch_size, source_pad_int, target_pad_int):
        '''
        A generator that yields batches
        '''
        for batch_i in range(0, len(sources)//batch_size):
            start_i = batch_i * batch_size
            sources_batch = sources[start_i:start_i + batch_size]  # slice out this batch of source sequences
            targets_batch = targets[start_i:start_i + batch_size]
            # pad the sequences
            pad_sources_batch = np.array(pad_sentence_batch(sources_batch, source_pad_int))
            pad_targets_batch = np.array(pad_sentence_batch(targets_batch, target_pad_int))
            
            # record the length of each (padded) sequence
            pad_targets_lengths = []
            for target in pad_targets_batch:
                pad_targets_lengths.append(len(target))
            
            pad_source_lengths = []
            for source in pad_sources_batch:
                pad_source_lengths.append(len(source))
            
            yield pad_targets_batch, pad_sources_batch, pad_targets_lengths, pad_source_lengths
    

    5. Training

    # Split the dataset into train and validation
    train_source = source_int[batch_size:]
    train_target = target_int[batch_size:]
    # hold out one batch for validation
    valid_source = source_int[:batch_size]
    valid_target = target_int[:batch_size]
    (valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = next(get_batches(valid_target, valid_source, batch_size,
                               source_letter_to_int['<PAD>'],
                               target_letter_to_int['<PAD>']))
    
    display_step = 50 # print the loss every 50 batches
    
    checkpoint = "trained_model.ckpt" 
    with tf.Session(graph=train_graph) as sess:
        sess.run(tf.global_variables_initializer())
            
        for epoch_i in range(1, epochs+1):
            for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
                    get_batches(train_target, train_source, batch_size,
                               source_letter_to_int['<PAD>'],
                               target_letter_to_int['<PAD>'])):
                
                _, loss = sess.run(
                    [train_op, cost],
                    {input_data: sources_batch,
                     targets: targets_batch,
                     lr: learning_rate,
                     target_sequence_length: targets_lengths,
                     source_sequence_length: sources_lengths})
    
                if batch_i % display_step == 0:
                    
                    # compute the validation loss
                    validation_loss = sess.run(
                    [cost],
                    {input_data: valid_sources_batch,
                     targets: valid_targets_batch,
                     lr: learning_rate,
                     target_sequence_length: valid_targets_lengths,
                     source_sequence_length: valid_sources_lengths})
                    
                    print('Epoch {:>3}/{} Batch {:>4}/{} - Training Loss: {:>6.3f}  - Validation loss: {:>6.3f}'
                          .format(epoch_i,
                                  epochs, 
                                  batch_i, 
                                  len(train_source) // batch_size, 
                                  loss, 
                                  validation_loss[0]))
    
        
        
        # save the model
        saver = tf.train.Saver()
        saver.save(sess, checkpoint)
        print('Model Trained and Saved')
    
    Results:

    Epoch 1/60 Batch 50/77 - Training Loss: 2.332 - Validation loss: 2.091
    Epoch 2/60 Batch 50/77 - Training Loss: 1.803 - Validation loss: 1.593
    Epoch 3/60 Batch 50/77 - Training Loss: 1.550 - Validation loss: 1.379
    Epoch 4/60 Batch 50/77 - Training Loss: 1.343 - Validation loss: 1.184
    Epoch 5/60 Batch 50/77 - Training Loss: 1.230 - Validation loss: 1.077
    Epoch 6/60 Batch 50/77 - Training Loss: 1.096 - Validation loss: 0.956
    Epoch 7/60 Batch 50/77 - Training Loss: 0.993 - Validation loss: 0.849
    Epoch 8/60 Batch 50/77 - Training Loss: 0.893 - Validation loss: 0.763
    Epoch 9/60 Batch 50/77 - Training Loss: 0.808 - Validation loss: 0.673
    Epoch 10/60 Batch 50/77 - Training Loss: 0.728 - Validation loss: 0.600
    Epoch 11/60 Batch 50/77 - Training Loss: 0.650 - Validation loss: 0.539
    Epoch 12/60 Batch 50/77 - Training Loss: 0.594 - Validation loss: 0.494
    Epoch 13/60 Batch 50/77 - Training Loss: 0.560 - Validation loss: 0.455
    Epoch 14/60 Batch 50/77 - Training Loss: 0.502 - Validation loss: 0.411
    Epoch 15/60 Batch 50/77 - Training Loss: 0.464 - Validation loss: 0.380
    Epoch 16/60 Batch 50/77 - Training Loss: 0.428 - Validation loss: 0.352
    Epoch 17/60 Batch 50/77 - Training Loss: 0.394 - Validation loss: 0.323
    Epoch 18/60 Batch 50/77 - Training Loss: 0.364 - Validation loss: 0.297
    Epoch 19/60 Batch 50/77 - Training Loss: 0.335 - Validation loss: 0.270
    Epoch 20/60 Batch 50/77 - Training Loss: 0.305 - Validation loss: 0.243
    Epoch 21/60 Batch 50/77 - Training Loss: 0.311 - Validation loss: 0.248
    Epoch 22/60 Batch 50/77 - Training Loss: 0.253 - Validation loss: 0.203
    Epoch 23/60 Batch 50/77 - Training Loss: 0.227 - Validation loss: 0.182
    Epoch 24/60 Batch 50/77 - Training Loss: 0.204 - Validation loss: 0.165
    Epoch 25/60 Batch 50/77 - Training Loss: 0.184 - Validation loss: 0.150
    Epoch 26/60 Batch 50/77 - Training Loss: 0.166 - Validation loss: 0.136
    Epoch 27/60 Batch 50/77 - Training Loss: 0.150 - Validation loss: 0.124
    Epoch 28/60 Batch 50/77 - Training Loss: 0.135 - Validation loss: 0.113
    Epoch 29/60 Batch 50/77 - Training Loss: 0.121 - Validation loss: 0.103
    Epoch 30/60 Batch 50/77 - Training Loss: 0.109 - Validation loss: 0.094
    Epoch 31/60 Batch 50/77 - Training Loss: 0.098 - Validation loss: 0.086
    Epoch 32/60 Batch 50/77 - Training Loss: 0.088 - Validation loss: 0.079
    Epoch 33/60 Batch 50/77 - Training Loss: 0.079 - Validation loss: 0.073
    Epoch 34/60 Batch 50/77 - Training Loss: 0.071 - Validation loss: 0.067
    Epoch 35/60 Batch 50/77 - Training Loss: 0.063 - Validation loss: 0.062
    Epoch 36/60 Batch 50/77 - Training Loss: 0.057 - Validation loss: 0.057
    Epoch 37/60 Batch 50/77 - Training Loss: 0.052 - Validation loss: 0.053
    Epoch 38/60 Batch 50/77 - Training Loss: 0.047 - Validation loss: 0.049
    Epoch 39/60 Batch 50/77 - Training Loss: 0.043 - Validation loss: 0.045
    Epoch 40/60 Batch 50/77 - Training Loss: 0.039 - Validation loss: 0.042
    Epoch 41/60 Batch 50/77 - Training Loss: 0.036 - Validation loss: 0.039
    Epoch 42/60 Batch 50/77 - Training Loss: 0.033 - Validation loss: 0.037
    Epoch 43/60 Batch 50/77 - Training Loss: 0.030 - Validation loss: 0.034
    Epoch 44/60 Batch 50/77 - Training Loss: 0.028 - Validation loss: 0.032
    Epoch 45/60 Batch 50/77 - Training Loss: 0.026 - Validation loss: 0.029
    Epoch 46/60 Batch 50/77 - Training Loss: 0.024 - Validation loss: 0.028
    Epoch 47/60 Batch 50/77 - Training Loss: 0.027 - Validation loss: 0.029
    Epoch 48/60 Batch 50/77 - Training Loss: 0.030 - Validation loss: 0.030
    Epoch 49/60 Batch 50/77 - Training Loss: 0.023 - Validation loss: 0.026
    Epoch 50/60 Batch 50/77 - Training Loss: 0.021 - Validation loss: 0.024
    Epoch 51/60 Batch 50/77 - Training Loss: 0.019 - Validation loss: 0.022
    Epoch 52/60 Batch 50/77 - Training Loss: 0.017 - Validation loss: 0.021
    Epoch 53/60 Batch 50/77 - Training Loss: 0.016 - Validation loss: 0.020
    Epoch 54/60 Batch 50/77 - Training Loss: 0.015 - Validation loss: 0.019
    Epoch 55/60 Batch 50/77 - Training Loss: 0.014 - Validation loss: 0.018
    Epoch 56/60 Batch 50/77 - Training Loss: 0.013 - Validation loss: 0.018
    Epoch 57/60 Batch 50/77 - Training Loss: 0.012 - Validation loss: 0.017
    Epoch 58/60 Batch 50/77 - Training Loss: 0.011 - Validation loss: 0.016
    Epoch 59/60 Batch 50/77 - Training Loss: 0.011 - Validation loss: 0.016
    Epoch 60/60 Batch 50/77 - Training Loss: 0.010 - Validation loss: 0.015
    Model Trained and Saved

    6. Prediction

    def source_to_seq(text):
        '''
        Convert the source input into a padded sequence of ids
        '''
        sequence_length = 7
        return [source_letter_to_int.get(word, source_letter_to_int['<UNK>']) for word in text] + [source_letter_to_int['<PAD>']]*(sequence_length-len(text))
    
    # input a single word
    input_word = 'common'
    text = source_to_seq(input_word)
    
    checkpoint = "./trained_model.ckpt"
    
    loaded_graph = tf.Graph()
    with tf.Session(graph=loaded_graph) as sess:
        # load the trained model
        loader = tf.train.import_meta_graph(checkpoint + '.meta')
        loader.restore(sess, checkpoint)
    
        input_data = loaded_graph.get_tensor_by_name('inputs:0')
        logits = loaded_graph.get_tensor_by_name('predictions:0')
        source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
        target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
        
        answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                          target_sequence_length: [len(text)]*batch_size, 
                                          source_sequence_length: [len(text)]*batch_size})[0] 
    
    
    pad = source_letter_to_int["<PAD>"] 
    
    print('Original input:', input_word)
    
    print('\nSource')
    print('  Word ids:    {}'.format([i for i in text]))
    print('  Input Words: {}'.format(" ".join([source_int_to_letter[i] for i in text])))
    
    print('\nTarget')
    print('  Word ids:       {}'.format([i for i in answer_logits if i != pad]))
    print('  Response Words: {}'.format(" ".join([target_int_to_letter[i] for i in answer_logits if i != pad])))
    
    Results:

    INFO:tensorflow:Restoring parameters from ./trained_model.ckpt
    Original input: common

    Source
    Word ids: [20, 28, 6, 6, 28, 5, 0]
    Input Words: c o m m o n <PAD>

    Target
    Word ids: [20, 6, 6, 5, 28, 28, 3]
    Response Words: c m m n o o <EOS>

    Task 2: Text Summarization Exercise

    Dataset: ~500,000 Amazon reviews
    The task proceeds in the following steps:

    • Data preprocessing
    • Build the Seq2Seq model
    • Train the network
    • Test the results

    Seq2Seq tutorial: https://github.com/j-min/tf_tutorial_plus/tree/master/RNN_seq2seq/contrib_seq2seq - a well-written Seq2Seq tutorial.

    1. Import the required libraries

    import pandas as pd
    import numpy as np
    import tensorflow as tf
    import re
    from nltk.corpus import stopwords
    import time
    from tensorflow.python.layers.core import Dense
    from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
    print('TensorFlow Version: {}'.format(tf.__version__))
    

    2. Load the data

    reviews = pd.read_csv("Reviews.csv")
    print(reviews.shape)
    print(reviews.head())
    

    The result is:
    (568454, 10)

    Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text
    0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Good Quality Dog Food I have bought several of the Vitality canned d...
    1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...
    2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres "Natalia Corres" 1 1 4 1219017600 "Delight" says it all This is a confection that has been around a fe...
    3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200 Cough Medicine If you are looking for the secret ingredient i...
    4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham "M. Wassir" 0 0 5 1350777600 Great taffy Great taffy at a great price. There was a wid...

    2.1 Check for null values
    # Check for any nulls values
    reviews.isnull().sum()
    
    2.2 Drop null values and unneeded features
    # Remove null values and unneeded features
    reviews = reviews.dropna()
    reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator',
                            'Score','Time'], 1)
    reviews = reviews.reset_index(drop=True)
    
    reviews.head()
    
    2.3 Inspect some of the data
    # Inspecting some of the reviews
    for i in range(5):
        print("Review #",i+1)
        print(reviews.Summary[i])
        print(reviews.Text[i])
        print()
    

    3. Data preprocessing

    The main processing steps:

    • Convert everything to lowercase
    • Expand contractions
    • Remove stop words (only from the review texts)
    3.1 Define the contraction list
    
    contractions = { 
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he's": "he is",
    "how'd": "how did",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "i would",
    "i'll": "i will",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'll": "it will",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "must've": "must have",
    "mustn't": "must not",
    "needn't": "need not",
    "oughtn't": "ought not",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "she'd": "she would",
    "she'll": "she will",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "that'd": "that would",
    "that's": "that is",
    "there'd": "there had",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'll": "we will",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "where'd": "where did",
    "where's": "where is",
    "who'll": "who will",
    "who's": "who is",
    "won't": "will not",
    "wouldn't": "would not",
    "you'd": "you would",
    "you'll": "you will",
    "you're": "you are"
    }
    
    3.2 Clean the data
    def clean_text(text, remove_stopwords = True):
        '''Remove unwanted characters, stopwords, and format the text to create fewer nulls word embeddings'''
        
        # Convert words to lower case
        text = text.lower()
        
        # Replace contractions with their longer forms 
        if True:
            text = text.split()
            new_text = []
            for word in text:
                if word in contractions:
                    new_text.append(contractions[word])
                else:
                    new_text.append(word)
            text = " ".join(new_text)
        
        # Format words and remove unwanted characters
        text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
        text = re.sub(r'\<a href', ' ', text)
        text = re.sub(r'&amp;', '', text) 
        text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
        text = re.sub(r'<br />', ' ', text)
        text = re.sub(r'\'', ' ', text)
        
        # Optionally, remove stop words
        if remove_stopwords:
            text = text.split()
            stops = set(stopwords.words("english"))
            text = [w for w in text if not w in stops]
            text = " ".join(text)
    
        return text
    

    We remove stop words from the review texts because they do not help train the model; however, we keep them in the summaries so that the summaries sound more like natural phrases. A quick check of clean_text is shown below.
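    A quick, hypothetical check of the function (the sample string is made up; it assumes the NLTK stopword list has been downloaded via nltk.download('stopwords')):

    sample = "This isn't the best coffee I've had, but it's great for the price!"

    print(clean_text(sample, remove_stopwords=False))  # contractions expanded, punctuation stripped
    print(clean_text(sample))                          # stop words removed as well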

    # Clean the summaries and texts
    clean_summaries = []
    for summary in reviews.Summary:
        clean_summaries.append(clean_text(summary, remove_stopwords=False))
    print("Summaries are complete.")
    
    clean_texts = []
    for text in reviews.Text:
        clean_texts.append(clean_text(text))
    print("Texts are complete.")
    

    Inspect the cleaned summaries and texts to make sure they have been cleaned properly

    for i in range(5):
        print("Clean Review #",i+1)
        print(clean_summaries[i])
        print(clean_texts[i])
        print()
    

    Count how many times each word occurs in a set of texts

    def count_words(count_dict, text):
        '''Count the number of occurrences of each word in a set of text'''
        for sentence in text:
            for word in sentence.split():
                if word not in count_dict:
                    count_dict[word] = 1
                else:
                    count_dict[word] += 1
    

    Find each word's usage count and the size of the vocabulary

    word_counts = {}
    
    count_words(word_counts, clean_summaries)
    count_words(word_counts, clean_texts)
                
    print("Size of Vocabulary:", len(word_counts))
    

    Result:
    Size of Vocabulary: 132884

    4. Use pre-trained word vectors

    Here we use pre-trained word vectors that are known to work well.

    # Load Conceptnet Numberbatch (CN) embeddings, similar to GloVe but arguably better
    # (https://github.com/commonsense/conceptnet-numberbatch) -- pre-trained ConceptNet word vectors
    embeddings_index = {}
    with open('numberbatch-en-17.04b.txt', encoding='utf-8') as f:
        for line in f:
            values = line.split(' ')
            word = values[0]
            embedding = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = embedding
    
    print('Word embeddings:', len(embeddings_index))
    

    Total number of word vectors loaded:
    484557

    4.1 Some words in our corpus are missing from the pre-trained vectors, so we need to create embeddings for them ourselves.
    # Find the number of words that are missing from CN, and are used more than our threshold.embedding.
    missing_words = 0
    threshold = 20
    
    for word, count in word_counts.items():
        if count > threshold:
            if word not in embeddings_index:
                missing_words += 1
                
    missing_ratio = round(missing_words/len(word_counts),4)*100
                
    print("Number of words missing from CN:", missing_words)
    print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))
    

    The result is:
    Number of words missing from CN: 3044
    Percent of words that are missing from vocabulary: 2.29%

    The threshold is set to 20: any word that is not in the pre-trained vectors but appears more than 20 times will need an embedding of its own.

    4.2 Dictionaries to convert words to integers
    # Limit the vocab that we will use to words that appear >= threshold or are in CN
    
    # dictionary to convert words to integers, so that words can be converted easily during training and testing
    vocab_to_int = {} 
    
    value = 0
    for word, count in word_counts.items():
        if count >= threshold or word in embeddings_index:
            vocab_to_int[word] = value
            value += 1
    
    # Special tokens that will be added to our vocab
    codes = ["<UNK>","<PAD>","<EOS>","<GO>"]   
    
    # Add codes to vocab
    for code in codes:
        vocab_to_int[code] = len(vocab_to_int)
    
    # Dictionary to convert integers to words
    int_to_vocab = {}
    for word, value in vocab_to_int.items():
        int_to_vocab[value] = word
    
    usage_ratio = round(len(vocab_to_int) / len(word_counts),4)*100
    
    print("Total number of unique words:", len(word_counts))
    print("Number of words we will use:", len(vocab_to_int))
    print("Percent of words we will use: {}%".format(usage_ratio))
    

    The result is:
    Total number of unique words: 132884
    Number of words we will use: 65469
    Percent of words we will use: 49.27%

    4.3 Set the embedding dimension
    # Need to use 300 for embedding dimensions to match CN's vectors.
    embedding_dim = 300  # the pre-trained vectors are 300-dimensional, so we use 300 here as well to stay consistent
    nb_words = len(vocab_to_int)
    
    # Create matrix with default values of zero
    word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype=np.float32)
    for word, i in vocab_to_int.items():
        if word in embeddings_index:
            word_embedding_matrix[i] = embeddings_index[word]
        else:
            # If word not in CN, create a random embedding for it
            new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
            embeddings_index[word] = new_embedding
            word_embedding_matrix[i] = new_embedding
    
    # Check if value matches len(vocab_to_int)
    print(len(word_embedding_matrix))  # 65469
    
    4.4 Convert the words in the texts to integers
    def convert_to_ints(text, word_count, unk_count, eos=False):
        '''Convert the words in text to integers.
           If a word is not in vocab_to_int, use <UNK>'s integer.
           Keep a running total of the number of words and UNKs.
           Optionally add an <EOS> token to the end of each text.'''
        ints = []
        for sentence in text:
            sentence_ints = []
            for word in sentence.split():
                word_count += 1
                if word in vocab_to_int:
                    sentence_ints.append(vocab_to_int[word])
                else:
                    sentence_ints.append(vocab_to_int["<UNK>"])
                    unk_count += 1
            if eos:
                sentence_ints.append(vocab_to_int["<EOS>"])
            ints.append(sentence_ints)
        return ints, word_count, unk_count
    
    4.5 Apply convert_to_ints to clean_summaries and clean_texts
    # Apply convert_to_ints to clean_summaries and clean_texts
    word_count = 0
    unk_count = 0
    
    int_summaries, word_count, unk_count = convert_to_ints(clean_summaries, word_count, unk_count)
    int_texts, word_count, unk_count = convert_to_ints(clean_texts, word_count, unk_count, eos=True)
    
    unk_percent = round(unk_count/word_count,4)*100
    
    print("Total number of words in headlines:", word_count)
    print("Total number of UNKs in headlines:", unk_count)
    print("Percent of words that are UNK: {}%".format(unk_percent))
    

    The result is:
    Total number of words in headlines: 25679946
    Total number of UNKs in headlines: 170450
    Percent of words that are UNK: 0.66%

    4.6 Create a DataFrame of sentence lengths
    def create_lengths(text):  # sentence lengths vary and we will need padding, so first record the length of every sentence
        '''Create a data frame of the sentence lengths from a text'''
        lengths = []
        for sentence in text:
            lengths.append(len(sentence))
        return pd.DataFrame(lengths, columns=['counts'])
    
    lengths_summaries = create_lengths(int_summaries)
    lengths_texts = create_lengths(int_texts)
    
    print("Summaries:")
    print(lengths_summaries.describe())
    print()
    print("Texts:")
    print(lengths_texts.describe())
    

    The result is:
    Summaries:
    counts
    count 568412.000000
    mean 4.181620
    std 2.657872
    min 0.000000
    25% 2.000000
    50% 4.000000
    75% 5.000000
    max 48.000000

    Texts:
    counts
    count 568412.000000
    mean 41.996782
    std 42.520854
    min 1.000000
    25% 18.000000
    50% 29.000000
    75% 50.000000
    max 2085.000000

    # Inspect the length of texts (percentiles)
    print(np.percentile(lengths_texts.counts, 90))
    print(np.percentile(lengths_texts.counts, 95))
    print(np.percentile(lengths_texts.counts, 99))
    

    84.0
    115.0
    207.0

    # Inspect the length of summaries (percentiles)
    print(np.percentile(lengths_summaries.counts, 90))
    print(np.percentile(lengths_summaries.counts, 95))
    print(np.percentile(lengths_summaries.counts, 99))
    

    8.0
    9.0
    13.0

    4.7 Count how many times <UNK> appears in a sentence
    def unk_counter(sentence):
        '''Counts the number of times UNK appears in a sentence.'''
        unk_count = 0
        for word in sentence:
            if word == vocab_to_int["<UNK>"]:
                unk_count += 1
        return unk_count
    
    4.8 Sort the texts and set length limits
    # Sort the summaries and texts by the length of the texts, shortest to longest
    # Limit the length of summaries and texts based on the min and max ranges
    # Remove reviews that include too many UNKs
    
    sorted_summaries = []
    sorted_texts = []
    max_text_length = 84
    max_summary_length = 13
    min_length = 2
    unk_text_limit = 1
    unk_summary_limit = 0
    
    for length in range(min(lengths_texts.counts), max_text_length): 
        for count, words in enumerate(int_summaries):
            if (len(int_summaries[count]) >= min_length and
                len(int_summaries[count]) <= max_summary_length and
                len(int_texts[count]) >= min_length and
                unk_counter(int_summaries[count]) <= unk_summary_limit and
                unk_counter(int_texts[count]) <= unk_text_limit and
                length == len(int_texts[count])
               ):
                sorted_summaries.append(int_summaries[count])
                sorted_texts.append(int_texts[count])
            
    # Compare lengths to ensure they match
    print(len(sorted_summaries))
    print(len(sorted_texts))
    

    5. Build the Seq2Seq model

    Here we use a variant of the RNN, the bidirectional RNN. Its improvement is that the output at step t is assumed to depend not only on the earlier part of the sequence but also on the later part.

    For example, predicting a missing word in a sentence requires both the left and the right context. A bidirectional RNN is relatively simple: it consists of two RNNs stacked on top of each other, and the output is determined by the hidden states of both. A toy illustration follows the figure below.


    Bidirectional RNNs
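    As a toy NumPy illustration of the idea (not the tf.nn.bidirectional_dynamic_rnn call used in the code below): run one pass left-to-right and one right-to-left, then concatenate the two hidden states at every position.

    import numpy as np

    def simple_rnn(inputs, hidden_size=4, seed=0):
        """A bare-bones tanh RNN scan that returns the hidden state at every step."""
        rng = np.random.RandomState(seed)
        W_x = rng.randn(inputs.shape[1], hidden_size) * 0.1
        W_h = rng.randn(hidden_size, hidden_size) * 0.1
        h = np.zeros(hidden_size)
        states = []
        for x in inputs:
            h = np.tanh(x @ W_x + h @ W_h)
            states.append(h)
        return np.array(states)

    x = np.random.randn(6, 3)                      # 6 time steps, input size 3
    fw = simple_rnn(x)                             # forward pass
    bw = simple_rnn(x[::-1])[::-1]                 # backward pass, re-reversed to align
    bi_output = np.concatenate([fw, bw], axis=1)   # like tf.concat(enc_output, 2) below
    print(bi_output.shape)                         # (6, 8): forward + backward states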
    5.1 Input layer
    5.1.1 Set up the model inputs: create placeholders for the model's inputs
    def model_inputs():
        '''Create placeholders for inputs to the model'''
        
        input_data = tf.placeholder(tf.int32, [None, None], name='input')
        targets = tf.placeholder(tf.int32, [None, None], name='targets')
        lr = tf.placeholder(tf.float32, name='learning_rate')
        keep_prob = tf.placeholder(tf.float32, name='keep_prob')
        summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
        max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
        text_length = tf.placeholder(tf.int32, (None,), name='text_length')
    
        return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length
    
    5.2 Insert <GO> for batching and training
    def process_encoding_input(target_data, vocab_to_int, batch_size):
        '''Remove the last word id from each batch and concat the <GO> to the beginning of each batch'''
        
        ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
        dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)
    
        return dec_input
    
    5.3 Encoding layer
    5.3.1 Create the encoding layer
    def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
        '''Create the encoding layer: a bidirectional RNN, i.e. two RNNs working together'''
        
        for layer in range(num_layers):
            with tf.variable_scope('encoder_{}'.format(layer)):
                cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                                  initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
                cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                        input_keep_prob = keep_prob)
    
                cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                                  initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
                cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                        input_keep_prob = keep_prob)
    
                enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                        cell_bw, 
                                                                        rnn_inputs,
                                                                        sequence_length,
                                                                        dtype=tf.float32)
        # Join outputs since we are using a bidirectional RNN
        enc_output = tf.concat(enc_output,2)
        
        return enc_output, enc_state
    
    5.3.2 Training decoding layer
    def training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state, output_layer, 
                                vocab_size, max_summary_length):
        '''Create the training logits.
           Logits are the unnormalized scores fed into the softmax layer, so they have the same shape as the labels.'''
        
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                            sequence_length=summary_length,
                                                            time_major=False)
    
        training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                           training_helper,
                                                           initial_state,
                                                           output_layer) 
    
        training_logits, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                               output_time_major=False,
                                                               impute_finished=True,
                                                               maximum_iterations=max_summary_length)
        return training_logits
    
    5.3.3 Inference decoding layer
    def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                                 max_summary_length, batch_size):
        '''Create the inference logits'''
        
        start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')
        
        inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                    start_tokens,
                                                                    end_token)
                    
        inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                            inference_helper,
                                                            initial_state,
                                                            output_layer)
                    
        inference_logits, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                                output_time_major=False,
                                                                impute_finished=True,
                                                                maximum_iterations=max_summary_length)
        
        return inference_logits
    
    5.4 Decoding layer
    def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length, 
                       max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
        '''Create the decoding cell and the attention mechanism for the training and inference decoding layers'''
        
        for layer in range(num_layers):
            with tf.variable_scope('decoder_{}'.format(layer)):
                lstm = tf.contrib.rnn.LSTMCell(rnn_size,
                                               initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
                dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, 
                                                         input_keep_prob = keep_prob)
        
        output_layer = Dense(vocab_size,
                             kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
        
        attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                      enc_output,
                                                      text_length,
                                                      normalize=False,
                                                      name='BahdanauAttention')
    
        dec_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(dec_cell,
                                                              attn_mech,
                                                              rnn_size)
                
        initial_state = tf.contrib.seq2seq.DynamicAttentionWrapperState(enc_state[0],
                                                                        _zero_state_tensors(rnn_size, 
                                                                                            batch_size, 
                                                                                            tf.float32)) 
        with tf.variable_scope("decode"):
            training_logits = training_decoding_layer(dec_embed_input, 
                                                      summary_length, 
                                                      dec_cell, 
                                                      initial_state,
                                                      output_layer,
                                                      vocab_size, 
                                                      max_summary_length)
        with tf.variable_scope("decode", reuse=True):
            inference_logits = inference_decoding_layer(embeddings,  
                                                        vocab_to_int['<GO>'], 
                                                        vocab_to_int['<EOS>'],
                                                        dec_cell, 
                                                        initial_state, 
                                                        output_layer,
                                                        max_summary_length,
                                                        batch_size)
    
        return training_logits, inference_logits
    
    5.5 Assemble the Seq2Seq model
    def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length, 
                      vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
        '''Use the previously defined functions to create the training and inference logits'''
        
        # Use Numberbatch's embeddings and the newly created ones as our embeddings
        embeddings = word_embedding_matrix  # the embedding matrix built earlier
        
        enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
        enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)
        
        dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)
        dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)
        
        training_logits, inference_logits  = decoding_layer(dec_embed_input, 
                                                            embeddings,
                                                            enc_output,
                                                            enc_state, 
                                                            vocab_size, 
                                                            text_length, 
                                                            summary_length, 
                                                            max_summary_length,
                                                            rnn_size, 
                                                            vocab_to_int, 
                                                            keep_prob, 
                                                            batch_size,
                                                            num_layers)
        
        return training_logits, inference_logits
    
    5.6 Batch the text sentences
    5.6.1 Pad the sentences so they all have the same length
    def pad_sentence_batch(sentence_batch):
        """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
        max_sentence = max([len(sentence) for sentence in sentence_batch])
        return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]
    
    5.6.2 Batch the summaries, texts, and the lengths of their sentences together
    def get_batches(summaries, texts, batch_size):
        """Batch summaries, texts, and the lengths of their sentences together"""
        for batch_i in range(0, len(texts)//batch_size):
            start_i = batch_i * batch_size
            summaries_batch = summaries[start_i:start_i + batch_size]
            texts_batch = texts[start_i:start_i + batch_size]
            pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
            pad_texts_batch = np.array(pad_sentence_batch(texts_batch))
            
            # Need the lengths for the _lengths parameters
            pad_summaries_lengths = []
            for summary in pad_summaries_batch:
                pad_summaries_lengths.append(len(summary))
            
            pad_texts_lengths = []
            for text in pad_texts_batch:
                pad_texts_lengths.append(len(text))
            
            yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths
    
    5.7 Set the hyperparameters
    # Set the Hyperparameters
    epochs = 100
    batch_size = 64
    rnn_size = 256
    num_layers = 2
    learning_rate = 0.005
    keep_probability = 0.75
    
    5.8 Build the TensorFlow computation graph
    # Build the graph
    train_graph = tf.Graph()
    # Set the graph to default to ensure that it is ready for training
    with train_graph.as_default():
        
        # Load the model inputs    
        input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()
    
        # Create the training and inference logits
        training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                          targets, 
                                                          keep_prob,   
                                                          text_length,
                                                          summary_length,
                                                          max_summary_length,
                                                          len(vocab_to_int)+1,
                                                          rnn_size, 
                                                          num_layers, 
                                                          vocab_to_int,
                                                          batch_size)
        
        # Create tensors for the training logits and inference logits
        training_logits = tf.identity(training_logits.rnn_output, 'logits')
        inference_logits = tf.identity(inference_logits.sample_id, name='predictions')
        
        # Create the weights for sequence_loss
        masks = tf.sequence_mask(summary_length, max_summary_length, dtype=tf.float32, name='masks')
    
        with tf.name_scope("optimization"):
            # Loss function
            cost = tf.contrib.seq2seq.sequence_loss(
                training_logits,
                targets,
                masks)
    
            # Optimizer
            optimizer = tf.train.AdamOptimizer(learning_rate)
    
            # Gradient Clipping
            gradients = optimizer.compute_gradients(cost)
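            # Clip every gradient element to the range [-5, 5] to avoid exploding gradients in the RNN.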
            capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
            train_op = optimizer.apply_gradients(capped_gradients)
    print("Graph is built.")
    

    6. Train the network

    6.1 Subset the data for training
    # Subset the data for training
    start = 200000
    end = start + 50000
    sorted_summaries_short = sorted_summaries[start:end]
    sorted_texts_short = sorted_texts[start:end]
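    # Because the texts were sorted by length during preprocessing, this slice selects a band of
    # moderately long reviews, so the padded batches stay fairly uniform in length.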
    print("The shortest text length:", len(sorted_texts_short[0]))  # The shortest text length: 25
    print("The longest text length:",len(sorted_texts_short[-1]))  # The longest text length: 31
    
    6.2 Train the model
    # Train the Model
    learning_rate_decay = 0.95
    min_learning_rate = 0.0005
    display_step = 20 # Check training loss after every 20 batches
    stop_early = 0 
    stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
    per_epoch = 3 # Make 3 update checks per epoch
    update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1
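    # With the 50,000-example subset, batch_size = 64 and per_epoch = 3:
    # update_check = 50000 // 64 // 3 - 1 = 259, i.e. the loss is evaluated three times per epoch.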
    
    update_loss = 0 
    batch_loss = 0
    summary_update_loss = [] # Record the update losses for saving improvements in the model
    
    checkpoint = "best_model.ckpt" 
    with tf.Session(graph=train_graph) as sess:
        sess.run(tf.global_variables_initializer())
        
        # If we want to continue training a previous session
        #loader = tf.train.import_meta_graph("./" + checkpoint + '.meta')
        #loader.restore(sess, checkpoint)
        
        for epoch_i in range(1, epochs+1):
            update_loss = 0
            batch_loss = 0
            for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                    get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
                start_time = time.time()
                _, loss = sess.run(
                    [train_op, cost],
                    {input_data: texts_batch,
                     targets: summaries_batch,
                     lr: learning_rate,
                     summary_length: summaries_lengths,
                     text_length: texts_lengths,
                     keep_prob: keep_probability})
    
                batch_loss += loss
                update_loss += loss
                end_time = time.time()
                batch_time = end_time - start_time
    
                if batch_i % display_step == 0 and batch_i > 0:
                    print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                          .format(epoch_i,
                                  epochs, 
                                  batch_i, 
                                  len(sorted_texts_short) // batch_size, 
                                  batch_loss / display_step, 
                                  batch_time*display_step))
                    batch_loss = 0
    
                if batch_i % update_check == 0 and batch_i > 0:
                    print("Average loss for this update:", round(update_loss/update_check,3))
                    summary_update_loss.append(update_loss)
                    
                    # If the update loss is at a new minimum, save the model
                    if update_loss <= min(summary_update_loss):
                        print('New Record!') 
                        stop_early = 0
                        saver = tf.train.Saver() 
                        saver.save(sess, checkpoint)
    
                    else:
                        print("No Improvement.")
                        stop_early += 1
                        if stop_early == stop:
                            break
                    update_loss = 0
                
                        
            # Reduce learning rate, but not below its minimum value
            learning_rate *= learning_rate_decay
            if learning_rate < min_learning_rate:
                learning_rate = min_learning_rate
            
            if stop_early == stop:
                print("Stopping Training.")
                break
    

    7. Test the model

    7.1 Prepare text for the model
    def text_to_seq(text):
        '''Prepare the text for the model'''
        text = clean_text(text)
        return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
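
    A quick usage sketch (clean_text and vocab_to_int come from the earlier preprocessing steps; words missing from the vocabulary fall back to the '<UNK>' id):

    # Turn a raw review into the id sequence the loaded model expects.
    sample_review = "The coffee tasted great and was at such a good price!"
    print(text_to_seq(sample_review))  # one id per token that survives clean_text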
    
    7.2 Feed in text and run the test
    # Create your own review or use one from the dataset
    #input_sentence = "I have never eaten an apple before, but this red one was nice. \
                      #I think that I will try a green apple next time."
    #text = text_to_seq(input_sentence)
    random = np.random.randint(0,len(clean_texts))
    input_sentence = clean_texts[random]
    text = text_to_seq(clean_texts[random])
    
    checkpoint = "./best_model.ckpt"
    
    loaded_graph = tf.Graph()
    with tf.Session(graph=loaded_graph) as sess:
        # Load saved model
        loader = tf.train.import_meta_graph(checkpoint + '.meta')
        loader.restore(sess, checkpoint)
    
        input_data = loaded_graph.get_tensor_by_name('input:0')
        logits = loaded_graph.get_tensor_by_name('predictions:0')
        text_length = loaded_graph.get_tensor_by_name('text_length:0')
        summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
        keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
        
        # Multiply input_data and text_length by batch_size so the feed matches the graph's fixed batch size
        answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                          summary_length: [np.random.randint(5,8)], 
                                          text_length: [len(text)]*batch_size,
                                          keep_prob: 1.0})[0] 
    
    # Remove the padding from the generated summary
    pad = vocab_to_int["<PAD>"] 
    
    print('Original Text:', input_sentence)
    
    print('\nText')
    print('  Word Ids:    {}'.format([i for i in text]))
    print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))
    
    print('\nSummary')
    print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
    print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))
    

    The output is:
    INFO:tensorflow:Restoring parameters from ./best_model.ckpt
    Original Text: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

    Text
    Word Ids: [70595, 18808, 668, 45565, 51927, 51759, 32488, 13510, 32036, 59599, 11693, 444, 23335, 32036, 59599, 51927, 67316, 726, 24842, 50494, 48492, 1062, 44749, 38443, 42344, 67973, 14168, 7759, 5347, 29528, 58763, 18927, 17701, 20232, 47328]
    Input Words: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

    Summary
    Word Ids: [70595, 28738]
    Response Words: love it

    Examples of reviews and summaries:
    • Review(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!
    • Summary(1): great coffee
    • Review(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won't either!
    • Summary(2): omg gross gross
    • Review(3): love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets
    • Summary(3): love it
