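    The Transformer comes from the Google team's 2017 paper Attention Is All You Need.
    Key contribution: the Transformer relies on attention alone, unlike traditional encoder-decoder models, which need an RNN or a CNN. Its two novel building blocks are Scaled Dot-Product Attention and Multi-Head Attention.
    In my view the most accessible explanation of the Transformer is still The Illustrated Transformer; a lot of things clicked for me right after reading that blog post. Harvard has also published a detailed PyTorch implementation, with thorough explanations in a Jupyter notebook, and reading it is rewarding in its own way.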


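    The architecture diagram from Attention Is All You Need is shown below and gives a fairly detailed view of a single encoder and decoder. An encoder contains 2 sub-layers and a decoder contains 3 sub-layers; this structure is described in detail later.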

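    Seen from a global perspective, the encoder part is a stack of 6 encoders and the decoder part is a stack of 6 decoders. The inputs to the first encoder combine the embedding with the positional embedding. After passing through the 6 encoders, the output is fed to every decoder in the decoder part.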


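    One note: in the TensorFlow code the shape becomes (batch_size, max_len, vector_dimension). This simply adds batch_size, the number of input sentences; max_len caps the length of a sentence, and the last dimension is the size of the word vectors.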

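    As the figure above shows, the word embedding and the positional encoding are added together before the vectors enter the self-attention layer. Because the model uses neither an RNN nor a CNN, the paper relies on positional encoding to capture the order of the sequence. The positional encoding deserves a short explanation.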
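    The paper defines the encoding as PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch of that formula in plain Python (no TensorFlow needed) might look like the following; the name sinusoid_table and the small sizes are illustrative assumptions, not part of the code discussed in this post.

```python
import math

def sinusoid_table(max_len, d_model):
    """Return a max_len x d_model list of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            # columns 2i and 2i+1 share the same frequency 1 / 10000^(2i/d_model)
            angle = pos / math.pow(10000, 2 * (j // 2) / d_model)
            # even columns use sine, odd columns use cosine
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoid_table(max_len=10, d_model=8)
print(len(pe), len(pe[0]))   # 10 8
print(pe[0][:4])             # position 0: [0.0, 1.0, 0.0, 1.0]
```

    Each row is the vector that gets added to the word embedding at that position, so two identical words at different positions end up with different inputs.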




    • make_dic.py
    from __future__ import print_function
    from params import Params as pm
    import codecs
    import os
    from collections import Counter
    def make_dic(path, fname):
        """Constructs vocabulary as a dictionary.

        Args:
            path: [String], Input file path
            fname: [String], Output file name

        Builds the vocabulary line by line and writes it to the dictionary/ path.
        """
        # codecs.open() returns a file object; read() turns its content into a string
        text = codecs.open(path, 'r', 'utf-8').read()
        words = text.split()
        wordCount = Counter(words)
        # create the output directory on first use
        if not os.path.exists('dictionary'):
            os.mkdir('dictionary')
        with codecs.open('dictionary/{}'.format(fname), 'w', 'utf-8') as f:
            for word, count in wordCount.most_common(len(wordCount)):
                f.write(u"{}\t{}\n".format(word, count))
    if __name__ == '__main__':
        make_dic(pm.src_train, "en.vocab.tsv")
        make_dic(pm.tgt_train, "de.vocab.tsv")
        print("MSG : Constructing Dictionary Finished!")


    • data_load.py
    # -*- coding: utf-8 -*-
    from __future__ import print_function
    from params import Params as pm
    import codecs
    import sys
    import numpy as np
    import tensorflow as tf
    def load_vocab(vocab):  #  'en.vocab.tsv'  'de.vocab.tsv'
        """Load word tokens from the encoding dictionary.

        Args:
            vocab: [String], vocabulary file
        """
        vocab = [line.split()[0] for line in codecs.open('dictionary/{}'.format(vocab), 'r', 'utf-8').read().splitlines() if int(line.split()[1]) >= pm.word_limit_size]
        word2idx_dic = {word: idx for idx, word in enumerate(vocab)}
        idx2word_dic = {idx: word for idx, word in enumerate(vocab)}
        return word2idx_dic, idx2word_dic


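    Next comes generate_dataset, which builds the dataset by turning sentences into NumPy arrays. It takes the source sentences and the target sentences, and out-of-vocabulary words get id 1. The first half of the function converts words to indices and the second half pads. Since <pad> sits at position 0 in the vocabulary, any sentence shorter than maxlen (10 here) is filled with 0 in the missing positions so that all sentences have the same length.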

    def generate_dataset(source_sents, target_sents):
        """Parse source sentences and target sentences from the corpus.

        Parses the word tokens of each sentence, then pads the token lists.

        Args:
            source_sents: [List], encoding sentences from src-train file
            target_sents: [List], decoding sentences from tgt-train file
        """
        en2idx, idx2en = load_vocab('en.vocab.tsv')
        de2idx, idx2de = load_vocab('de.vocab.tsv')
        in_list, out_list, Sources, Targets = [], [], [], []
        for source_sent, target_sent in zip(source_sents, target_sents):
            # 1 means <UNK>
            inpt = [en2idx.get(word, 1) for word in (source_sent + u" <EOS>").split()]
            outpt = [de2idx.get(word, 1) for word in (target_sent + u" <EOS>").split()]
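            if max(len(inpt), len(outpt)) <= pm.maxlen:
                # sentence token list
                in_list.append(np.array(inpt))
                out_list.append(np.array(outpt))
                # sentence list
                Sources.append(source_sent)
                Targets.append(target_sent)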
        X = np.zeros([len(in_list), pm.maxlen], np.int32)
        Y = np.zeros([len(out_list), pm.maxlen], np.int32)
        for i, (x, y) in enumerate(zip(in_list, out_list)):
            X[i] = np.lib.pad(x, (0, pm.maxlen - len(x)), 'constant', constant_values = (0, 0))
            Y[i] = np.lib.pad(y, (0, pm.maxlen - len(y)), 'constant', constant_values = (0, 0))
        return X, Y, Sources, Targets
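
    The index-then-pad idea can be tried without NumPy or the vocabulary files. The sketch below is only an illustration: encode_and_pad and toy_vocab are hypothetical names, and the ids follow the convention used in this post (0 for <PAD>, 1 for <UNK>, 2 for <STR>, 3 for <EOS>).

```python
PAD_ID, UNK_ID = 0, 1  # same ids as in the text: 0 pads, 1 marks out-of-vocabulary words

def encode_and_pad(sentence, word2idx, maxlen):
    """Map words to ids, append <EOS>, and right-pad with PAD_ID up to maxlen."""
    ids = [word2idx.get(w, UNK_ID) for w in (sentence + " <EOS>").split()]
    if len(ids) > maxlen:
        return None  # over-long sentences are skipped, as in generate_dataset
    return ids + [PAD_ID] * (maxlen - len(ids))

# hypothetical toy vocabulary; the real ids come from load_vocab
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "<STR>": 2, "<EOS>": 3, "hello": 4, "world": 5}
print(encode_and_pad("hello big world", toy_vocab, maxlen=6))  # [4, 1, 5, 3, 0, 0]
```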


    def load_data(l_data):
        """Read data from the input datasets.

        Args:
            l_data: [String], which dataset to tokenize: 'train' or 'test'
        """
        if l_data == 'train':
            en_sents = [line for line in codecs.open(pm.src_train, 'r', 'utf-8').read().split('\n') if line]
            de_sents = [line for line in codecs.open(pm.tgt_train, 'r', 'utf-8').read().split('\n') if line]
            if len(en_sents) == len(de_sents):
                inpt, outpt, Sources, Targets = generate_dataset(en_sents, de_sents)
            else:
                print("MSG : Source length is different from Target length.")
                sys.exit(1)  # stop here: inpt and outpt would be undefined below
            return inpt, outpt
        elif l_data == 'test':
            en_sents = [line for line in codecs.open(pm.src_test, 'r', 'utf-8').read().split('\n') if line]
            de_sents = [line for line in codecs.open(pm.tgt_test, 'r', 'utf-8').read().split('\n') if line]
            if len(en_sents) == len(de_sents):
                inpt, outpt, Sources, Targets = generate_dataset(en_sents, de_sents)
            else:
                print("MSG : Source length is different from Target length.")
                sys.exit(1)  # stop here: inpt, Sources and Targets would be undefined below
            return inpt, Sources, Targets
        else:
            print("MSG : Error when load data.")
            sys.exit(1)

    Here inpt and outpt have shape (total number of sentences, maxlen), and batch_num is the total number of batches to train on.
    The outputs x and y have shape (N, T) == (batch_size, maxlen).

    def get_batch_data():
        """A batch dataset generator."""
        inpt, outpt = load_data("train")
        batch_num   = len(inpt) // pm.batch_size
        inpt  = tf.convert_to_tensor(inpt, tf.int32)
        outpt = tf.convert_to_tensor(outpt, tf.int32)
        # parsing data into queue used for pipeline operations as a generator. 
        input_queues = tf.train.slice_input_producer([inpt, outpt])
        # multi-thread processing using batch
        x, y = tf.train.shuffle_batch(input_queues,
                                    num_threads = 8,
                                    batch_size = pm.batch_size,
                                    capacity = pm.batch_size * 64,
                                    min_after_dequeue = pm.batch_size * 32,
                                    allow_smaller_final_batch = False)
        return x, y, batch_num


    • modules.py
    # -*- coding: utf-8 -*-
    from __future__ import print_function
    import tensorflow as tf
    import numpy as np
    import math
    def normalize(inputs, epsilon = 1e-8, scope = "ln", reuse = None):
        """Implement layer normalization.

        Args:
            inputs: [Tensor], A tensor with two or more dimensions, where the first one is "batch_size"
            epsilon: [Float], A small number for preventing ZeroDivision Error
            scope: [String], Optional scope for "variable_scope"
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name

        Returns:
            A tensor with the same shape and data type as "inputs"
        """
        with tf.variable_scope(scope, reuse = reuse):
            inputs_shape = inputs.get_shape()  # e.g. for a tensor a of shape (2, 3), a.get_shape().as_list() returns [2, 3]
            params_shape = inputs_shape[-1 :]  # params_shape is just the last dimension
            # tf.nn.moments returns the mean and variance, the same statistics that tf.nn.batch_normalization takes as arguments
            mean, variance = tf.nn.moments(inputs, [-1], keep_dims = True)
            beta  = tf.Variable(tf.zeros(params_shape))
            gamma = tf.Variable(tf.ones(params_shape))
            normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
            outputs = gamma * normalized + beta
        return outputs

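    Both self-attention and the feed-forward neural network are followed by a normalization step. The epsilon parameter keeps the denominator from being 0 during normalization, and scope is the variable namespace. As the code shows, the formula has the same form as batch normalization, except that the mean and variance are taken over the last (feature) dimension of each position rather than over the batch, which is what makes it layer normalization. The goal is to reduce oscillation during backpropagation in training. Many complex networks add normalization after their outputs.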
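    To see what the formula does to a single feature vector, here is a small sketch in plain Python. It mirrors the arithmetic of normalize for one position only; the function name layer_norm and the scalar gamma and beta are simplifying assumptions (in the TensorFlow code they are learned vectors).

```python
import math

def layer_norm(vec, gamma=1.0, beta=0.0, epsilon=1e-8):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    mean = sum(vec) / len(vec)
    variance = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [gamma * (v - mean) / math.sqrt(variance + epsilon) + beta for v in vec]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
```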

    def positional_encoding(inputs,
                            vocab_size,
                            num_units,
                            zero_pad = True,
                            scale = True,
                            scope = "positional_embedding",
                            reuse = None):
        """Positional encoding for a given tensor.

        Args:
            inputs: [Tensor], A tensor contains the ids to be search from the lookup table, shape = [batch_size, 1 + len(inpt)]
            vocab_size: [Int], Vocabulary size
            num_units: [Int], Hidden size of embedding
            zero_pad: [Boolean], If True, all the values of the first row(id = 0) should be constant zero
            scale: [Boolean], If True, the output will be multiplied by sqrt num_units(check details from paper)
            scope: [String], Optional scope for 'variable_scope'
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name

        Returns:
            A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'

        Notes:
            inputs has shape (batch_size, 1 + len(inputs)), so N is batch_size and T is maxlen, which is 10
            num_units is the number of hidden units, i.e. the vector dimension
        """
        N, T = inputs.get_shape().as_list()
        with tf.variable_scope(scope, reuse = reuse):
            position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])
            # First part of the PE function: sin and cos argument
            position_enc = np.array([
                [pos / np.power(10000, 2.*i/num_units) for i in range(num_units)]
                for pos in range(T)])
            # Second part, apply the sine to even columns and the cosine to odd columns.
            position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
            position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1
            # Convert to a tensor
            lookup_table = tf.convert_to_tensor(position_enc)
            if zero_pad:
                lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                              lookup_table[1:, :]), 0)
            outputs = tf.nn.embedding_lookup(lookup_table, position_ind)
            if scale:
                outputs = outputs * num_units**0.5
        return tf.cast(outputs, tf.float32)

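    This code implements the positional embedding from the architecture diagram. Because no sequential model is used, embedding the positions is what lets the model capture positional information.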

    def embedding(inputs,
                vocab_size,
                num_units,
                zero_pad = True,
                scale = True,
                scope = "embedding",
                reuse = None):
        """Embed a given tensor.

        Args:
            inputs: [Tensor], A tensor contains the ids to be search from the lookup table
            vocab_size: [Int], Vocabulary size
            num_units: [Int], Hidden size of embedding
            zero_pad: [Boolean], If True, all the values of the first row(id = 0) should be constant zero
            scale: [Boolean], If True, the output will be multiplied by sqrt num_units(check details from paper)
            scope: [String], Optional scope for 'variable_scope'
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name

        Returns:
            A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'

        Notes:
            inputs arrives with shape (batch_size, 10)
            lookup_table has shape (vocab_size, 512) and is randomly initialized
        """
        # shape = [vocab_size, num_units]
        with tf.variable_scope(scope, reuse = reuse):
            lookup_table = tf.get_variable('lookup_table',
                                            dtype = tf.float32,
                                            shape = [vocab_size, num_units],
                                            initializer = tf.contrib.layers.xavier_initializer())
            if zero_pad:
                # tf.zeros has shape (1, 512).
                # lookup_table[1:, :] drops the <PAD> row; a zero row takes its place and the two are concatenated,
                # so lookup_table still has shape (vocab_size, 512).
                lookup_table = tf.concat((tf.zeros(shape = [1, num_units]),  lookup_table[1:, :]), 0)
            # outputs has shape (batch_size, 10, 512) == [N, T, S]
            outputs = tf.nn.embedding_lookup(lookup_table, inputs)
            if scale:
                # scale the embeddings by sqrt(num_units), as in the paper
                outputs = outputs * math.sqrt(num_units)
        return outputs

    Input shape: (batch_size, maxlen) == [N, T]
    Output shape: (batch_size, maxlen, S) == [N, T, S]
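
    The lookup, the zero row for <PAD> and the sqrt(num_units) scaling can be illustrated on plain lists. This is a sketch under assumptions: embed and the tiny 3-row table are hypothetical, and a real table would be a trained variable.

```python
import math

def embed(ids, lookup_table, zero_pad=True, scale=True):
    """Look up one vector per id; id 0 maps to zeros when zero_pad is set."""
    num_units = len(lookup_table[0])
    out = []
    for i in ids:
        vec = [0.0] * num_units if (zero_pad and i == 0) else list(lookup_table[i])
        if scale:
            vec = [v * math.sqrt(num_units) for v in vec]
        out.append(vec)
    return out

# hypothetical 3-word table with num_units = 4; row 0 is ignored because of zero_pad
table = [[9.0, 9.0, 9.0, 9.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
print(embed([2, 1, 0], table))
# [[0.0, 1.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

    A sentence of T ids therefore becomes T vectors of size num_units, which matches the [N, T] to [N, T, S] shape change once a batch dimension is added.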

    Next is multi-head attention, the core of this code. The comments spell out how the shapes change along the way.
    The final output has shape [N, T_q, S].

    def multihead_attention(queries,
                            keys,
                            num_units = None,
                            num_heads = 8,
                            dropout_rate = 0,
                            is_training = True,
                            causality = False,
                            scope = "multihead_attention",
                            reuse = None):
        """Implement multihead attention.

        Args:
            queries: [Tensor], A 3-dimensions tensor with shape of [N, T_q, S_q]
            keys: [Tensor], A 3-dimensions tensor with shape of [N, T_k, S_k]
            num_units: [Int], Attention size
            num_heads: [Int], Number of heads
            dropout_rate: [Float], A ratio of dropout
            is_training: [Boolean], If true, controller of mechanism for dropout
            causality: [Boolean], If true, units that reference the future are masked
            scope: [String], Optional scope for "variable_scope"
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name

        Returns:
            A 3-dimensions tensor with shape of [N, T_q, S]

        Notes:
            In the encoder, queries = self.enc with shape (batch_size, 10, 512) == [N, T_q, S], and keys is self.enc as well
            num_units = 512, num_heads = 8
        """
        with tf.variable_scope(scope, reuse = reuse):
            if num_units is None:
                # default to the last dimension of queries (the hidden size)
                num_units = queries.get_shape().as_list()[-1]
            # Linear layers in Figure 2 (right): the Linear step applied to Q, K and V before Scaled Dot-Product Attention.
            # First comes a fully connected linear transformation.
            # shape = [N, T_q, S], i.e. (batch_size, 10, 512), so S can be read as 512
            Q = tf.layers.dense(queries, num_units, activation = tf.nn.relu)
            # shape = [N, T_k, S]
            K = tf.layers.dense(keys, num_units, activation = tf.nn.relu)
            # shape = [N, T_k, S]
            V = tf.layers.dense(keys, num_units, activation = tf.nn.relu)
            # after the split below: shape (batch_size*8, 10, 512/8=64)
            # Split and concat
            # shape = [N*h, T_q, S/h]
            Q_ = tf.concat(tf.split(Q, num_heads, axis = 2), axis = 0)
            # shape = [N*h, T_k, S/h]
            K_ = tf.concat(tf.split(K, num_heads, axis = 2), axis = 0)
            # shape = [N*h, T_k, S/h]
            V_ = tf.concat(tf.split(V, num_heads, axis = 2), axis = 0)
            # Batched matrix multiplication: [N*h, T_q, S/h] x [N*h, S/h, T_k], one product per batch entry
            # shape = [N*h, T_q, T_k]
            outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))
            # Scale
            outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
            # Masking
            # shape = [N, T_k]
            # tf.reduce_sum collapses the last axis (3-D to 2-D); sign(abs(.)) then maps every entry to 0 or 1,
            # which is 0 only where the key vector sums to zero (padding)
            # [N, T_k, 512] -> [N, T_k] -> [N*h, T_k] -> [N*h, T_q, T_k]
            key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis = -1)))
            # shape = [N*h, T_k]
            key_masks = tf.tile(key_masks, [num_heads, 1])
            # shape = [N*h, T_q, T_k]; tf.expand_dims inserts a new axis
            key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])
            # Where key_masks == 0, replace the score with a very large negative number so softmax gives it ~0 weight
            paddings = tf.ones_like(outputs) * (-math.pow(2, 32) + 1)
            # shape = [N*h, T_q, T_k]
            outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)
            if causality:  # if True, mask out the units that lie in the future
                # reduce dims : shape = [T_q, T_k]
                diag_vals = tf.ones_like(outputs[0, :, :])
                # shape = [T_q, T_k]
                # use triangular matrix to ignore the affect from future words
                # like : [[1,0,0]
                #         [1,1,0]
                #         [1,1,1]]
                tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()
                # shape = [N*h, T_q, T_k]
                masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])
                paddings = tf.ones_like(masks) * (-math.pow(2, 32) + 1)
                # shape = [N*h, T_q, T_k]
                outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
            # Output Activation
            outputs = tf.nn.softmax(outputs)
            # Query Masking
            # shape = [N, T_q]
            query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis = -1)))
            # shape = [N*h, T_q]
            query_masks = tf.tile(query_masks, [num_heads, 1])
            # shape = [N*h, T_q, T_k]
            query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])
            outputs *= query_masks 
            # Dropouts
            outputs = tf.layers.dropout(outputs, rate = dropout_rate, training = tf.convert_to_tensor(is_training))
            # Weighted sum
            # shape = [N*h, T_q, S/h]
            outputs = tf.matmul(outputs, V_)
            # Restore shape
            # shape = [N, T_q, S]
            outputs = tf.concat(tf.split(outputs, num_heads, axis = 0), axis = 2)
            # Residual connection
            outputs += queries
            # Normalize
            # shape = [N, T_q, S]
            outputs = normalize(outputs)
        return outputs
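
    The masking logic is easier to follow on a single head with tiny inputs. The following is a sketch in plain Python, not the TensorFlow implementation above: attention, softmax and key_is_pad are illustrative names, the linear projections, head splitting, query masking, dropout, residual connection and normalization are all left out, and only scaling, key masking, causal masking and the weighted sum remain.

```python
import math

NEG_INF = -2.0 ** 32 + 1  # same large negative constant as in the code above

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, key_is_pad, causal=False):
    """Scaled dot-product attention for one head; Q is T_q x d, K and V are T_k x d."""
    d = len(K[0])
    out = []
    for qi, q in enumerate(Q):
        scores = []
        for ki, k in enumerate(K):
            s = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            # padded keys, and future keys when causal, get a huge negative score
            if key_is_pad[ki] or (causal and ki > qi):
                s = NEG_INF
            scores.append(s)
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # third position is padding
result = attention(Q, K, V, key_is_pad=[False, False, True], causal=True)
print(result[0])  # first query can only see itself: [1.0, 0.0]
```

    With causal=True the first query attends only to the first key, which is exactly what the lower-triangular mask enforces in the decoder's self-attention.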

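    A ReLU non-linearity sits between the two layers (kernel-size-1 convolutions in the commented-out version, fully connected layers in the active code). After that comes the residual step, which adds inputs back, followed by normalize. The final output shape is still [N, T_q, S].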

    def feedforward(inputs,
                    num_units = [2048, 512],
                    scope = "multihead_attention",
                    reuse = None):
        """Position-wise feed forward neural network.

        Args:
            inputs: [Tensor], A 3d tensor with shape [N, T, S]
            num_units: [Int], A list of convolution parameters
            scope: [String], Optional scope for "variable_scope"
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name

        Returns:
            A tensor converted by feedforward layers from inputs
        """
        with tf.variable_scope(scope, reuse = reuse):
            # params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1, \
                      # "activation": tf.nn.relu, "use_bias": True}
            # outputs = tf.layers.conv1d(inputs = inputs, filters = num_units[0], kernel_size = 1, activation = tf.nn.relu, use_bias = True)
            # outputs = tf.layers.conv1d(**params)
            params = {"inputs": inputs, "num_outputs": num_units[0], \
                      "activation_fn": tf.nn.relu}
            outputs = tf.contrib.layers.fully_connected(**params)
            # params = {"inputs": inputs, "filters": num_units[1], "kernel_size": 1, \
            #         "activation": None, "use_bias": True}
            # the second layer consumes the output of the first layer
            params = {"inputs": outputs, "num_outputs": num_units[1], \
                      "activation_fn": None}
            # outputs = tf.layers.conv1d(inputs = inputs, filters = num_units[1], kernel_size = 1, activation = None, use_bias = True)
            # outputs = tf.layers.conv1d(**params)
            outputs = tf.contrib.layers.fully_connected(**params)
            # residual connection
            outputs += inputs
            outputs = normalize(outputs)
        return outputs
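
    For a single position the computation is max(0, x W1 + b1) W2 + b2 plus the residual x. A minimal sketch in plain Python follows; feed_forward and the hand-picked weights are hypothetical, and the final normalize step is omitted to keep the numbers easy to check.

```python
def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN for one position: max(0, x W1 + b1) W2 + b2, plus the residual x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(W1, b1)]          # inner layer with ReLU
    out = [sum(h * w for h, w in zip(hidden, col)) + b
           for col, b in zip(W2, b2)]             # linear output layer
    return [o + xi for o, xi in zip(out, x)]      # residual connection

# hypothetical weights, stored as one list per output unit: model size 2, inner size 3
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b2 = [0.5, -0.5]
print(feed_forward([1.0, 2.0], W1, b1, W2, b2))  # [2.5, 3.5]
```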


    def label_smoothing(inputs, epsilon = 0.1):
        """Implement label smoothing.

        Args:
            inputs: [Tensor], A 3d tensor with shape of [N, T, V]
            epsilon: [Float], Smoothing rate

        Returns:
            A tensor after smoothing

        Notes:
            inputs should have shape (batch_size, sentence_length, vector dimension):
            N is batch_size, T is the sentence length and V is the size of the last dimension
        """
        K = inputs.get_shape().as_list()[-1]
        return ((1 - epsilon) * inputs) + (epsilon / K)
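
    Applied to a single one-hot vector, the same formula gives the following. This is a plain-Python sketch; smooth is an illustrative name and the 4-class vector is made up.

```python
def smooth(one_hot, epsilon=0.1):
    """Move epsilon of the probability mass from the hard label to a uniform distribution."""
    K = len(one_hot)
    return [(1 - epsilon) * p + epsilon / K for p in one_hot]

smoothed = smooth([0.0, 0.0, 1.0, 0.0])
print([round(p, 3) for p in smoothed])  # [0.025, 0.025, 0.925, 0.025]
```

    The values still sum to 1, but the correct class no longer has probability exactly 1, which discourages over-confident predictions.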


    • train.py
      self.decoder_input is built by prepending the id 2 (<STR>) to every sentence and then dropping the last token, the end-of-sentence marker. Its shape is therefore still [N, T].
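      On one sentence, the shift looks like this. The sketch is plain Python; shift_right and the sample ids are hypothetical, with 2 standing for <STR>, 3 for <EOS> and 0 for <PAD> as elsewhere in this post.

```python
START_ID = 2  # id of <STR>, as stated in the text

def shift_right(target_ids):
    """Prepend START_ID and drop the last token so the length stays T."""
    return [START_ID] + target_ids[:-1]

target = [7, 8, 9, 3, 0, 0]   # hypothetical padded target: three words, <EOS>, two pads
print(shift_right(target))    # [2, 7, 8, 9, 3, 0]
```

      At step t the decoder thus sees the target tokens before t and is trained to predict the token at t.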
    # -*- coding: utf-8 -*-
    from __future__ import print_function
    import tensorflow as tf
    from params import Params as pm
    from data_loader import get_batch_data, load_vocab
    from modules import *
    from tqdm import tqdm
    import os
    class Graph():
        # the whole graph is built in __init__
        def __init__(self, is_training = True):
            self.graph = tf.Graph()
            with self.graph.as_default():
                if is_training:
                    self.inpt, self.outpt, self.batch_num = get_batch_data()
                else:
                    # inpt: (None, maxlen), outpt: (None, maxlen), with maxlen = 10
                    self.inpt = tf.placeholder(tf.int32, shape = (None, pm.maxlen))
                    self.outpt = tf.placeholder(tf.int32, shape = (None, pm.maxlen))
                # start with 2(<STR>) and without 3(<EOS>)
                self.decoder_input = tf.concat((tf.ones_like(self.outpt[:, :1])*2, self.outpt[:, :-1]), -1)
                # load the en and de vocabularies directly; en has 1644 entries, de has 1588
                en2idx, idx2en = load_vocab('en.vocab.tsv')
                de2idx, idx2de = load_vocab('de.vocab.tsv')
                # Encoder
                with tf.variable_scope("encoder"):
                    # self.inpt has shape (batch_size, maxlen)
                    # self.enc has shape (batch_size, maxlen, 512)
                    self.enc = embedding(self.inpt,
                                        vocab_size = len(en2idx),
                                        num_units  = pm.hidden_units,
                                        scale = True,
                                        scope = "enc_embed")
                    # Position Encoding(use range from 0 to len(inpt) to represent position dim of each words)
                    # tf.tile(tf.expand_dims(tf.range(tf.shape(self.inpt)[1]), 0), [tf.shape(self.inpt)[0], 1]),
                    self.enc += positional_encoding(self.inpt,
                                        vocab_size = pm.maxlen,
                                        num_units  = pm.hidden_units,
                                        zero_pad   = False,
                                        scale = False,
                                        scope = "enc_pe")
                    # Dropout
                    self.enc = tf.layers.dropout(self.enc,
                                                rate = pm.dropout,
                                                training = tf.convert_to_tensor(is_training))
                    # Identical
                    for i in range(pm.num_identical):
                        with tf.variable_scope("num_identical_{}".format(i)):
                            # Multi-head Attention
                            self.enc = multihead_attention(queries = self.enc,
                                                            keys   = self.enc,
                                                            num_units = pm.hidden_units,
                                                            num_heads = pm.num_heads,
                                                            dropout_rate = pm.dropout,
                                                            is_training  = is_training,
                                                            causality = False)
                            self.enc = feedforward(self.enc, num_units = [4 * pm.hidden_units, pm.hidden_units])

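    Below is the decoder part of the code. As in the decoder structure described earlier, each decoder has one extra attention sub-layer: it takes the tensor output by the encoder together with the tensor produced by the decoder's self-attention, and performs vanilla attention over them.
    The final output tensor of the decoder part has shape [N, T, 512].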

                # Decoder
                with tf.variable_scope("decoder"):
                    self.dec = embedding(self.decoder_input,
                                    vocab_size = len(de2idx),
                                    num_units  = pm.hidden_units,
                                    scale = True,
                                    scope = "dec_embed")
                    # Position Encoding(use range from 0 to len(inpt) to represent position dim)
                    self.dec += positional_encoding(self.decoder_input,
                                        vocab_size = pm.maxlen,
                                        num_units = pm.hidden_units,
                                        zero_pad  = False,
                                        scale = False,
                                        scope = "dec_pe")
                    # Dropout
                    self.dec = tf.layers.dropout(self.dec,
                                                rate = pm.dropout,
                                                training = tf.convert_to_tensor(is_training))
                    # Identical
                    for i in range(pm.num_identical):
                        with tf.variable_scope("num_identical_{}".format(i)):
                            # Multi-head Attention(self-attention)
                            self.dec = multihead_attention(queries = self.dec,
                                                            keys   = self.dec,
                                                            num_units = pm.hidden_units,
                                                            num_heads = pm.num_heads,
                                                            dropout_rate = pm.dropout,
                                                            is_training  = is_training,
                                                            causality = True,
                                                            scope = "self_attention")
                            # Multi-head Attention(vanilla-attention)
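                            # keys come from the encoder output; the remaining arguments mirror the self-attention call
                            self.dec = multihead_attention(queries = self.dec,
                                                            keys   = self.enc,
                                                            num_units = pm.hidden_units,
                                                            num_heads = pm.num_heads,
                                                            dropout_rate = pm.dropout,
                                                            is_training  = is_training,
                                                            causality = False,
                                                            scope = "vanilla_attention")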
                            self.dec = feedforward(self.dec, num_units = [4 * pm.hidden_units, pm.hidden_units])

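    self.logits: the result of the final Linear transformation, shape [N, T, len(de2idx)]
    self.preds: the index of the maximum along the last dimension of self.logits, shape [N, T]
    self.istarget: 1.0 at every position where self.outpt has a non-zero id (not padding) and 0.0 elsewhere, shape [N, T]
    self.acc: compares self.preds with self.outpt position by position (1.0 where they match, 0 otherwise) and averages over the non-padding positions.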
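
    The role of istarget is easiest to see on one sentence. A sketch in plain Python, with masked_accuracy and the sample ids being made up for illustration:

```python
def masked_accuracy(preds, targets, pad_id=0):
    """Fraction of non-padding target positions where the prediction matches."""
    istarget = [1.0 if t != pad_id else 0.0 for t in targets]
    hits = [float(p == t) * m for p, t, m in zip(preds, targets, istarget)]
    return sum(hits) / sum(istarget)

# hypothetical ids for one sentence: 4 real tokens followed by 2 pads
print(masked_accuracy(preds=[5, 6, 9, 3, 0, 4], targets=[5, 6, 7, 3, 0, 0]))  # 0.75
```

    Predictions at the two padded positions do not affect the result, whether they happen to match or not.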

                # Linear
                self.logits   = tf.layers.dense(self.dec, len(de2idx))
                self.preds    = tf.to_int32(tf.arg_max(self.logits, dimension = -1))
                self.istarget = tf.to_float(tf.not_equal(self.outpt, 0))
                self.acc      = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.outpt)) * self.istarget) / (tf.reduce_sum(self.istarget))
                tf.summary.scalar('acc', self.acc)

    When is_training is True, i.e. during training, the following operations are needed as well.
    loss has shape [N, T].
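
    The per-position loss is then averaged over the non-padding positions only. The sketch below shows the idea in plain Python for a single sentence; masked_mean_loss and the toy numbers are assumptions for illustration, and the labels are left unsmoothed to keep the arithmetic simple.

```python
import math

def masked_mean_loss(logits, soft_labels, targets, pad_id=0):
    """Mean softmax cross-entropy over non-padding positions."""
    total, count = 0.0, 0.0
    for row, labels, t in zip(logits, soft_labels, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        loss = -sum(y * (v - log_z) for y, v in zip(labels, row))  # one value per position
        if t != pad_id:
            total += loss
            count += 1.0
    return total / count

# hypothetical 2-position example with vocabulary size 2; the second position is padding
logits = [[0.0, 0.0], [5.0, -5.0]]
labels = [[0.0, 1.0], [1.0, 0.0]]
print(round(masked_mean_loss(logits, labels, targets=[1, 0]), 4))  # 0.6931
```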

                if is_training:
                    # smooth inputs
                    self.y_smoothed = label_smoothing(tf.one_hot(self.outpt, depth = len(de2idx)))
                    # loss function
                    self.loss = tf.nn.softmax_cross_entropy_with_logits(logits = self.logits, labels = self.y_smoothed)
                    self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / (tf.reduce_sum(self.istarget))
                    self.global_step = tf.Variable(0, name = 'global_step', trainable = False)
                    # optimizer
                    self.optimizer = tf.train.AdamOptimizer(learning_rate = pm.learning_rate, beta1 = 0.9, beta2 = 0.98, epsilon = 1e-8)
                    self.train_op  = self.optimizer.minimize(self.mean_loss, global_step = self.global_step)
                    tf.summary.scalar('mean_loss', self.mean_loss)
                    self.merged = tf.summary.merge_all()   
    if __name__ == '__main__':
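        # en2idx {'<PAD>': 0, ...} and idx2en {0: '<PAD>', ...} are both dictionaries of length 1684
        # de2idx {'<PAD>': 0, ...} and idx2de {0: '<PAD>', ...} are both dictionaries of length 1597
        en2idx, idx2en = load_vocab('en.vocab.tsv')
        de2idx, idx2de = load_vocab('de.vocab.tsv')
        # is_training expects a Boolean; the string "train" only worked because it is truthy
        g = Graph(is_training = True)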
        print("MSG : Graph loaded!")
        # save model and use this model to training
        supvisor = tf.train.Supervisor(graph = g.graph,logdir = pm.logdir,save_model_secs = 0)
        with supvisor.managed_session() as sess:
            for epoch in range(1, pm.num_epochs + 1):
                if supvisor.should_stop():
                    break
                # process bar
                for step in tqdm(range(g.batch_num), total = g.batch_num, ncols = 70, leave = False, unit = 'b'):
                    sess.run(g.train_op)
                if not os.path.exists(pm.checkpoint):
                    os.mkdir(pm.checkpoint)
                g_step = sess.run(g.global_step)
                supvisor.saver.save(sess, pm.checkpoint + '/model_epoch_%02d_gs_%d' % (epoch, g_step))
        print("MSG : Done!")


