A Study Summary of the Transformer Model


Author: 奔向算法的喵 | Published 2018-12-13 15:08

    The Transformer comes from the Google team's 2017 paper Attention is all you need.
    The paper's goal: reduce the amount of computation and improve parallelism without hurting the final results.
    The innovation: the Transformer relies entirely on attention, unlike traditional encoder-decoder models that combine an RNN or CNN. Its two key building blocks are Scaled Dot-Product Attention and Multi-Head Attention.
    In my view the clearest explanation of the Transformer is still The illustrated transformer; after reading that post a lot of things click at once. Harvard has also released a detailed PyTorch implementation with a fully annotated Jupyter notebook (The Annotated Transformer), which is well worth going through.

    I. How the model works

    1. The model architecture
    The architecture figure from Attention is all you need shows one encoder block and one decoder block in some detail: each encoder block contains 2 sub-layers and each decoder block contains 3 sub-layers. This structure is described in detail below.


    Taking a global view, the encoder and decoder sides each stack 6 identical blocks. The input to the first encoder block is the sum of the token embeddings and the positional embeddings. After passing through all 6 encoder blocks, the output is fed into every block of the decoder. A minimal pseudo-structure sketch of this stacking is given below.
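
    To make the stacking concrete, here is a minimal pseudo-structure sketch (my own illustration, not code from the repo); encoder_layer and decoder_layer stand in for the sub-layer stacks described below.

    def transformer(src_emb, tgt_emb, encoder_layer, decoder_layer, num_layers = 6):
        # src_emb / tgt_emb: token embeddings plus positional encodings
        enc = src_emb
        for _ in range(num_layers):
            enc = encoder_layer(enc)            # self-attention + feed-forward
        dec = tgt_emb
        for _ in range(num_layers):
            dec = decoder_layer(dec, enc)       # masked self-attn + encoder-decoder attn + feed-forward
        return dec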

    2. The encoder
    The Transformer's two novel structures are shown in the figure below; once these two parts are understood, everything that follows becomes much clearer.

    The Q, K and V above are abstract vectors used to carry out and support the attention computation. From the paper, attention is computed as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
    Note that in the TensorFlow code the tensors have shape (batch_size, max_len, vector_dimension): a batch_size axis is added for the number of input sentences, max_len caps the number of tokens per sentence, and the last axis is the token-embedding dimension.
    The figure shows x1 passing through self-attention and becoming z1. The tensor coming out of self-attention still has to go through a residual connection and LayerNorm, and then into the position-wise feed-forward network, which gets the same residual-plus-normalization treatment. Only then does the output tensor truly enter the next encoder block; after six such blocks the result finally reaches the decoder. A minimal sketch of the attention computation itself is given below.
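
    The following is a minimal NumPy sketch of scaled dot-product attention (toy shapes, no masking or dropout; my own illustration rather than the repo's TensorFlow code):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (batch, T_q, T_k)
        weights = np.exp(scores - scores.max(axis = -1, keepdims = True))
        weights /= weights.sum(axis = -1, keepdims = True)        # row-wise softmax
        return weights @ V                                        # (batch, T_q, d_v)

    Q = np.random.randn(2, 10, 64)    # (batch_size, max_len, d_k)
    K = np.random.randn(2, 10, 64)
    V = np.random.randn(2, 10, 64)
    out = scaled_dot_product_attention(Q, K, V)    # shape (2, 10, 64)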

    As the figure shows, before the vectors enter the self-attention layer, the token embedding and the positional encoding are added together. Because the model uses no RNN or CNN, the paper relies on positional encodings to recover the order information of the sequence. The positional encoding is defined as PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).

    The vector produced by the last decoder block passes through a Linear layer and a softmax layer. The Linear layer maps the decoder output to a logits vector over the vocabulary, the softmax layer turns those logits into probabilities, and the position of the largest probability gives the decoded output token. A small sketch of this final step follows.
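
    A quick NumPy sketch of that final projection step (my own illustration; the sizes 512 and 1588 are the hidden size and target-vocabulary size that appear later in this repo):

    import numpy as np

    d_model, vocab_size = 512, 1588
    dec_out = np.random.randn(d_model)                 # decoder output for one position
    W = np.random.randn(d_model, vocab_size) * 0.01    # weights of the Linear layer
    logits = dec_out @ W                               # logits over the vocabulary
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                               # softmax -> probabilities
    pred_id = int(np.argmax(probs))                    # id of the most probable token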

    II. Code walkthrough

    The code analysed here comes from https://github.com/EternalFeather/Transformer-in-generating-dialogue. The files in the repository and their roles are:

    File            Purpose
    params.py       Defines the hyperparameters used by the model, e.g. the learning rate and the number of hidden units.
    make_dic.py     Data preprocessing: generates the vocabulary files for the source and target languages.
    data_load.py    Contains all functions for loading the data and batching it.
    modules.py      The core code: embedding and positional embedding, multi-head attention, normalization, and so on.
    train.py        Training code: defines the model, the loss function, and the training / checkpointing procedure.
    eval.py         Evaluates the model's performance after training.
    1. The model's hyperparameters (all tunable)
    # -*- coding: utf-8 -*-
    class Params:
        '''
        Parameters of our model
        '''
        src_train = "data/src-train.txt"
        tgt_train = "data/tgt-train.txt"
        src_test  = "data/src-val.txt"
        tgt_test  = "data/tgt-val.txt"
    
        num_identical = 6
    
        maxlen       = 10
        hidden_units = 512
        num_heads    = 8
    
        logdir = 'logdir'
        batch_size = 32
        num_epochs = 250
        dropout    = 0.1
        learning_rate = 0.0001
    
        word_limit_size  = 20
        word_limit_lower = 3
    
        checkpoint = 'checkpoint'
    
    
    Parameter        Value    In the code     Meaning
    batch_size       32       N               Batch size
    learning_rate    0.0001   lr              Learning rate
    maxlen           10       T, T_q, T_k     Maximum number of tokens per sentence
    word_limit_size  20                       Words appearing fewer than 20 times are treated as <UNK>
    hidden_units     512      num_units, S    Number of hidden units / embedding dimension
    num_identical    6                        Number of stacked encoder / decoder blocks
    num_epochs       250                      Total number of training epochs
    num_heads        8                        Number of heads in multi-head attention
    dropout          0.1                      Dropout rate

    2. Data preprocessing

    • make_dic.py
    from __future__ import print_function
    from params import Params as pm
    import codecs
    import os
    from collections import Counter
    
    def make_dic(path, fname):
        '''
        Constructs vocabulary as a dictionary
    
        Args:
            path: [String], Input file path
            fname: [String], Output file name
    
        Build vocabulary line by line to dictionary/ path
        '''
        text = codecs.open(path, 'r', 'utf-8').read()  # codecs.open() returns a file object; read() turns it into a single string
        words = text.split()
        wordCount = Counter(words)
        if not os.path.exists('dictionary'):
            os.mkdir('dictionary')
        with codecs.open('dictionary/{}'.format(fname), 'w', 'utf-8') as f:
            f.write("{}\t1000000000\n{}\t1000000000\n{}\t1000000000\n{}\t1000000000\n".format("<PAD>","<UNK>","<STR>","<EOS>"))
            for word, count in wordCount.most_common(len(wordCount)):
                f.write(u"{}\t{}\n".format(word, count))
    
    if __name__ == '__main__':
        make_dic(pm.src_train, "en.vocab.tsv")
        make_dic(pm.tgt_train, "de.vocab.tsv")
        print("MSG : Constructing Dictionary Finished!")
    

    Running this file produces the two files en.vocab.tsv and de.vocab.tsv. Part of en.vocab.tsv looks like this:

    <PAD>   1000000000
    <UNK>   1000000000
    <STR>   1000000000
    <EOS>   1000000000
    有   17300
    的   15767
    `   12757
    -   10831
    卦   8461
    八   7865
    麼   7771
    沒   7324
    嗎   6024
    是   5940
    ......
    ASCII   1
    SAISONduSOLEIL  1
    豌   1
    迺   1
    ThuDec2223  1
    snis    1
    Ya  1
    2100    1
    雇   1
    

    Its main job is to count how many times each word appears and sort the words by frequency; the resulting files are then consumed by the data_load module.

    • data_load.py
    # -*- coding: utf-8 -*-
    from __future__ import print_function
    from params import Params as pm
    import codecs
    import sys
    import numpy as np
    import tensorflow as tf
    
    def load_vocab(vocab):  #  'en.vocab.tsv'  'de.vocab.tsv'
        '''
        Load word token from encoding dictionary
        Args:
            vocab: [String], vocabulary files
        ''' 
        vocab = [line.split()[0] for line in codecs.open('dictionary/{}'.format(vocab), 'r', 'utf-8').read().splitlines() if int(line.split()[1]) >= pm.word_limit_size]
        word2idx_dic = {word: idx for idx, word in enumerate(vocab)}
        idx2word_dic = {idx: word for idx, word in enumerate(vocab)}
        return word2idx_dic, idx2word_dic
    

    The load_vocab function processes the frequency-sorted vocabulary files produced above. After processing we get:

    #en.vocab:
    word2idx ={'<PAD>': 0, '<UNK>': 1, '<STR>': 2, '<EOS>': 3, '有': 4, 
    '的': 5, '`': 6, '-': 7, '卦': 8, '八': 9, ..., '爬': 1642, 'U': 1643}
    idx2word={0: '<PAD>', 1: '<UNK>', 2: '<STR>', 3: '<EOS>', 4: '有', 
    5: '的', 6: '`', 7: '-', 8: '卦', 9: '八', ..., 1642: '爬', 1643: 'U'}
    
    #de.vocab
    word2idx={'<PAD>': 0, '<UNK>': 1, '<STR>': 2, '<EOS>': 3, '-': 4, 
    '的': 5, '`': 6, '不': 7, '人': 8, '好': 9, ..., '遺': 1586, '搜': 1587}
    idx2word={0: '<PAD>', 1: '<UNK>', 2: '<STR>',3:'<EOS>', 4: '-', 
    5: '的', 6: '`', 7: '不', 8: '人', 9: '好', ..., 1586: '遺', 1587: '搜'}
    

    Next comes generate_dataset, which builds the dataset, i.e. turns the sentences into NumPy arrays. It takes the source sentences and the target sentences, and out-of-vocabulary words are given id 1 (<UNK>). The first half of the function does the index lookup; the second half does the padding. Since <PAD> sits at position 0 of the vocabulary, whenever a sentence is shorter than 10 tokens the remaining positions are filled with 0, so that every sentence ends up with the same length.
    The returned X and Y both have shape (number of sentences, maximum sentence length).

    def generate_dataset(source_sents, target_sents):
        '''
        Parse source sentences and target sentences from corpus with some formats
        Parse word token of each sentences
        Args:
            source_sents: [List], encoding sentences from src-train file
            target_sents: [List], decoding sentences from tgt-train file
    
        Padding for word token sentence list
        '''
        en2idx, idx2en = load_vocab('en.vocab.tsv')
        de2idx, idx2de = load_vocab('de.vocab.tsv')
    
        in_list, out_list, Sources, Targets = [], [], [], []
        for source_sent, target_sent in zip(source_sents, target_sents):
            # 1 means <UNK>
            inpt = [en2idx.get(word, 1) for word in (source_sent + u" <EOS>").split()]
            outpt = [de2idx.get(word, 1) for word in (target_sent + u" <EOS>").split()]
            if max(len(inpt), len(outpt)) <= pm.maxlen:
                # sentence token list
                in_list.append(np.array(inpt))
                out_list.append(np.array(outpt))
                # sentence list
                Sources.append(source_sent)
                Targets.append(target_sent)
    
        X = np.zeros([len(in_list), pm.maxlen], np.int32)
        Y = np.zeros([len(out_list), pm.maxlen], np.int32)
        for i, (x, y) in enumerate(zip(in_list, out_list)):
            X[i] = np.lib.pad(x, (0, pm.maxlen - len(x)), 'constant', constant_values = (0, 0))
            Y[i] = np.lib.pad(y, (0, pm.maxlen - len(y)), 'constant', constant_values = (0, 0))
    
        return X, Y, Sources, Targets
    

    The load_data function reads the raw training or test files and supplies the arguments for generate_dataset.

    def load_data(l_data):
        '''
        Read train-data from input datasets
    
        Args:
            l_data: [String], the file name of datasets which used to generate tokens
        '''
        if l_data == 'train':
            en_sents = [line for line in codecs.open(pm.src_train, 'r', 'utf-8').read().split('\n') if line]
            de_sents = [line for line in codecs.open(pm.tgt_train, 'r', 'utf-8').read().split('\n') if line]
            if len(en_sents) == len(de_sents):
                inpt, outpt, Sources, Targets = generate_dataset(en_sents, de_sents)
            else:
                print("MSG : Source length is different from Target length.")
                sys.exit(0)
            return inpt, outpt
        elif l_data == 'test':
            en_sents = [line for line in codecs.open(pm.src_test, 'r', 'utf-8').read().split('\n') if line]
            de_sents = [line for line in codecs.open(pm.tgt_test, 'r', 'utf-8').read().split('\n') if line]
            if len(en_sents) == len(de_sents):
                inpt, outpt, Sources, Targets = generate_dataset(en_sents, de_sents)
            else:
                print("MSG : Source length is different from Target length.")
                sys.exit(0)
            return inpt, Sources, Targets
        else:
            print("MSG : Error when load data.")
            sys.exit(0)
    

    get_batch_data uses the load_data function above, whose argument can be 'train' or 'test'.
    Here inpt and outpt have shape (total number of sentences, maxlen), and batch_num is the total number of batches we will train on.
    The returned x and y have shape (N, T) == (batch_size, maxlen).

    def get_batch_data():
        '''
        A batch dataset generator
        '''
        inpt, outpt = load_data("train")
    
        batch_num   = len(inpt) // pm.batch_size
    
        inpt  = tf.convert_to_tensor(inpt, tf.int32)
        outpt = tf.convert_to_tensor(outpt, tf.int32)
    
        # parsing data into queue used for pipeline operations as a generator. 
        input_queues = tf.train.slice_input_producer([inpt, outpt])
    
        # multi-thread processing using batch
        x, y = tf.train.shuffle_batch(input_queues,
                                    num_threads = 8,
                                    batch_size = pm.batch_size,
                                    capacity = pm.batch_size * 64,
                                    min_after_dequeue = pm.batch_size * 32,
                                    allow_smaller_final_batch = False)
    
        return x, y, batch_num
    

    3. The core: modules.py

    # -*- coding: utf-8 -*-
    from __future__ import print_function
    import tensorflow as tf
    import numpy as np
    import math
    
    def normalize(inputs, epsilon = 1e-8, scope = "ln", reuse = None):
        '''
        Implement layer normalization
        Args:
            inputs: [Tensor], A tensor with two or more dimensions, where the first one is "batch_size"
            epsilon: [Float], A small number for preventing ZeroDivision Error
            scope: [String], Optional scope for "variable_scope"
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name
        Returns:
            A tensor with the same shape and data type as "inputs"
        '''
        with tf.variable_scope(scope, reuse = reuse):
            inputs_shape = inputs.get_shape()  # e.g. if a has shape (2, 3), a.get_shape().as_list() returns [2, 3]
            params_shape = inputs_shape[-1 :]  # params_shape is just the last dimension

            # tf.nn.moments returns the mean and variance, here computed over the last axis of each position
            mean, variance = tf.nn.moments(inputs, [-1], keep_dims = True)
            beta  = tf.Variable(tf.zeros(params_shape))
            gamma = tf.Variable(tf.ones(params_shape))
            normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
            outputs = gamma * normalized + beta
    
        return outputs
    

    Both the self-attention sub-layer and the feed-forward network are followed by this normalization step. The epsilon parameter prevents the denominator from being zero, and scope is the variable-scope name. The affine formula looks like batch normalization's, but the mean and variance here are taken over the last (feature) axis of each position, so this is layer normalization rather than batch normalization. The purpose is to make training better behaved, with smaller oscillations during back-propagation; many deep networks add a normalization after their sub-layer outputs. A quick check of the axis choice is sketched below.
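
    A quick NumPy check of that claim (my own illustration, not the repo's TF code): normalizing over the last axis gives every position zero mean and unit variance across its 512 features.

    import numpy as np

    x = np.random.randn(2, 10, 512)             # (batch, max_len, hidden)
    mean = x.mean(axis = -1, keepdims = True)
    var = x.var(axis = -1, keepdims = True)
    ln = (x - mean) / np.sqrt(var + 1e-8)       # gamma = 1, beta = 0 here
    print(ln.mean(axis = -1)[0, :3])            # ~0 for each position
    print(ln.std(axis = -1)[0, :3])             # ~1 for each position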

    def positional_encoding(inputs,
                            vocab_size,
                            num_units,
                            zero_pad = True,
                            scale = True,
                            scope = "positional_embedding",
                            reuse = None):
        '''
        Positional_Encoding for a given tensor.
    
        Args:
            inputs: [Tensor], A tensor contains the ids to be search from the lookup table, shape = [batch_size, 1 + len(inpt)]
            vocab_size: [Int], Vocabulary size
            num_units: [Int], Hidden size of embedding
            zero_pad: [Boolean], If True, all the values of the first row(id = 0) should be constant zero
            scale: [Boolean], If True, the output will be multiplied by sqrt num_units(check details from paper)
            scope: [String], Optional scope for 'variable_scope'
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name
    
            Returns:
                A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
        '''
    
        """
            inputs (batch_size, 1+len(inputs)) 那么N就是batch_size, 然后T就是maxlen,大小为10
            num_units 就是隐层单元的个数,维度的大小
        """
        N, T = inputs.get_shape().as_list()
        
        with tf.variable_scope(scope, reuse = reuse):
            position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])
    
            # First part of the PE function: sin and cos argument
            position_enc = np.array([
                [pos / np.power(10000, 2.*i/num_units) for i in range(num_units)]
                for pos in range(T)])
    
            # Second part, apply the cosine to even columns and sin to odds.
            position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
            position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1
    
            # Convert to a float32 tensor so it can be combined with the other float32 tensors
            lookup_table = tf.convert_to_tensor(position_enc, dtype = tf.float32)
    
            if zero_pad:
                lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                              lookup_table[1:, :]), 0)
            outputs = tf.nn.embedding_lookup(lookup_table, position_ind)
    
            if scale:
                outputs = outputs * num_units**0.5
            
        return tf.cast(outputs, tf.float32)
    

    This code implements the positional embedding from the architecture diagram. Since no recurrent model is used, embedding the positions is what captures the order information. A small sketch of the resulting table is shown below.
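
    The following small NumPy sketch (toy sizes, my own illustration) builds the same sinusoidal table as the code above, for T = 10 positions and num_units = 8: even columns get a sine, odd columns a cosine, so every position receives a unique, smoothly varying code.

    import numpy as np

    T, num_units = 10, 8
    position_enc = np.array([
        [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)]
        for pos in range(T)])
    position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])   # dim 2i
    position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])   # dim 2i+1
    print(position_enc.shape)    # (10, 8): one row per position
    print(position_enc[0])       # position 0 -> [0., 1., 0., 1., ...]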

    def embedding(inputs,
                vocab_size,
                num_units,
                zero_pad = True,
                scale = True,
                scope = "embedding",
                reuse = None):
        '''
        Embed a given tensor.
        Args:
            inputs: [Tensor], A tensor contains the ids to be search from the lookup table
            vocab_size: [Int], Vocabulary size
            num_units: [Int], Hidden size of embedding
            zero_pad: [Boolean], If True, all the values of the first row(id = 0) should be constant zero
            scale: [Boolean], If True, the output will be multiplied by sqrt num_units(check details from paper)
            scope: [String], Optional scope for 'variable_scope'
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name
    
            Returns:
                A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
        '''
    
        """
            inputs传进来就(batch_size, 10)
            lookup_table维度(vocab_size, 512),进行了随机的初始化
            """ 
          # shape = [vocabsize, 8]
        with tf.variable_scope(scope, reuse = reuse):
            lookup_table = tf.get_variable('lookup_table',
                                            dtype = tf.float32,
                                            shape = [vocab_size, num_units],
                                            initializer = tf.contrib.layers.xavier_initializer())
    
            if zero_pad:
            ''' tf.zeros has shape (1, 512)
                lookup_table[1:, :] drops the <PAD> row, replaces it with zeros, and concatenates the pieces again,
                so lookup_table still has shape (vocab_size, 512)
            '''
                lookup_table = tf.concat((tf.zeros(shape = [1, num_units]),  lookup_table[1:, :]), 0)
    
        # outputs has shape (batch_size, 10, 512) == [N, T, S]
            outputs = tf.nn.embedding_lookup(lookup_table, inputs)
    
            if scale:
            # the scaling step described in the embedding section of the paper
                outputs = outputs * math.sqrt(num_units)
    
        return outputs
    

    Input shape: (batch_size, maxlen) == [N, T]
    Output shape: (batch_size, maxlen, S) == [N, T, S]
    Note that when lookup_table is initialized, the row for id = 0 (the first row, i.e. <PAD>) is reset to all zeros. With scale set to True the embeddings are multiplied by sqrt(num_units); the paper explains in its embedding section why this scaling is needed.

    Next is multi-head attention, the core of this codebase. The comments track how the tensor shapes change.
    The final output shape is [N, T_q, S].

    def multihead_attention(queries,
                            keys,
                            num_units = None,
                            num_heads = 8,
                            dropout_rate = 0,
                            is_training = True,
                            causality = False,
                            scope = "multihead_attention",
                            reuse = None):
        '''
        Implement multihead attention
    
        Args:
            queries: [Tensor], A 3-dimensions tensor with shape of [N, T_q, S_q]
            keys: [Tensor], A 3-dimensions tensor with shape of [N, T_k, S_k]
            num_units: [Int], Attention size
            num_heads: [Int], Number of heads
            dropout_rate: [Float], A ratio of dropout
            is_training: [Boolean], If true, controller of mechanism for dropout
            causality: [Boolean], If true, units that reference the future are masked
            scope: [String], Optional scope for "variable_scope"
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name
        
        Returns:
            A 3-dimensions tensor with shape of [N, T_q, S]
        '''
        """ queries = self.enc  (batch_size, 10 ,512)==[N, T_q, S] keys也是self.enc  
            num_units =512, num_heads =10
        """
        with tf.variable_scope(scope, reuse = reuse):
            if num_units is None:
            # default to the embedding dimension of the queries
                num_units = queries.get_shape().as_list()[-1]
    
            """ Linear layers in Figure 2(right) 就是Q、K、V进入scaled Dot-product Attention前的Linear的操作
            # 首先是进行了全连接的线性变换
            shape = [N, T_q, S]  (batch_size, 10 ,512), S可以理解为512"""
            Q = tf.layers.dense(queries, num_units, activation = tf.nn.relu)
            # shape = [N, T_k, S]
            K = tf.layers.dense(keys, num_units, activation = tf.nn.relu)
            # shape = [N, T_k, S]
            V = tf.layers.dense(keys, num_units, activation = tf.nn.relu)
        '''
            Q_, K_ and V_ are Q, K and V split into num_heads heads and stacked along the batch axis,
            shape (batch_size*8, 10, 512/8 = 64)
        '''
            # Split and concat
            # shape = [N*h, T_q, S/h]
            Q_ = tf.concat(tf.split(Q, num_heads, axis = 2), axis = 0)
            # shape = [N*h, T_k, S/h]
            K_ = tf.concat(tf.split(K, num_heads, axis = 2), axis = 0)
            # shape = [N*h, T_k, S/h]
            V_ = tf.concat(tf.split(V, num_heads, axis = 2), axis = 0)
    
        # batched matrix multiplication: [N*h, T_q, S/h] x [N*h, S/h, T_k]
        # shape = [N*h, T_q, T_k]
            outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))
    
            # Scale
            outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
    
            # Masking
            # shape = [N, T_k]
        # tf.reduce_sum collapses the last axis (3-D -> 2-D); abs and sign turn each position into 0 (all-zero padding) or 1
        '''[N, T_k, 512] -> [N, T_k] -> [N*h, T_k] -> [N*h, T_q, T_k]'''
            key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis = -1)))
            # shape = [N*h, T_k]
            key_masks = tf.tile(key_masks, [num_heads, 1])
        # shape = [N*h, T_q, T_k]; tf.expand_dims inserts the T_q axis before tiling
            key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])
    
        # where key_masks == 0 (padding), replace the score with a very large negative number so that softmax gives it ~0 weight
            paddings = tf.ones_like(outputs) * (-math.pow(2, 32) + 1)
            # shape = [N*h, T_q, T_k]
            outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)
    
    
    
        if causality: # if True, mask the units so that a position cannot attend to future positions
                # reduce dims : shape = [T_q, T_k]
                diag_vals = tf.ones_like(outputs[0, :, :])
                # shape = [T_q, T_k]
            # use a lower-triangular matrix to ignore the effect of future words
            # like : [[1,0,0]
            #         [1,1,0]
            #         [1,1,1]]
                tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()
                # shape = [N*h, T_q, T_k]
                masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])
    
                paddings = tf.ones_like(masks) * (-math.pow(2, 32) + 1)
                # shape = [N*h, T_q, T_k]
                outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
    
            # Output Activation
            outputs = tf.nn.softmax(outputs)
    
            # Query Masking
            # shape = [N, T_q]
            query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis = -1)))
            # shape = [N*h, T_q]
            query_masks = tf.tile(query_masks, [num_heads, 1])
            # shape = [N*h, T_q, T_k]
            query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])
            outputs *= query_masks 
    
            # Dropouts
            outputs = tf.layers.dropout(outputs, rate = dropout_rate, training = tf.convert_to_tensor(is_training))
    
            # Weighted sum
            # shape = [N*h, T_q, S/h]
            outputs = tf.matmul(outputs, V_)
    
            # Restore shape
            # shape = [N, T_q, S]
            outputs = tf.concat(tf.split(outputs, num_heads, axis = 0), axis = 2)
    
            # Residual connection
            outputs += queries
    
            # Normalize
            # shape = [N, T_q, S]
            outputs = normalize(outputs)
    
        return outputs
    

    The feed-forward block applies two position-wise linear transformations with a ReLU in between (the commented-out version implements them as conv1d layers with kernel size 1); the second layer takes the first layer's output as its input. A residual connection then adds the inputs, followed by normalize. The output shape is still [N, T_q, S].

    def feedforward(inputs,
                    num_units = [2048, 512],
                    scope = "multihead_attention",
                    reuse = None):
        '''
        Position-wise feed forward neural network
    
        Args:
            inputs: [Tensor], A 3d tensor with shape [N, T, S]
            num_units: [Int], A list of convolution parameters
            scope: [String], Optional scope for "variable_scope"
            reuse: [Boolean], If to reuse the weights of a previous layer by the same name 
        
        Return:
            A tensor converted by feedforward layers from inputs
        '''
    
        with tf.variable_scope(scope, reuse = reuse):
            # params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1, \
                      # "activation": tf.nn.relu, "use_bias": True}
            # outputs = tf.layers.conv1d(inputs = inputs, filters = num_units[0], kernel_size = 1, activation = tf.nn.relu, use_bias = True)
            # outputs = tf.layers.conv1d(**params)
            params = {"inputs": inputs, "num_outputs": num_units[0], \
                      "activation_fn": tf.nn.relu}
            outputs = tf.contrib.layers.fully_connected(**params)
    
            # params = {"inputs": inputs, "filters": num_units[1], "kernel_size": 1, \
            #         "activation": None, "use_bias": True}
            params = {"inputs": inputs, "num_outputs": num_units[1], \
                      "activation_fn": None}
            # outputs = tf.layers.conv1d(inputs = inputs, filters = num_units[1], kernel_size = 1, activation = None, use_bias = True)
            # outputs = tf.layers.conv1d(**params)
            outputs = tf.contrib.layers.fully_connected(**params)
    
            # residual connection
            outputs += inputs
    
            outputs = normalize(outputs)
    
        return outputs
    

    Finally comes label smoothing: the 0s in the one-hot targets are replaced by a small value and the 1s by a value slightly below 1. For example, with epsilon = 0.1 and a one-hot depth of V, every 1 becomes 0.9 + 0.1/V and every 0 becomes 0.1/V. A quick numeric check follows the code.

    def label_smoothing(inputs, epsilon = 0.1):
        '''
        Implement label smoothing
    
        Args:
            inputs: [Tensor], A 3d tensor with shape of [N, T, V]
            epsilon: [Float], Smoothing rate
    
        Return:
            A tensor after smoothing
        '''
    ''' inputs has shape (batch_size, sentence_length, one-hot depth):
        N is the batch size, T the sentence length and V the vocabulary size
    '''
        K = inputs.get_shape().as_list()[-1]
        return ((1 - epsilon) * inputs) + (epsilon / K)
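
    As a quick numeric check of that formula (plain NumPy, my own illustration rather than the repo's TF graph): with epsilon = 0.1 and a one-hot depth of 4, a 1 becomes 0.925 and every 0 becomes 0.025.

    import numpy as np

    epsilon, V = 0.1, 4
    one_hot = np.array([[0., 1., 0., 0.]])
    smoothed = (1 - epsilon) * one_hot + epsilon / V
    print(smoothed)    # [[0.025 0.925 0.025 0.025]]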
    

    4. Training

    • train.py
      Here self.decoder_input is built by prepending the id 2 (<STR>) to every target sentence and dropping the last token, i.e. shifting the target one position to the right; for example, a target row [5, 9, 3, 0, 0, ...] becomes [2, 5, 9, 3, 0, ...]. Its shape remains [N, T].
    # -*- coding: utf-8 -*-
    
    from __future__ import print_function
    import tensorflow as tf
    from params import Params as pm
    from data_loader import get_batch_data, load_vocab
    from modules import *
    from tqdm import tqdm
    import os
    
    class Graph():
        # everything is set up directly in __init__
        def __init__(self, is_training = True):
            self.graph = tf.Graph()
    
            with self.graph.as_default():
                if is_training:
                    self.inpt, self.outpt, self.batch_num = get_batch_data()
    
                else:
                    '''inpt: (None, maxlen), outpt: (None, maxlen), with maxlen = 10'''
                    self.inpt = tf.placeholder(tf.int32, shape = (None, pm.maxlen))
    
                    self.outpt = tf.placeholder(tf.int32, shape = (None, pm.maxlen))
    
                # start with 2(<STR>) and without 3(<EOS>)
                self.decoder_input = tf.concat((tf.ones_like(self.outpt[:, :1])*2, self.outpt[:, :-1]), -1)
    
                #  load the en and de vocabularies directly; en has 1644 entries and de has 1588
                en2idx, idx2en = load_vocab('en.vocab.tsv')
                de2idx, idx2de = load_vocab('de.vocab.tsv')
    
                # Encoder
                with tf.variable_scope("encoder"):
                    ''' self.inpt has shape (batch_size, maxlen)
                        self.enc has shape (batch_size, maxlen, 512)
                    '''
                    self.enc = embedding(self.inpt,
                                        vocab_size = len(en2idx),
                                        num_units  = pm.hidden_units,
                                        scale = True,
                                        scope = "enc_embed")
    
                    # Position Encoding(use range from 0 to len(inpt) to represent position dim of each words)
                    # tf.tile(tf.expand_dims(tf.range(tf.shape(self.inpt)[1]), 0), [tf.shape(self.inpt)[0], 1]),
                    self.enc += positional_encoding(self.inpt,
                                        vocab_size = pm.maxlen,
                                        num_units  = pm.hidden_units,
                                        zero_pad   = False,
                                        scale = False,
                                        scope = "enc_pe")
    
                    # Dropout
                    self.enc = tf.layers.dropout(self.enc,
                                                rate = pm.dropout,
                                                training = tf.convert_to_tensor(is_training))
    
                    # Identical
                    for i in range(pm.num_identical):
                        with tf.variable_scope("num_identical_{}".format(i)):
                            # Multi-head Attention
                            self.enc = multihead_attention(queries = self.enc,
                                                            keys   = self.enc,
                                                            num_units = pm.hidden_units,
                                                            num_heads = pm.num_heads,
                                                            dropout_rate = pm.dropout,
                                                            is_training  = is_training,
                                                            causality = False)
    
                            self.enc = feedforward(self.enc, num_units = [4 * pm.hidden_units, pm.hidden_units])
    

    Below is the decoder part of the code. Referring back to the decoder structure described earlier, it contains one extra attention sub-layer: it receives the tensor output by the encoder (as keys) and the tensor produced by the decoder's self-attention (as queries), and performs vanilla encoder-decoder attention on them.
    The tensor output by the decoder has shape [N, T, 512].

                # Decoder
                with tf.variable_scope("decoder"):
                    self.dec = embedding(self.decoder_input,
                                    vocab_size = len(de2idx),
                                    num_units  = pm.hidden_units,
                                    scale = True,
                                    scope = "dec_embed")
    
                    # Position Encoding(use range from 0 to len(inpt) to represent position dim)
                    self.dec += positional_encoding(self.decoder_input,
                                        vocab_size = pm.maxlen,
                                        num_units = pm.hidden_units,
                                        zero_pad  = False,
                                        scale = False,
                                        scope = "dec_pe")
    
                    # Dropout
                    self.dec = tf.layers.dropout(self.dec,
                                                rate = pm.dropout,
                                                training = tf.convert_to_tensor(is_training))
    
                    # Identical
                    for i in range(pm.num_identical):
                        with tf.variable_scope("num_identical_{}".format(i)):
                            # Multi-head Attention(self-attention)
                            self.dec = multihead_attention(queries = self.dec,
                                                            keys   = self.dec,
                                                            num_units = pm.hidden_units,
                                                            num_heads = pm.num_heads,
                                                            dropout_rate = pm.dropout,
                                                            is_training  = is_training,
                                                            causality = True,
                                                            scope = "self_attention")
    
                            # Multi-head Attention(vanilla-attention)
                            self.dec = multihead_attention(queries=self.dec, 
                                                            keys=self.enc, 
                                                            num_units=pm.hidden_units, 
                                                            num_heads=pm.num_heads,
                                                            dropout_rate=pm.dropout,
                                                            is_training=is_training, 
                                                            causality=False,
                                                            scope="vanilla_attention")
    
                            self.dec = feedforward(self.dec, num_units = [4 * pm.hidden_units, pm.hidden_units])
    

    At this point we have reached the decoder output:
    self.logits: the result of the Linear projection, with shape [N, T, len(de2idx)]
    self.preds: the index of the largest value along the last axis of self.logits, with shape [N, T]
    self.istarget: 1.0 at every position where the id in self.outpt is not 0 (i.e. not padding), with shape [N, T]
    self.acc: compares self.preds with self.outpt; positions that match count as 1.0, the rest as 0, averaged over the non-padding positions.

                # Linear
                self.logits   = tf.layers.dense(self.dec, len(de2idx))
                self.preds    = tf.to_int32(tf.arg_max(self.logits, dimension = -1))
                self.istarget = tf.to_float(tf.not_equal(self.outpt, 0))
                self.acc      = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.outpt)) * self.istarget) / (tf.reduce_sum(self.istarget))
                tf.summary.scalar('acc', self.acc)
    

    When is_training is True, i.e. during training, the following operations are also built.
    self.loss has shape [N, T].

                if is_training:
                    # smooth inputs
                    self.y_smoothed = label_smoothing(tf.one_hot(self.outpt, depth = len(de2idx)))
                    # loss function
                    self.loss = tf.nn.softmax_cross_entropy_with_logits(logits = self.logits, labels = self.y_smoothed)
                    self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / (tf.reduce_sum(self.istarget))
    
                    self.global_step = tf.Variable(0, name = 'global_step', trainable = False)
                    # optimizer
                    self.optimizer = tf.train.AdamOptimizer(learning_rate = pm.learning_rate, beta1 = 0.9, beta2 = 0.98, epsilon = 1e-8)
                    self.train_op  = self.optimizer.minimize(self.mean_loss, global_step = self.global_step)
    
                    tf.summary.scalar('mean_loss', self.mean_loss)
                    self.merged = tf.summary.merge_all()   
    
    if __name__ == '__main__':
        '''en2idx = {'<PAD>': 0, ...} and idx2en = {0: '<PAD>', ...} are dictionaries with 1644 entries'''
        '''de2idx = {'<PAD>': 0, ...} and idx2de = {0: '<PAD>', ...} are dictionaries with 1588 entries'''
        en2idx, idx2en = load_vocab('en.vocab.tsv')
        de2idx, idx2de = load_vocab('de.vocab.tsv')
    
        g = Graph("train")
        print("MSG : Graph loaded!")
    
        # save model and use this model to training
        supvisor = tf.train.Supervisor(graph = g.graph,logdir = pm.logdir,save_model_secs = 0)
    
        with supvisor.managed_session() as sess:
            for epoch in range(1, pm.num_epochs + 1):
                if supvisor.should_stop():
                    break
                # process bar
                for step in tqdm(range(g.batch_num), total = g.batch_num, ncols = 70, leave = False, unit = 'b'):
                    sess.run(g.train_op)
    
                if not os.path.exists(pm.checkpoint):
                    os.mkdir(pm.checkpoint)
                g_step = sess.run(g.global_step)
                supvisor.saver.save(sess, pm.checkpoint + '/model_epoch_%02d_gs_%d' % (epoch, g_step))
    
        print("MSG : Done!")
    

    III. Some questions to think about

    1. Where exactly does the Transformer's novelty lie? How does it differ from traditional encoder-decoder models, and what goal is it trying to achieve?
    2. Why use self-attention and multi-head attention?
    3. How should the masks used in the Transformer be understood?

    Finally, if anything here is misunderstood, corrections are welcome!

    References:
    1. Attention is all you need (the original paper)
    2. Notes on the "Attention is all you need" model
    3. The illustrated transformer
    4. The Annotated Transformer (Harvard NLP)
