Transformer

Author: 一梦换须臾_ | 2018-10-16 22:06

    Preface

    This article mainly introduces the Transformer model.
    Papers:
    Attention Is All You Need
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Reference notes:
    Attention: principles and source-code walkthrough
    Transformer explained in detail
    Language models and transfer learning
    Google BERT.
    Project reference:
    Transformer in Pytorch


    RNN + Attention

    Recall

    Another formulation of attention

    Attention(Q, K, V) = softmax(sim(Q, K)) · V

    Advantages & Disadvantages

    Advantages

    1. The recurrence naturally takes positional (order) information into account

    Disadvantages

    1. No parallel computation: the recurrence has to be unrolled sequentially
    2. Only decoder-encoder attention; there is no attention within the encoder itself or within the decoder itself

    Transformer

    Attention

    In the Transformer model, attention is represented as:

    Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

    Scaled Dot-Product Attention
    When d_k becomes large, the dot products grow large in magnitude, pushing softmax(QKᵀ) towards 0 or 1, where the softmax has extremely small gradients; the dot products are therefore scaled by 1/√d_k.
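
    A minimal PyTorch sketch of scaled dot-product attention (the optional mask argument and the tensor shapes are assumptions for illustration, not from the original post):

        import math
        import torch
        import torch.nn.functional as F

        def scaled_dot_product_attention(q, k, v, mask=None):
            # q, k, v: (..., seq_len, d_k); mask: broadcastable to (..., seq_len, seq_len)
            d_k = q.size(-1)
            # Scale by sqrt(d_k) so that a large d_k does not push the softmax into saturation
            scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
            if mask is not None:
                # Positions where the mask is 0/False are excluded from attention
                scores = scores.masked_fill(mask == 0, float('-inf'))
            attn = F.softmax(scores, dim=-1)
            return torch.matmul(attn, v), attn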

    Multi-head Attention
    It is beneficial to linearly project the queries, keys and values h times with different learned linear projections to d_k, d_k and d_v dimensions respectively; the h attention outputs are then concatenated and projected once more.
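
    One way to see the shapes is PyTorch's built-in nn.MultiheadAttention (not used in the original post; the batch_first flag requires PyTorch ≥ 1.9). Here d_model = 512 is split across h = 8 heads, so each head works in d_k = d_v = 64 dimensions:

        import torch
        import torch.nn as nn

        mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

        x = torch.randn(2, 10, 512)        # (batch, seq_len, d_model)
        out, weights = mha(x, x, x)        # self-attention: Q = K = V = x
        print(out.shape)                   # torch.Size([2, 10, 512])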

    Self Attention
    Besides decoder-encoder attention, attention is also applied within the encoder itself and within the decoder itself (self-attention).

    Encoder self-attention: Q = K = V = output of the previous layer

    Decoder self-attention: Q = K = V = output of the previous layer, with all attention to the right masked out, so that each position can only attend to itself and earlier positions, never to future positions.
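
    A minimal sketch of such a look-ahead mask (the helper name is illustrative); True marks the positions a query is allowed to attend to:

        import torch

        def subsequent_mask(size):
            # Position i may attend only to positions j <= i (lower triangle is True)
            return torch.tril(torch.ones(size, size, dtype=torch.bool))

        print(subsequent_mask(4))
        # tensor([[ True, False, False, False],
        #         [ True,  True, False, False],
        #         [ True,  True,  True, False],
        #         [ True,  True,  True,  True]])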


    Encoder

    • Embedding-Layer: token embedding & positional embedding (described later)

    • SubLayer_1: Multi-Head Attention: encoder self-attention

    • SubLayer_2: FeedForward Networks: a simple, position-wise fully connected feed-forward network
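
    A minimal sketch of this position-wise feed-forward network (d_model = 512 and d_ff = 2048 follow the paper's base configuration; the class name is illustrative):

        import torch
        import torch.nn as nn

        class PositionwiseFeedForward(nn.Module):
            # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position
            def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
                super().__init__()
                self.w_1 = nn.Linear(d_model, d_ff)
                self.w_2 = nn.Linear(d_ff, d_model)
                self.dropout = nn.Dropout(dropout)

            def forward(self, x):
                return self.w_2(self.dropout(torch.relu(self.w_1(x))))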

    Decoder

    • Embedding-Layer: token embedding & positional embedding

    • SubLayer_1: Masked Multi-Head Attention: decoder masked self-attention

    • SubLayer_2: Multi-Head Attention:
      Q: The previous decoder layer
      K, V: Output of the encoder

    • SubLayer_3: FeedForward Networks: a simple, position-wise fully connected feed-forward network

    • Linear & Softmax: a linear projection followed by a softmax over the output vocabulary

    Input of each sub-layer: x
    Output of each sub-layer: LayerNorm(x + SubLayer(x))
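
    A minimal sketch of this residual-plus-layer-normalization wrapper (the class name is illustrative; dropout on the sub-layer output follows the paper):

        import torch.nn as nn

        class SublayerConnection(nn.Module):
            # Computes LayerNorm(x + SubLayer(x))
            def __init__(self, d_model=512, dropout=0.1):
                super().__init__()
                self.norm = nn.LayerNorm(d_model)
                self.dropout = nn.Dropout(dropout)

            def forward(self, x, sublayer):
                # `sublayer` is a callable, e.g. the multi-head attention or feed-forward block
                return self.norm(x + self.dropout(sublayer(x)))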

    Positional-Encoding

    Since the model contains no recurrence and no convolution, positional encodings are added so that the model can make use of the order of the sequence:

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

    d_model: the same size as the token embedding (so the two can be summed)
    pos: the current position in the token sequence
    i: the dimension index

    That is, each dimension of the positional encoding corresponds to a sinusoid.
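
    A minimal sketch that precomputes the positional-encoding table from the formulas above (assumes an even d_model; the function name is illustrative):

        import math
        import torch

        def positional_encoding(max_len, d_model):
            # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
            pe = torch.zeros(max_len, d_model)
            position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
            div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
            pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
            return pe                                      # (max_len, d_model), added to the token embeddings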


    Experiments

    DataSet

    WMT'16 Multimodal Translation: Multi30k (de-en)

    PreProcess

    Train

    • Time elapsed per epoch (on an NVIDIA Titan X)
      Training set: 0.888 minutes
      Validation set: 0.011 minutes

    Evaluate


    BERT(Bidirectional Encoder Representations from Transformers)

    Pre-trained Models

    1. ELMo: Shallow Bi-directional, like a traditional Language Model
    2. OpenAI GPT: left-to-right, like a decoder
    3. BERT: Deep Bi-directional, like an encoder

    Input Embedding

    • Token Embeddings are the word vectors; the first token is the special CLS token, which can later be used for classification tasks (CLS for classification)
    • Segment Embeddings distinguish the two sentences, because pre-training involves not only language modelling but also a classification task that takes a sentence pair as input
    • Position Embeddings differ from the Transformer described above: they are learned rather than fixed sinusoids
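
    A minimal sketch of summing the three embeddings (the sizes are BERT-base defaults and purely illustrative; real BERT additionally applies layer normalization and dropout to the sum):

        import torch
        import torch.nn as nn

        class BertInputEmbedding(nn.Module):
            # Input representation = token embedding + segment embedding + learned position embedding
            def __init__(self, vocab_size=30522, d_model=768, max_len=512, n_segments=2):
                super().__init__()
                self.token = nn.Embedding(vocab_size, d_model)
                self.segment = nn.Embedding(n_segments, d_model)
                self.position = nn.Embedding(max_len, d_model)   # learned, not sinusoidal

            def forward(self, token_ids, segment_ids):
                # token_ids, segment_ids: (batch, seq_len)
                positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
                return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)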

    Fine-Tuning

