
Note 1: Transformer

Author: qin7zhen | Published 2020-07-08 23:07

Attention Is All You Need [1]


1. Encoder-Decoder

  • The encoder maps an input sequence of symbol representations (x_1,\ldots, x_n) to a sequence of continuous representations z = (z_1, \ldots, z_n).
  • Given z, the decoder then generates an output sequence (y_1, \ldots, y_m) of symbols one element at a time.
  • At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
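The generation procedure can be summarized by the following schematic loop. It is only a sketch of the encoder-decoder interface described above: encode, decode_step, bos, and eos are hypothetical placeholders, not functions from the paper or any library.

```python
# Schematic auto-regressive generation loop (a sketch only; `encode`,
# `decode_step`, `bos`, and `eos` are hypothetical placeholders).
def generate(encode, decode_step, x, bos, eos, max_len=100):
    z = encode(x)                      # z = (z_1, ..., z_n)
    y = [bos]                          # start with a begin-of-sequence symbol
    for _ in range(max_len):
        y_next = decode_step(z, y)     # next symbol given z and y_1, ..., y_{t-1}
        y.append(y_next)
        if y_next == eos:              # stop at the end-of-sequence symbol
            break
    return y[1:]                       # the generated sequence (y_1, ..., y_m)
```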
(Figure: overview of the Transformer architecture [1])
  • Encoder
    • It has 6 identical layers.
    • Each layer has two sub-layers: a multi-head self-attention sub-layer followed by a position-wise fully connected feed-forward sub-layer.
    • Each sub-layer is wrapped in a residual connection and followed by layer normalization.
    • With the residual connection, the output of each sub-layer is LayerNorm(x + Sublayer(x)) (see the sketch after this list).
  • Decoder
    • It has 6 identical layers.
    • Each layer has three sub-layers:
      • The masked multi-head attention sub-layer ensures that the prediction at position t can only depend on the known outputs at positions before t.
      • The multi-head attention sub-layer additionally attends over the output of the encoder stack (encoder-decoder attention).
      • The position-wise fully connected feed-forward sub-layer.
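To make the sub-layer structure concrete, here is a minimal PyTorch-style sketch of one encoder layer. The residual-plus-layer-norm wrapping follows the description above; the module choices and default sizes (d_model=512, 8 heads, d_ff=2048, matching the paper's base configuration) are otherwise assumptions, not the authors' implementation.

```python
# Minimal sketch of one encoder layer: each sub-layer is wrapped as
# LayerNorm(x + Sublayer(x)). Illustrative only, not the reference code.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # multi-head self-attention, Q = K = V = x
        x = self.norm1(x + attn_out)           # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))        # position-wise FFN + residual + layer norm
        return x

x = torch.randn(2, 10, 512)                    # (batch, seq_len, d_model)
print(EncoderLayer()(x).shape)                 # torch.Size([2, 10, 512])
```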

2. Attention

  • An attention function maps a query and a set of key-value pairs to an output, where the output is a weighted sum of the values.
  • Scaled Dot-product Attention
    • Given a query Q \in R^{1 \times d_k}, m keys K \in R^{m \times d_k} and m values V \in R^{m \times d_v}:
      Attention(Q, K, V)=softmax(\frac{QK^T}{\sqrt{d_k}})V
    • Scaling by 1/\sqrt{d_k} keeps the dot products from growing so large that the softmax saturates (see the code sketch at the end of this section).
  • Multi-head Attention
    • Jointly attend to information from different representation subspaces at different positions.
    • Given n queries Q \in R^{n \times d_{model}}, keys K and values V:
      MultiHead(Q, K, V)=Concat({head}_1, \cdots, {head}_h)W^o
      where \ {head}_i=Attention(QW_i^Q, KW_i^K, VW_i^V)
      • W_i^Q \in R^{d_{model} \times d_k}, W_i^K \in R^{d_{model} \times d_k}, W_i^V \in R^{d_{model} \times d_v}, {head}_i \in R^{n \times d_v}, Concat(\cdot) \in R^{n \times hd_v}, W^o \in R^{hd_v \times d_{model}}.
      • The Q and MultiHead(Q,K,V) have the same dimension R^{n \times d_{model}}.
      • First, it linearly projects the queries, keys and values h times, using a different learned projection W_i^Q, W_i^K, W_i^V for each head.
      • Next, it applies the attention function to each projected triple in parallel and concatenates the h resulting output values.
      • At last, it projects the concatenated R^{n \times hd_v} matrix back to a d_{model}-dimensional output with W^o.
        (An illustrated example of this computation is given in [2].)
  • Attention in Transformer
    • Encoder's multi-head:
      • Q=K=V=the output of the previous encoder layer.
      • Each position in the encoder can attend to all positions in the previous layer of the encoder.
    • Decoder's masked multi-head:
      • Q=K=V=the output of the previous decoder layer, with illegal positions masked.
      • For example, when predicting the t-th output token, all tokens at positions after t have to be masked.
      • This prevents leftward information flow in the decoder in order to preserve the auto-regressive property.
      • It masks out (by setting them to -\infty) all values in the input of the softmax that correspond to illegal connections during the scaled dot-product attention.
    • Decoder's multi-head:
      • Q=the output of the previous decoder layer, K=V=the encoder stack's output.
      • This allows every position in the decoder to attend over all positions in the input sequence.
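The following NumPy sketch ties the formulas above together: scaled dot-product attention with an optional mask (as used by the decoder's masked multi-head attention) and the multi-head projection and concatenation. The shapes match the definitions above, but the toy sizes and code layout are illustrative assumptions, not the paper's implementation.

```python
# NumPy sketch of scaled dot-product attention and multi-head attention.
# Illustrative only; toy sizes, not the reference implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # Q: (n, d_k), K: (m, d_k), V: (m, d_v)  ->  output: (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # scaled dot products, (n, m)
    if mask is not None:                             # masked attention: illegal
        scores = np.where(mask, scores, -np.inf)     # connections are set to -inf
    return softmax(scores) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    # W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v); W_O: (h*d_v, d_model)
    heads = [attention(Q @ wq, K @ wk, V @ wv)       # head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O      # Concat(head_1..head_h) W^o, (n, d_model)

# Toy sizes: d_model = 8, h = 2 heads, d_k = d_v = 4.
rng = np.random.default_rng(0)
n, m, d_model, h, d_k = 3, 5, 8, 2, 4
Q = rng.normal(size=(n, d_model)); K = rng.normal(size=(m, d_model)); V = K.copy()
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
print(multi_head(Q, K, V, W_Q, W_K, W_V, W_O).shape)  # (3, 8)
```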

3. Position-wise Feed-forward Networks

FFN(x)=max(0, xW_1+b_1)W_2+b_2

  • ReLU activation function: max(0, x).
  • It's applied to each position separately and identically.
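A short NumPy sketch of the position-wise feed-forward network: the same two linear transformations with a ReLU in between are applied to every position. The sizes d_model=512 and d_ff=2048 follow the paper's base model; everything else is illustrative.

```python
# Position-wise FFN: FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each
# position (row of x) independently with shared parameters. Sketch only.
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model) -> (seq_len, d_model)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 10, 512, 2048
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)   # (10, 512)
```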

4. Positional Encoding

  • To make use of the order of the sequence.
  • Added to the input embeddings at the bottoms of the encoder and decoder stacks.
  • Has the same dimension d_{model} as the embeddings, so the two can be summed.
  • PE(pos, 2i)=sin(pos/10000^{2i / d_{model}})
    PE(pos, 2i+1)=cos(pos/10000^{2i / d_{model}})
    • pos is the index of position, 1 \leq pos \leq n in encoder while 1 \leq pos \leq m in decoder.
    • i indexes the sinusoid pairs, 0 \leq i < d_{model}/2, so that dimensions 2i and 2i+1 together cover all d_{model} dimensions.
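A NumPy sketch of the sinusoidal encoding defined by the two formulas above; the array layout (positions as rows, dimensions as columns) is an illustrative assumption.

```python
# Sinusoidal positional encoding: even dimensions use sin, odd use cos.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)     # pos / 10000^{2i/d_model}
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # PE(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angles)                  # PE(pos, 2i+1) = cos(...)
    return pe                                     # same dimension d_model as the embeddings

print(positional_encoding(seq_len=50, d_model=512).shape)   # (50, 512)
```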

Reference

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[2] 口仆. Transformer 原理解析 (Transformer explained). Zhihu. https://zhuanlan.zhihu.com/p/135873679
