Improving Language Understanding by Generative Pre-Training

Radford et al. (2018)

GPT (Generative Pre-Training) is a semi-supervised approach to language understanding tasks that combines unsupervised pre-training with supervised fine-tuning.
- Goal: Learn a universal representation that transfers to a wide range of tasks with little adaptation.
- Assumption: We have a large corpus of unlabeled text and several annotated training sets.

2. Two-stage training procedure

Unsupervised pre-training: Use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. The paper uses the Transformer (Vaswani et al., 2017) as its model architecture.
Supervised fine-tuning: Adapt these parameters to a target task using the corresponding supervised objective.

3. Unsupervised pre-training

Assume an unsupervised corpus of tokens $U=\{u_1, \ldots, u_n\}$.

A multi-layer Transformer applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:
$\begin{aligned} h_0 &= UW_e+W_p\\ h_l &= \text{transformer\_block}(h_{l-1}) \quad \forall{l}\in[1,n]\\ P(u) &= \text{softmax}(h_nW_e^T) \end{aligned}$
where $U=(u_{-k},\ldots,u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
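
The embedding and output steps of these equations can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the matrix values and toy sizes are made up, the helper names (`embed`, `output_distribution`) are hypothetical, and the transformer blocks are replaced by an identity stand-in so that only $h_0$ and the tied output projection $h_n W_e^T$ are shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def embed(context_ids, W_e, W_p):
    """h_0 = U W_e + W_p: token embedding plus position embedding, per position.
    Indexing a row of W_e is equivalent to multiplying a one-hot vector by W_e."""
    return [
        [e + p for e, p in zip(W_e[tok], W_p[pos])]
        for pos, tok in enumerate(context_ids)
    ]

def output_distribution(h_last, W_e):
    """P(u) = softmax(h_n W_e^T): score each vocabulary token against the
    final hidden state, reusing the token embedding matrix."""
    logits = [sum(h * w for h, w in zip(h_last, row)) for row in W_e]
    return softmax(logits)

# Toy sizes (assumptions): vocabulary of 4 tokens, width 3, context window k = 2.
W_e = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.1], [0.2, 0.1, 0.0], [0.4, 0.0, 0.1]]
W_p = [[0.01, 0.02, 0.03], [0.03, 0.02, 0.01]]

h0 = embed([2, 0], W_e, W_p)
# Identity stand-in for the n transformer blocks, so h_n == h_0 here.
probs = output_distribution(h0[-1], W_e)
```

Because the same matrix $W_e$ appears in both the input embedding and the output projection, the model needs no separate output vocabulary matrix.
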
The objective is to maximize the following likelihood $L_1$ :
$L_1(U)=\sum_{i}{\log{P(u_i|u_{i-k},\ldots,u_{i-1};\Theta)}}$
where $k$ is the size of the context window and $\Theta$ is the model's parameters.
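
As a rough sketch of how $L_1$ is accumulated, the snippet below sums log-probabilities over a toy corpus using a sliding context of at most $k$ tokens. The conditional model `uniform_model` is a hypothetical stand-in for the Transformer, and truncating the context at the start of the corpus is an assumption made for illustration.

```python
import math

def lm_log_likelihood(tokens, k, cond_prob):
    """L_1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1})."""
    total = 0.0
    for i, target in enumerate(tokens):
        # At most k preceding tokens; shorter near the start (an assumption).
        context = tuple(tokens[max(0, i - k):i])
        total += math.log(cond_prob(target, context))
    return total

def uniform_model(target, context, vocab_size=4):
    """Hypothetical stand-in for the Transformer: ignores the context and
    assigns equal probability to every vocabulary token."""
    return 1.0 / vocab_size

corpus = [0, 2, 1, 3, 2]
L1 = lm_log_likelihood(corpus, k=2, cond_prob=uniform_model)
```

With a uniform model every term equals $\log(1/4)$, so `L1` is simply five times that value; a trained model would assign higher probability to the observed tokens and therefore a larger (less negative) $L_1$.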

4. Supervised fine-tuning

Assume a labeled dataset $C$ in which each instance is a sequence of input tokens $[x^1, \ldots, x^m]$ along with a label $y$.

Pass the inputs through the pre-trained model to obtain the final transformer block's activation $h_l^m$, then feed $h_l^m$ into an added linear output layer with parameters $W_y$ to predict $y$:
$P(y|x^1,\ldots, x^m)=\text{softmax}(h_l^m W_y)$
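
The added output layer is just a linear map followed by a softmax. A minimal sketch, assuming a hidden width of 3 and two classes; the numbers and the helper name `classify` are illustrative and not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_final, W_y):
    """softmax(h_l^m W_y), where W_y has shape (hidden, num_classes)."""
    num_classes = len(W_y[0])
    logits = [
        sum(h_final[d] * W_y[d][c] for d in range(len(h_final)))
        for c in range(num_classes)
    ]
    return softmax(logits)

# Hypothetical final activation for the last input token and a 3x2 weight matrix.
h_final = [0.5, -1.0, 0.25]
W_y = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
probs = classify(h_final, W_y)
```

Only $W_y$ is new at this stage; every other parameter comes from pre-training.
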
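The objective is to maximize the following likelihood $L_2$:
$L_2(C)=\sum_{(x,y)}{\log{P(y|x^1,\ldots,x^m)}}$
- Including the language modeling objective $L_1$ as an auxiliary objective during fine-tuning both improves the generalization of the supervised model and accelerates convergence. The combined objective is $L_3(C)=L_2(C)+\lambda L_1(C)$, where $\lambda$ weights the auxiliary term.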
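
The paper combines the two objectives as $L_3(C)=L_2(C)+\lambda L_1(C)$, where $L_2(C)=\sum_{(x,y)}\log P(y|x^1,\ldots,x^m)$ is the supervised log-likelihood. A small sketch of that arithmetic follows; the stand-in classifier `coin_flip`, the toy examples, the value of `l1`, and the default `lam` are all assumptions chosen for illustration.

```python
import math

def supervised_log_likelihood(examples, label_prob):
    """L_2(C) = sum over (x, y) in C of log P(y | x^1, ..., x^m)."""
    return sum(math.log(label_prob(y, x)) for x, y in examples)

def combined_objective(l2, l1, lam=0.5):
    """L_3(C) = L_2(C) + lambda * L_1(C); lam here is an illustrative default."""
    return l2 + lam * l1

def coin_flip(y, x):
    """Hypothetical binary classifier that is always 50/50."""
    return 0.5

examples = [([1, 2], 0), ([3], 1)]   # (input token ids, label)
l2 = supervised_log_likelihood(examples, coin_flip)
l1 = -4.0                            # assumed LM log-likelihood on the same inputs
l3 = combined_objective(l2, l1)
```

Setting `lam=0.0` recovers plain supervised fine-tuning, which makes it easy to compare runs with and without the auxiliary term.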

5. Task-specific input transformations

All transformations include adding randomly initialized start and end tokens ($\langle s \rangle$, $\langle e \rangle$) to the input sequence.
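
Wrapping an input with start and end tokens can be sketched as below. The token strings, the toy vocabulary, and the Gaussian initialization scale are assumptions for illustration; the point is only that the two special tokens get new, randomly initialized embedding rows and are placed around the input.

```python
import random

START, END = "<s>", "<e>"  # placeholder strings for the start and end tokens

def add_special_tokens(vocab, W_e, rng, scale=0.02):
    """Append start/end tokens to the vocabulary, each with a randomly
    initialized embedding row (scale is an assumed value)."""
    dim = len(W_e[0])
    for tok in (START, END):
        vocab[tok] = len(W_e)
        W_e.append([rng.gauss(0.0, scale) for _ in range(dim)])

def wrap(tokens):
    """Place the start token before and the end token after the input."""
    return [START, *tokens, END]

vocab = {"the": 0, "cat": 1}
W_e = [[0.1, 0.2], [0.3, 0.4]]
add_special_tokens(vocab, W_e, random.Random(0))

wrapped = wrap(["the", "cat"])
ids = [vocab[t] for t in wrapped]
```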


References

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).