WMT English-German Translation

Author: JackHorse | Published: 2019-02-12 15:14

1. The University of Cambridge’s Machine Translation Systems for WMT18

1. Basic Architecture

Combines the three most commonly used architectures: recurrent, convolutional, and self-attention-based (Transformer) models.

2. System Combination

To combine q models M_1,...,M_q, we first divide them into two groups by selecting a p with 1 \le p \le q.

The first group M_1,...,M_p is combined via full posterior scores, and the second group M_{p+1},...,M_q via MBR-based scores.

Full-posterior model scores are computed as follows:

[equation image not recovered from the original post]

Combined scores are computed as follows:

[equation image not recovered from the original post]
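Since the two equation images were lost, here is a hedged sketch of the standard forms such scores usually take (this is my reconstruction, not necessarily the paper's exact equations; \lambda_k are tunable model weights and MBR_k denotes the MBR-based score contributed by model M_k):

    F_k(y|x) = \sum\limits_{t=1}^{|y|} \log P_{M_k}(y_t|y_{<t}, x)

    S(y|x) = \sum\limits_{k=1}^{p} \lambda_k F_k(y|x) + \sum\limits_{k=p+1}^{q} \lambda_k MBR_k(y|x)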

3. Data

1. language detection (Nakatani, 2010) on all available monolingual and parallel data
2. additional filtering applied to ParaCrawl (a sketch of these rules follows the list):
  • No words contain more than 40 characters.
  • Sentences must not contain HTML tags.
  • The minimum sentence length is 4 words.
  • The character ratio between source and target must not exceed 1:3 or 3:1.
  • Source and target sentences must be equal after stripping out non-numerical characters.
  • Sentences must end with punctuation marks.
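A minimal Python sketch of these six ParaCrawl rules (the function name, the regex, and the accepted sentence-final characters are my assumptions, not the paper's):

    import re

    HTML_TAG = re.compile(r"<[^>]+>")
    END_PUNCT = (".", "!", "?", '"')  # assumed set of sentence-final marks

    def keep_pair(src: str, tgt: str) -> bool:
        """True if the (src, tgt) pair passes all six filters."""
        for sent in (src, tgt):
            words = sent.split()
            if len(words) < 4:                         # minimum length: 4 words
                return False
            if any(len(w) > 40 for w in words):        # no word over 40 characters
                return False
            if HTML_TAG.search(sent):                  # no HTML tags
                return False
            if not sent.rstrip().endswith(END_PUNCT):  # must end with punctuation
                return False
        ratio = len(src) / max(len(tgt), 1)            # character ratio within 1:3 .. 3:1
        if ratio > 3.0 or ratio < 1 / 3.0:
            return False
        src_digits = re.sub(r"\D", "", src)            # equal after stripping
        tgt_digits = re.sub(r"\D", "", tgt)            # non-numerical characters
        return src_digits == tgt_digits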

2. NTT’s Neural Machine Translation Systems for WMT 2018

1. Basic Architecture

Transformer Big

2. Data

  • Noisy Data Filtering (see the sketch after this list)
  1. use a language model (such as KenLM) to evaluate a sentence's naturalness
  2. use a word alignment model (such as fast_align) to check whether the source and target of a pair have the same meaning
  • Synthetic Corpus
  1. translate monolingual sentences with a Transformer model -> pseudo-parallel corpora
  2. back-translate & evaluate -> select the high-scoring sentence pairs
  • Right-to-Left Re-ranking
  1. the R2L model re-ranks the n-best hypotheses generated by the Left-to-Right (L2R) model (n=10)
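A sketch of the language-model half of this filtering pipeline, assuming KenLM models trained on clean monolingual data (the paths and perplexity cutoff are placeholders, not NTT's values; the fast_align meaning check would run as a separate pass):

    import kenlm  # https://github.com/kpu/kenlm

    lm_src = kenlm.Model("mono.en.arpa")  # placeholder model paths
    lm_tgt = kenlm.Model("mono.de.arpa")
    PPL_CUTOFF = 500.0                    # assumed threshold

    def is_natural(src: str, tgt: str) -> bool:
        """Keep a pair only if both sides look natural under the LMs."""
        return (lm_src.perplexity(src) < PPL_CUTOFF and
                lm_tgt.perplexity(tgt) < PPL_CUTOFF)

    # Second pass (not shown): drop pairs whose fast_align alignment score
    # suggests the source and target do not share the same meaning.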

3. Microsoft’s Submission to the WMT2018 News Translation Task: How I Learned to Stop Worrying and Love the Data

1. Basic Architecture

Transformer Big + Ensemble-decoding + R2L Reranking

2. Data

  • Dual conditional cross-entropy filtering (a sketch of this scoring rule follows the section)

    For a sentence pair (x, y), the adequacy score is computed as follows:

    adq(x, y) = exp(-(|H_A(y|x) - H_B(x|y)| + \frac{1}{2} (H_A(y|x) + H_B(x|y))))

    where A and B are translation models trained on the same data but in inverse directions (here A = W_{de->en} and B = W_{en->de}), and

    H_M(y|x) = - \frac{1}{|y|} \sum\limits_{t=1}^{|y|} \log P_M(y_t|y_{<t}, x)

    where P_M(y_t|y_{<t}, x) is the probability model M assigns to token y_t given the preceding tokens and the source sentence.

  • Data weighting

    Sentence-instance weighting is a feature available in Marian (Junczys-Dowmunt et al., 2018).

    sentence score = data weight * cross-entropy -> sort and select sentence pairs by score (see the second sketch below)
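Following up on the dual conditional cross-entropy filter above, a small Python sketch of the adq scoring rule; it assumes per-token log-probabilities are already available from the two models (argument names are mine):

    import math

    def cross_entropy(token_logprobs):
        """H_M: negative mean per-token log-probability."""
        return -sum(token_logprobs) / len(token_logprobs)

    def adq(logprobs_fwd, logprobs_rev):
        """Dual conditional cross-entropy score in (0, 1]; higher is better.
        logprobs_fwd: log P_A(y_t|y_{<t}, x) from the forward model A,
        logprobs_rev: log P_B(x_t|x_{<t}, y) from the inverse model B."""
        h_a = cross_entropy(logprobs_fwd)
        h_b = cross_entropy(logprobs_rev)
        return math.exp(-(abs(h_a - h_b) + 0.5 * (h_a + h_b)))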
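And a sketch of the weighting-based selection step, under one reading of the "sort and select" line above (whether lower or higher scores are kept depends on the sign convention; here lower weighted cross-entropy is treated as better, and all names are mine):

    def select_top(pairs, weights, cross_entropies, n):
        """Score each pair as weight * cross-entropy and keep the n best."""
        scores = [w * ce for w, ce in zip(weights, cross_entropies)]
        ranked = sorted(zip(pairs, scores), key=lambda item: item[1])
        return [pair for pair, _ in ranked[:n]]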
