Sparse Attention with Linear Units
Apr 2021
EMNLP 2021
https://arxiv.org/abs/2104.07012
https://github.com/bzhangGo/zero
Biao Zhang, Ivan Titov, Rico Sennrich
Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e., 'switch off') for some queries, which is not possible with sparsified softmax alternatives.
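To make the core idea concrete, here is a minimal NumPy sketch of attention as described in the abstract: scaled dot-product scores pass through a ReLU instead of a softmax (so many weights become exactly zero), the aggregated values are layer-normalized for stability, and an optional sigmoid gate stands in for the paper's "additional gating function". This is an illustrative reading of the abstract, not the authors' exact implementation; the gate weight `gate_w` and the plain layer norm are assumptions made here for brevity (see the linked repository for the actual code).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Standard layer normalization over the last dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rela_attention(q, k, v, gate_w=None):
    """Sketch of Rectified Linear Attention (ReLA) for a single head.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    gate_w: optional (d, d_v) weight for a sigmoid gate computed from the
            query (hypothetical parameter illustrating the gated variant).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # scaled dot-product scores
    weights = np.maximum(scores, 0.0)    # ReLU instead of softmax: exact zeros give sparsity
    out = weights @ v                    # a query whose weights are all zero attends to nothing
    out = layer_norm(out)                # normalization of the attention output for training stability
    if gate_w is not None:
        gate = 1.0 / (1.0 + np.exp(-(q @ gate_w)))  # sigmoid gate from the query
        out = gate * out
    return out, weights

# Toy usage: 3 queries attending over 5 key/value pairs.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out, weights = rela_attention(q, k, v)
print(out.shape, (weights == 0).mean())  # output shape and fraction of zeroed attention weights
```

Unlike softmax weights, the ReLU weights are unnormalized and can be all zero for a query, which is what allows a head to "switch off" as described in the abstract.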