Sparse Attention with Linear Units
Apr 2021
EMNLP 2021
https://arxiv.org/abs/2104.07012
https://github.com/bzhangGo/zero
Biao Zhang, Ivan Titov, Rico Sennrich
Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e., 'switch off') for some queries, which is not possible with sparsified softmax alternatives.
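To make the core idea concrete, here is a minimal NumPy sketch of attention as described in the abstract: scaled dot-product scores pass through a ReLU instead of a softmax (so many weights become exactly zero), the aggregated values are layer-normalized for stability, and an optional sigmoid gate stands in for the paper's "additional gating function". This is an illustrative reading of the abstract, not the authors' exact implementation; the gate weight `gate_w` and the plain layer norm are assumptions made here for brevity (see the linked repository for the actual code).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Standard layer normalization over the last dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rela_attention(q, k, v, gate_w=None):
    """Sketch of Rectified Linear Attention (ReLA) for a single head.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    gate_w: optional (d, d_v) weight for a sigmoid gate computed from the
            query (hypothetical parameter illustrating the gated variant).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # scaled dot-product scores
    weights = np.maximum(scores, 0.0)    # ReLU instead of softmax: exact zeros give sparsity
    out = weights @ v                    # a query whose weights are all zero attends to nothing
    out = layer_norm(out)                # normalization of the attention output for training stability
    if gate_w is not None:
        gate = 1.0 / (1.0 + np.exp(-(q @ gate_w)))  # sigmoid gate from the query
        out = gate * out
    return out, weights

# Toy usage: 3 queries attending over 5 key/value pairs.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out, weights = rela_attention(q, k, v)
print(out.shape, (weights == 0).mean())  # output shape and fraction of zeroed attention weights
```

Unlike softmax weights, the ReLU weights are unnormalized and can be all zero for a query, which is what allows a head to "switch off" as described in the abstract.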