DaViT: Dual Attention Vision Transformers
https://arxiv.org/pdf/2204.03645.pdf
Authors: Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan
Abstract: In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
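The channel-token idea described in the abstract (attention computed between channel groups, with the spatial axis serving as the feature dimension, so every token already summarizes all spatial positions) can be sketched compactly. The following is a minimal, illustrative PyTorch sketch, not the authors' implementation: the module name, grouping scheme, and scaling factor are assumptions, and the official code at https://github.com/dingmyu/davit is the reference. In DaViT this channel attention is paired with standard window-based spatial self-attention in alternating blocks.

```python
# Minimal sketch of channel-group attention (assumed interface, not the official DaViT code).
import torch
import torch.nn as nn


class ChannelGroupAttention(nn.Module):
    """Channel tokens: attention scores are computed between channels within a group,
    and each score aggregates over all N spatial positions -> global by construction."""

    def __init__(self, dim, groups=8):
        super().__init__()
        self.groups = groups
        self.scale = (dim // groups) ** -0.5   # illustrative choice of scaling
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, C), N spatial tokens, C channels
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape so channel groups become the "tokens": (B, groups, C//groups, N)
        q = q.reshape(B, N, self.groups, C // self.groups).permute(0, 2, 3, 1)
        k = k.reshape(B, N, self.groups, C // self.groups).permute(0, 2, 3, 1)
        v = v.reshape(B, N, self.groups, C // self.groups).permute(0, 2, 3, 1)
        # Channel-to-channel attention: (C//groups) x (C//groups) scores per group,
        # each score summing over all N spatial positions (global interaction),
        # so the cost stays linear in the number of spatial tokens N.
        attn = ((q * self.scale) @ k.transpose(-2, -1)).softmax(dim=-1)
        out = attn @ v                          # (B, groups, C//groups, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage example (shapes are illustrative): 14x14 = 196 spatial tokens, 96 channels.
x = torch.randn(2, 196, 96)
y = ChannelGroupAttention(dim=96, groups=8)(x)  # y has shape (2, 196, 96)
```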
Submitted 7 April, 2022; originally announced April 2022.