Extreme Masking for Learning Instance and Distributed Visual Representations
9 Jun 2022
https://arxiv.org/abs/2206.04667
Authors: Zhirong Wu, Zihang Lai, Xiao Sun, Stephen Lin
First author's page: https://www.microsoft.com/en-us/research/people/wuzhiron/
Abstract: The paper presents a scalable approach for learning distributed representations over individual tokens and a holistic instance representation simultaneously. We use self-attention blocks to represent distributed tokens, followed by cross-attention blocks to aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75%-90%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach, in which the instance representation from the unmasked subset is trained to predict that from the intact input. Rather than encouraging invariance across views, learning requires the model to capture informative variations within an instance. The paper makes three contributions: 1) Random masking is a strong and computationally efficient data augmentation for learning generalizable attention representations. 2) With multiple sampling per instance, extreme masking greatly speeds up learning and hungers for more data. 3) Distributed representations can be learned from instance supervision alone, unlike the per-token supervision in masked modeling.
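The abstract itself contains no code, so below is a minimal PyTorch sketch of the two ideas it describes: extreme random token masking (keeping only 10%-25% of tokens) as the augmentation, and a BYOL-style objective where the masked view predicts the instance representation of the intact input. Everything here is an illustrative assumption, not the authors' implementation: the names ExtreMASketch and byol_step, the 768-d precomputed patch features, and the depth/width/predictor choices are all placeholders.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExtreMASketch(nn.Module):
    """Self-attention over visible tokens, then cross-attention to pool one instance vector."""
    def __init__(self, feat_dim=768, dim=256, depth=4, heads=8,
                 num_tokens=196, mask_ratio=0.8):
        super().__init__()
        self.mask_ratio = mask_ratio                       # drop 75%-90% of tokens
        self.token_embed = nn.Linear(feat_dim, dim)        # patch features -> tokens
        self.pos_embed = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, depth)      # distributed tokens
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # instance query
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.projector = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))

    def forward(self, patches, mask=True):
        x = self.token_embed(patches) + self.pos_embed
        if mask:
            # Extreme masking: keep only a small random subset of tokens.
            B, N, D = x.shape
            keep = max(1, int(N * (1.0 - self.mask_ratio)))
            idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :keep]
            x = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
        x = self.self_attn(x)                          # per-token (distributed) reps
        q = self.query.expand(x.size(0), -1, -1)
        inst, _ = self.cross_attn(q, x, x)             # aggregate holistic instance
        return self.projector(inst.squeeze(1))

def byol_step(online, target, predictor, patches, ema=0.996):
    """BYOL-style step: the masked online view predicts the intact target view."""
    z_online = predictor(online(patches, mask=True))   # unmasked-subset branch
    with torch.no_grad():
        z_target = target(patches, mask=False)         # intact-input branch
        # Momentum (EMA) update of the target encoder; in practice this is
        # usually done after the optimizer step.
        for p_o, p_t in zip(online.parameters(), target.parameters()):
            p_t.mul_(ema).add_(p_o, alpha=1.0 - ema)
    return 2 - 2 * F.cosine_similarity(z_online, z_target, dim=-1).mean()

# Usage with random patch features standing in for a real patch tokenizer:
online = ExtreMASketch()
target = copy.deepcopy(online).requires_grad_(False)
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
loss = byol_step(online, target, predictor, torch.randn(2, 196, 768))
loss.backward()
```

Note how the sketch reflects the efficiency claim in the abstract: the self-attention blocks only process the kept 10%-25% of tokens, which is why multiple masked samples per instance are cheap relative to encoding the full input.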