Vision Transformer Adapter for Dense Predictions
Authors: Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
Published May 22, 2022: https://arxiv.org/abs/2205.08534
Code: https://github.com/czczup/ViT-Adapter
This work investigates a simple yet powerful adapter for the Vision Transformer (ViT). Unlike recent vision transformers that build vision-specific inductive biases into their architectures, the plain ViT performs worse on dense prediction tasks because it lacks image-specific prior information. To solve this issue, we propose the Vision Transformer Adapter (ViT-Adapter), which remedies this shortcoming and achieves performance comparable to vision-specific models by introducing inductive biases through an additional architecture. Specifically, the backbone in our framework is a vanilla transformer that can be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a modality-specific adapter injects the prior information of the data and tasks into the model, making it suitable for those tasks. We verify the effectiveness of ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation. Notably, when using HTC++, our ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic segmentation, ViT-Adapter-L establishes a new state of the art of 60.5 mIoU on ADE20K val, 0.6 points higher than SwinV2-G. We hope the proposed ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
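To make the architecture described in the abstract concrete, here is a minimal PyTorch sketch of the two pieces it names: a convolutional module that supplies the image priors a plain ViT lacks, and an injector that feeds those priors into the unmodified ViT tokens during fine-tuning. This is an illustrative sketch, not the authors' implementation: the class names (SpatialPriorModule, InjectionBlock) are placeholders echoing the paper's terminology, the internals are simplified assumptions, and plain multi-head cross-attention stands in for the deformable attention used in the real ViT-Adapter (see https://github.com/czczup/ViT-Adapter for the actual code).

```python
# Illustrative sketch of the ViT-Adapter idea (simplified; not the official code).
# Assumptions: standard cross-attention replaces the paper's deformable attention,
# and a single feature scale replaces its multi-scale spatial priors.
import torch
import torch.nn as nn


class SpatialPriorModule(nn.Module):
    """Small conv stem that extracts the local, image-specific priors
    a plain ViT lacks, returned as a token sequence."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim // 4, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 4, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.stem(img)                    # (B, dim, H/4, W/4)
        return feat.flatten(2).transpose(1, 2)   # (B, N_prior, dim)


class InjectionBlock(nn.Module):
    """Injects prior tokens into ViT tokens via residual cross-attention,
    leaving the ViT backbone itself unmodified."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vit_tokens: torch.Tensor, prior_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the ViT; keys/values carry the image priors.
        out, _ = self.cross_attn(self.norm(vit_tokens), prior_tokens, prior_tokens)
        return vit_tokens + out                  # residual, so the ViT path is preserved


if __name__ == "__main__":
    img = torch.randn(2, 3, 224, 224)
    vit_tokens = torch.randn(2, 196, 768)        # stand-in for features of a vanilla ViT
    priors = SpatialPriorModule()(img)           # (2, 3136, 768)
    fused = InjectionBlock()(vit_tokens, priors) # (2, 196, 768)
    print(fused.shape)
```

Even in this reduced form, the key design point is visible: the ViT backbone itself is untouched, and priors enter only through residual cross-attention, which is why any vanilla (e.g. multi-modal) pre-trained ViT can be reused as-is.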