Vision Transformer Adapter for Dense Predictions
Authors: Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
Published May 22, 2022: https://arxiv.org/abs/2205.08534
Code: https://github.com/czczup/ViT-Adapter
This work investigates a simple yet powerful adapter for the Vision Transformer (ViT). Unlike recent vision transformers that build vision-specific inductive biases into their architectures, the plain ViT performs worse on dense prediction tasks because it lacks image-specific prior information. To solve this issue, we propose the Vision Transformer Adapter (ViT-Adapter), which remedies this shortcoming and achieves performance comparable to vision-specific models by introducing inductive biases through an additional architecture. Specifically, the backbone in our framework is a vanilla transformer that can be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a modality-specific adapter injects the prior information of the data and tasks into the model, making it suitable for those tasks. We verify the effectiveness of ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation. Notably, when using HTC++, our ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic segmentation, ViT-Adapter-L establishes a new state of the art of 60.5 mIoU on ADE20K val, 0.6 points higher than SwinV2-G. We hope the proposed ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
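To make the architecture described in the abstract concrete, here is a minimal PyTorch sketch of the two pieces it names: a convolutional module that supplies the image priors a plain ViT lacks, and an injector that feeds those priors into the unmodified ViT tokens during fine-tuning. This is an illustrative sketch, not the authors' implementation: the class names (SpatialPriorModule, InjectionBlock) are placeholders echoing the paper's terminology, the internals are simplified assumptions, and plain multi-head cross-attention stands in for the deformable attention used in the real ViT-Adapter (see https://github.com/czczup/ViT-Adapter for the actual code).

```python
# Illustrative sketch of the ViT-Adapter idea (simplified; not the official code).
# Assumptions: standard cross-attention replaces the paper's deformable attention,
# and a single feature scale replaces its multi-scale spatial priors.
import torch
import torch.nn as nn


class SpatialPriorModule(nn.Module):
    """Small conv stem that extracts the local, image-specific priors
    a plain ViT lacks, returned as a token sequence."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim // 4, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 4, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.stem(img)                    # (B, dim, H/4, W/4)
        return feat.flatten(2).transpose(1, 2)   # (B, N_prior, dim)


class InjectionBlock(nn.Module):
    """Injects prior tokens into ViT tokens via residual cross-attention,
    leaving the ViT backbone itself unmodified."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vit_tokens: torch.Tensor, prior_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the ViT; keys/values carry the image priors.
        out, _ = self.cross_attn(self.norm(vit_tokens), prior_tokens, prior_tokens)
        return vit_tokens + out                  # residual, so the ViT path is preserved


if __name__ == "__main__":
    img = torch.randn(2, 3, 224, 224)
    vit_tokens = torch.randn(2, 196, 768)        # stand-in for features of a vanilla ViT
    priors = SpatialPriorModule()(img)           # (2, 3136, 768)
    fused = InjectionBlock()(vit_tokens, priors) # (2, 196, 768)
    print(fused.shape)
```

Even in this reduced form, the key design point is visible: the ViT backbone itself is untouched, and priors enter only through residual cross-attention, which is why any vanilla (e.g. multi-modal) pre-trained ViT can be reused as-is.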