A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation
Dec 2021
ECCV 2022
Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou
[University of Texas, Google]
https://arxiv.org/abs/2112.09747
Editor's note: Oddly, the open-source release seems hard to find; the code link in the paper's abstract currently returns a 404. Classification results look unremarkable and lack comparisons against other methods: UViT-B reaches 81.3% top-1 accuracy at 6.9 GFLOPs, whereas Swin-T reaches the same 81.3% top-1 accuracy at only 4.5 GFLOPs. The architecture appears to be designed specifically for dense prediction tasks such as detection and instance segmentation. The detection experiments concentrate on the low-compute regime and never scale up to larger models. That would be understandable for a small team with limited compute, but coming from this author team it raises concerns about the method's scalability.
https://github.com/tensorflow/models/blob/master/official/projects/panoptic/train.py
https://github.com/tensorflow/models/tree/master/official/projects/mae
Abstract: This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers have recently demonstrated competitive performance on image classification tasks. To adapt ViT to object detection and dense prediction tasks, many works inherit the multistage design from convolutional networks and heavily customize the ViT architecture. Behind this design, the goal is to pursue a better trade-off between computational cost and effective aggregation of multiscale global contexts. However, existing works adopt the multistage architectural design as a black-box solution without a clear understanding of its true benefits. In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy. We further derive a scaling rule to optimize the model's trade-off between accuracy and computation cost / model size. By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong performance on COCO object detection and instance segmentation tasks.
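To make the "constant feature resolution and hidden size" idea concrete, here is a minimal PyTorch sketch of a single-scale ViT backbone in the spirit of UViT. This is not the authors' TensorFlow implementation; the class name, layer counts, and widths are illustrative, and positional embeddings are omitted for brevity. The point is that the image is patchified once, after which token count and channel width stay fixed through every encoder block, with no stage-wise downsampling or channel doubling:

```python
# Minimal sketch (assumed config, not the paper's exact one) of a
# single-scale ViT backbone: one patchify stem, then identical
# transformer blocks with constant resolution and hidden size.
import torch
import torch.nn as nn

class SingleScaleViT(nn.Module):
    def __init__(self, patch=16, dim=384, depth=12, heads=6):
        super().__init__()
        # Patchify once; the spatial grid stays fixed afterwards
        # (no pyramid stages, no channel doubling between stages).
        self.stem = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.stem(x)                       # (B, dim, H/16, W/16)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W/256, dim)
        # Positional embeddings omitted for brevity; every block sees
        # the same token count and the same hidden size.
        tokens = self.blocks(tokens)
        # Return a single-scale feature map for a detection head.
        return tokens.transpose(1, 2).reshape(b, c, h, w)

feat = SingleScaleViT()(torch.randn(1, 3, 512, 512))
print(feat.shape)  # torch.Size([1, 384, 32, 32])
```

Under this design, a downstream detection or instance segmentation head consumes one feature map at 1/16 resolution rather than a handcrafted multiscale pyramid.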