EfficientFormer: Vision Transformers at MobileNet Speed
2 Jun 2022 · Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren
https://arxiv.org/abs/2206.01191
https://github.com/snap-research/efficientformer
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally several times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which is even slightly faster than MobileNetV2 (1.7 ms, 71.8% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
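To make the "dimension-consistent" paradigm concrete, below is a minimal sketch of the two block families the abstract alludes to: early stages operate on 4D feature maps with a cheap pooling token mixer plus a convolutional FFN, while the final stage operates on 3D token sequences with multi-head self-attention plus a linear FFN. This is an illustrative PyTorch sketch based on the paper's description, not the released implementation (see the GitHub repo above); the module names `MetaBlock4D`/`MetaBlock3D` and the pool size, head count, and MLP ratio defaults here are assumptions.

```python
import torch
import torch.nn as nn

class PoolMixer4D(nn.Module):
    """Token mixer for the 4D stages: local average pooling minus the
    input (the block adds the residual back), so no attention is needed."""
    def __init__(self, pool_size=3):  # pool_size is illustrative
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W)
        return self.pool(x) - x

class MetaBlock4D(nn.Module):
    """Early-stage block on 4D feature maps: pool mixer + ConvFFN."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        hidden = dim * mlp_ratio
        self.mixer = PoolMixer4D()
        self.ffn = nn.Sequential(  # 1x1 convs act as a per-token MLP
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = x + self.mixer(x)
        x = x + self.ffn(x)
        return x

class MetaBlock3D(nn.Module):
    """Last-stage block on 3D token sequences: MHSA + linear FFN."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = dim * mlp_ratio
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim),
        )

    def forward(self, x):          # x: (B, N, C)
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x

# Quick shape check with made-up sizes:
feat = torch.randn(1, 48, 56, 56)      # high-resolution 4D feature map
tokens = torch.randn(1, 49, 448)       # low-resolution 3D token sequence
print(MetaBlock4D(48)(feat).shape)     # torch.Size([1, 48, 56, 56])
print(MetaBlock3D(448)(tokens).shape)  # torch.Size([1, 49, 448])
```

The design rationale, per the abstract, is latency: pooling and 1x1 convolutions are cheap on large, high-resolution feature maps, while quadratic-cost self-attention is confined to the late stage where the token count is small, avoiding the reshape overhead of mixing MobileNet-style and transformer blocks mid-network.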