Token Merging: Your ViT But Faster
https://arxiv.org/abs/2210.09461
https://github.com/facebookresearch/tome
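For reference, a minimal usage sketch against the linked repository. The tome.patch.timm entry point and the per-layer r attribute are quoted from memory of the repo's README, so treat them as assumptions and verify against the repository before use.

```python
# Sketch: patching a pretrained timm ViT with ToMe, no retraining needed.
# Assumption: the repo exposes tome.patch.timm() and a model.r attribute
# as described in its README; verify against the linked repository.
import timm
import tome

model = timm.create_model("vit_base_patch16_224", pretrained=True)
tome.patch.timm(model)  # swap in ToMe-aware transformer blocks
model.r = 16            # tokens merged per layer; larger r = faster, slightly less accurate
```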
We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video, with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving training speed in practice by up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes the accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with the state of the art on images, video, and audio.
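To make the "light-weight matching" concrete, here is a minimal PyTorch sketch of the bipartite matching-and-merging idea the abstract alludes to: tokens are split into two alternating sets, each token in one set is matched to its most similar counterpart in the other, and the r most similar pairs are merged by averaging. This is an illustration under stated assumptions, not the repository's implementation (the paper matches on attention keys and uses size-weighted averages); all names below are illustrative.

```python
import torch

def bipartite_merge_sketch(x: torch.Tensor, r: int) -> torch.Tensor:
    """Illustrative sketch: merge away r tokens by pairing the most similar ones.

    x: (batch, tokens, dim) token features.
    r: number of tokens to remove; must be at most tokens // 2.
    """
    b, n, d = x.shape
    # Split tokens into two alternating sets A and B.
    a, bset = x[:, ::2, :], x[:, 1::2, :]
    # Cosine similarity between every token in A and every token in B.
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = bset / bset.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.transpose(-1, -2)            # (b, |A|, |B|)
    # For each A token, its single best match in B.
    best_val, best_idx = scores.max(dim=-1)         # (b, |A|)
    # Keep the r highest-scoring edges; those A tokens get merged away.
    order = best_val.argsort(dim=-1, descending=True)
    merged_src = order[:, :r]                       # A tokens to merge
    kept_src = order[:, r:]                         # A tokens to keep
    dst_idx = best_idx.gather(-1, merged_src)       # their targets in B

    out_b = bset.clone()
    src_feats = a.gather(1, merged_src.unsqueeze(-1).expand(-1, -1, d))
    # Average each merged A token into its matched B token (unweighted here;
    # the paper tracks token "size" and uses a weighted average instead).
    out_b.scatter_reduce_(1, dst_idx.unsqueeze(-1).expand(-1, -1, d),
                          src_feats, reduce="mean", include_self=True)
    out_a = a.gather(1, kept_src.unsqueeze(-1).expand(-1, -1, d))
    return torch.cat([out_a, out_b], dim=1)         # n - r tokens remain
```

Repeating this step once per transformer block shrinks the token count gradually, which is where the throughput gains in the abstract come from.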