Token Merging: Your ViT But Faster
https://arxiv.org/abs/2210.09461
https://github.com/facebookresearch/tome
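For reference, a minimal usage sketch against the linked repository. The tome.patch.timm entry point and the per-layer r attribute are quoted from memory of the repo's README, so treat them as assumptions and verify against the repository before use.

```python
# Sketch: patching a pretrained timm ViT with ToMe, no retraining needed.
# Assumption: the repo exposes tome.patch.timm() and a model.r attribute
# as described in its README; verify against the linked repository.
import timm
import tome

model = timm.create_model("vit_base_patch16_224", pretrained=True)
tome.patch.timm(model)  # swap in ToMe-aware transformer blocks
model.r = 16            # tokens merged per layer; larger r = faster, slightly less accurate
```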
We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video, with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving training speed in practice by up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes the accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with the state of the art on images, video, and audio.
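To make the "light-weight matching" concrete, here is a minimal PyTorch sketch of the bipartite matching-and-merging idea the abstract alludes to: tokens are split into two alternating sets, each token in one set is matched to its most similar counterpart in the other, and the r most similar pairs are merged by averaging. This is an illustration under stated assumptions, not the repository's implementation (the paper matches on attention keys and uses size-weighted averages); all names below are illustrative.

```python
import torch

def bipartite_merge_sketch(x: torch.Tensor, r: int) -> torch.Tensor:
    """Illustrative sketch: merge away r tokens by pairing the most similar ones.

    x: (batch, tokens, dim) token features.
    r: number of tokens to remove; must be at most tokens // 2.
    """
    b, n, d = x.shape
    # Split tokens into two alternating sets A and B.
    a, bset = x[:, ::2, :], x[:, 1::2, :]
    # Cosine similarity between every token in A and every token in B.
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = bset / bset.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.transpose(-1, -2)            # (b, |A|, |B|)
    # For each A token, its single best match in B.
    best_val, best_idx = scores.max(dim=-1)         # (b, |A|)
    # Keep the r highest-scoring edges; those A tokens get merged away.
    order = best_val.argsort(dim=-1, descending=True)
    merged_src = order[:, :r]                       # A tokens to merge
    kept_src = order[:, r:]                         # A tokens to keep
    dst_idx = best_idx.gather(-1, merged_src)       # their targets in B

    out_b = bset.clone()
    src_feats = a.gather(1, merged_src.unsqueeze(-1).expand(-1, -1, d))
    # Average each merged A token into its matched B token (unweighted here;
    # the paper tracks token "size" and uses a weighted average instead).
    out_b.scatter_reduce_(1, dst_idx.unsqueeze(-1).expand(-1, -1, d),
                          src_feats, reduce="mean", include_self=True)
    out_a = a.gather(1, kept_src.unsqueeze(-1).expand(-1, -1, d))
    return torch.cat([out_a, out_b], dim=1)         # n - r tokens remain
```

Repeating this step once per transformer block shrinks the token count gradually, which is where the throughput gains in the abstract come from.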