Video MAE

Author: Valar_Morghulis | Published: 2022-05-20 09:53

    Masked Autoencoders As Spatiotemporal Learners

    Authors: Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He

    Paper: https://arxiv.org/abs/2205.09113

    Published: 2022-05-18

    This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (only except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.

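The abstract only sketches the core mechanism at a high level, so here is a minimal illustration (in PyTorch, not the authors' released code) of what spacetime-agnostic random masking at a 90% ratio looks like: every spacetime patch token is dropped independently of its position in time or space, and only the visible tokens are passed on to the encoder. The function name and the (batch, num_patches, embed_dim) tensor layout are assumptions for illustration.

```python
import torch


def random_spacetime_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.9):
    """Drop a random subset of spacetime patch tokens, independent of
    their location in time or space ("spacetime-agnostic" masking).
    Minimal sketch, assuming patch_tokens has shape (B, N, D)."""
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))  # at a 90% ratio only 10% of tokens are kept

    # Per-sample random ordering of the N patch indices.
    noise = torch.rand(B, N, device=patch_tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    # Keep only the visible tokens; the encoder runs on these alone,
    # which is where the reported wall-clock speedup comes from.
    visible = torch.gather(
        patch_tokens, dim=1,
        index=ids_keep.unsqueeze(-1).expand(-1, -1, D),
    )

    # Binary mask over all N positions (1 = masked), telling the decoder
    # which patches it has to reconstruct in pixel space.
    mask = torch.ones(B, N, device=patch_tokens.device)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask, ids_keep


# Example: a 16-frame clip tokenized into 8 x 14 x 14 = 1568 spacetime patches.
tokens = torch.randn(2, 1568, 768)
visible, mask, _ = random_spacetime_masking(tokens)
print(visible.shape)  # torch.Size([2, 156, 768])
```

Because the encoder never sees the ~90% of tokens that are masked, its cost scales with the small visible set, which is consistent with the >4x wall-clock speedup reported in the abstract.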
