https://arxiv.org/abs/2112.10740
Published in December 2021, this is a contemporaneous work in the same spirit as MAE, from:
Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, Edouard Grave
Pre-training models on large scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are order(s) of magnitude smaller than Imagenet. Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets, from different domains. On COCO, when pre-training solely using COCO images, the detection and instance segmentation performance surpasses the supervised ImageNet pre-training in a comparable setting.
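The abstract's central claim rests on denoising autoencoders in the BEiT family: mask out image patches and train a network to reconstruct them from the visible ones. Below is a minimal sketch of that general masked-patch idea only, not the paper's own variant; every module, dimension, and the mask ratio are illustrative assumptions.

```python
# Minimal sketch of masked-patch denoising-autoencoder pre-training
# (the general BEiT/MAE-style idea the abstract refers to, NOT the
# paper's exact variant). All names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, mask_ratio=0.6):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.mask_ratio = mask_ratio
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, patch * patch * 3)  # predict pixels of each patch

    def forward(self, imgs):
        B = imgs.size(0)
        tokens = self.embed(imgs).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        # Randomly replace a fraction of patch tokens with a learned mask token.
        mask = torch.rand(B, self.num_patches, device=imgs.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.head(self.encoder(tokens))                           # (B, N, P*P*3)
        # Reconstruction target: the original pixel patches.
        target = imgs.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        target = target.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        # Loss is computed only on the masked positions, as in masked image modeling.
        loss = ((pred - target) ** 2).mean(-1)
        return (loss * mask).sum() / mask.sum().clamp(min=1)


model = TinyMaskedAutoencoder()
loss = model(torch.randn(2, 3, 224, 224))  # one pre-training step on unlabeled images
loss.backward()
```

Because the objective needs only the images themselves, this kind of pre-training can be run directly on the target-task data (e.g. COCO images), which is the scenario the paper studies.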
![](https://img.haomeiwen.com/i13727053/3524d919208431ec.png)
![](https://img.haomeiwen.com/i13727053/da992fd2044f1ddc.png)
![](https://img.haomeiwen.com/i13727053/af55b619f1a87b10.png)