PaLI: A Jointly-Scaled Multilingual Language-Image Model


Author: Valar_Morghulis | Published 2023-03-10 18:18

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Sep 2022

ICLR 2023 (notable-top-5%)

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

[Google Research]

https://arxiv.org/abs/2209.06794

https://openreview.net/forum?id=mWVoBz4W0u 

ICLR review summary: This work introduces PaLI, a new large-scale vision-language model with an encoder-decoder architecture pretrained on large-scale multilingual image-text pairs. After the author rebuttal, it received scores of 6, 8, 8, 8. All reviewers were satisfied with the paper, agreeing that (1) the results are strong and empirically thorough, (2) sufficient in-depth analyses and ablation studies were conducted to better understand the model, and (3) the paper is well written. On the other hand, the novelty of the model itself is somewhat limited, and reproducibility is low due to the large-scale pretraining on internal data. Overall, the paper makes a solid contribution to the multimodal pretraining community and will interest a broad audience, so the AC recommends acceptance. It demonstrates how to pretrain a large-scale multilingual multimodal model and achieves strong performance. Although the technical novelty is rather limited, the paper remains significant by showing that, and how, model training can be scaled up.

Abstract:

Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
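The interface described above — a ViT encodes the image into patch tokens, which are mapped into the language model's embedding space and combined with the text prompt before the encoder-decoder generates output text — can be sketched in terms of tensor shapes. This is a minimal illustration only: all dimensions, names, and the linear projection are assumptions for the sketch, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sizes for illustration (not the paper's actual dimensions):
d_vit, d_lm = 1024, 768        # ViT hidden size, language-model embedding size
n_patches, n_text = 256, 16    # visual tokens per image, text-prompt tokens

rng = np.random.default_rng(0)

# Stand-in for the ViT's output: one token per image patch.
vit_tokens = rng.normal(size=(n_patches, d_vit))

# Assumed linear projection from the vision space into the LM embedding space.
proj = rng.normal(size=(d_vit, d_lm)) * 0.02
visual_embeds = vit_tokens @ proj          # shape (256, 768)

# Stand-in for the embedded text prompt (e.g. a question or caption prefix).
text_embeds = rng.normal(size=(n_text, d_lm))

# The multimodal input to the encoder-decoder LM: visual tokens followed by
# text tokens in one sequence; the decoder then generates the answer as text.
encoder_input = np.concatenate([visual_embeds, text_embeds], axis=0)
print(encoder_input.shape)                 # (272, 768)
```

The key design point the abstract emphasizes is that both halves of this pipeline — the vision encoder producing `vit_tokens` and the encoder-decoder consuming `encoder_input` — are reused pretrained models and are scaled jointly.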
