"The limits of my language mean the limits of my world." (Ludwig Wittgenstein)
Language Is Not All You Need: Aligning Perception with Language Models
Feb 2023
Shaohan Huang*, Li Dong*, Wenhui Wang*, Yaru Hao*, Saksham Singhal*, Shuming Ma*, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei†
(*Equal contribution. †Corresponding author.)
[Microsoft]
https://arxiv.org/abs/2302.14045
https://github.com/microsoft/unilm
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.
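To make the abstract's core idea concrete, below is a minimal sketch of how an interleaved text-and-image sequence can be fed to a causal language model: each image is encoded into continuous features, projected into the token-embedding space, and spliced into the sequence at its placeholder position before ordinary left-to-right decoding. All module names, dimensions, and the toy projection layer are illustrative assumptions for exposition, not the actual KOSMOS-1 implementation (which uses a CLIP-style vision encoder and a Magneto Transformer backbone).

```python
# Toy interleaved image-text decoding sketch (assumed structure, not KOSMOS-1 code).
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4, vision_dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Projects (e.g., CLIP-style) image features into the language model's embedding space.
        self.vision_proj = nn.Linear(vision_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)  # run with a causal mask below
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, image_feats, image_positions):
        # token_ids:       (B, T) text tokens, with a placeholder id at each image slot
        # image_feats:     (B, N_img, vision_dim) one pooled feature vector per image
        # image_positions: (B, N_img) sequence index where each image embedding is inserted
        x = self.tok_emb(token_ids).clone()                      # (B, T, d_model)
        img = self.vision_proj(image_feats)                      # (B, N_img, d_model)
        # Splice image embeddings into the token-embedding stream at their placeholder slots.
        batch_idx = torch.arange(x.size(0)).unsqueeze(-1)
        x[batch_idx, image_positions] = img
        # Standard left-to-right (causal) attention over the mixed image/text sequence.
        T = x.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.decoder(x, mask=causal_mask)
        return self.lm_head(h)                                   # next-token logits

# Toy usage: one image placeholder inside a 16-token prompt, then next-token prediction.
model = ToyMultimodalLM()
tokens = torch.randint(0, 32000, (1, 16))
feats = torch.randn(1, 1, 768)                                   # stand-in for one image feature
logits = model(tokens, feats, torch.tensor([[3]]))
print(logits.shape)                                              # torch.Size([1, 16, 32000])
```

The point of the sketch is that once images are mapped into the same embedding space as text, zero-shot, few-shot, and multimodal chain-of-thought prompting all reduce to ordinary next-token prediction over an interleaved sequence.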
Figure 1: KOSMOS-1 is a multimodal large language model (MLLM) that is capable of perceiving multimodal input, following instructions, and performing in-context learning for not only language tasks but also multimodal tasks. In this work, we align vision with large language models (LLMs), advancing the trend of going from LLMs to MLLMs.

Figure 2: Selected examples generated from KOSMOS-1. Blue boxes are input prompts and pink boxes are KOSMOS-1 outputs. The examples include (1)-(2) visual explanation, (3)-(4) visual question answering, (5) web page question answering, (6) simple math equation, and (7)-(8) number recognition.

Figure 3: Selected examples generated from KOSMOS-1. Blue boxes are input prompts and pink boxes are KOSMOS-1 outputs. The examples include (1)-(2) image captioning, (3)-(6) visual question answering, (7)-(8) OCR, and (9)-(11) visual dialogue.