1. Introduction
- paper for the Ego4D challenge
- task: Long-Term Action Anticipation (LTA)
- given an input video with annotated action periods, LTA aims to predict possible future actions
- https://github.com/DanDoge/Palm
key idea:
- We argue that a shared feature is not enough to predict future actions, as it cannot model the complex dependencies among them.
- We hypothesize that leveraging commonsense knowledge embedded in large language models can help us discover the underlying structure and dependency of different activities.
Solution:
- input video --> image captioning model + action recognition model --> captions and recognized actions;
- use {captions, actions} to create a prompt --> large language model --> future action anticipation;
![](https://img.haomeiwen.com/i9933353/aa931b20e0bfb263.png)
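To make the pipeline concrete, here is a minimal sketch of the caption-then-prompt flow; all function and attribute names (`caption_model`, `action_model`, `seg.middle_frame`, etc.) are hypothetical stand-ins, not the repo's actual API.

```python
# Minimal sketch of the Palm pipeline; names are hypothetical stand-ins.

def anticipate_actions(video_clip, caption_model, action_model, llm):
    """Describe the observed clip in language, then let an LLM continue it."""
    # 1) Perceive: turn each annotated past action segment into text.
    captions = [caption_model(seg.middle_frame) for seg in video_clip.segments]
    actions = [action_model(seg) for seg in video_clip.segments]  # (verb, noun) pairs

    # 2) Prompt: interleave captions with the recognized actions.
    past = "\n".join(f"{c} -> {v} {n}" for c, (v, n) in zip(captions, actions))
    prompt = (
        "Predict the next 20 actions as verb-noun pairs.\n"
        f"Past actions:\n{past}\nFuture actions:"
    )

    # 3) Reason: the LLM completes the action sequence in the semantic space.
    return llm(prompt)
```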
2. Method
use natural language to describe past events and perform reasoning and prediction in the semantic space, leveraging the commonsense knowledge embedded in large language models.
2.1 Task
long-term action anticipation (LTA)
Given a video clip of roughly 5 minutes, with the temporal boundary of every action in the video annotated, the task is to predict a sequence of 20 future actions, each described by a (verb, noun) pair.
2.2 Prompt Design
formulate the action anticipation task as a sentence completion task
Prompting the LLMs this way lets them predict future actions from the given descriptions of past actions.
Prompt template:
![](https://img.haomeiwen.com/i9933353/3e706e759f526a92.png)
Here, 1) the instruction paragraph in the red box guides the LLMs; 2) the blue box contains training examples, where N is the number of past actions; for each past action, both the caption and the action are given to the LLM, and Z denotes the ground-truth future actions; 3) the green box is the prediction part, where N' past captions and actions are given, with N' > N.
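A rough sketch of how such a three-part prompt could be assembled; the instruction wording and the separator formatting are assumptions, not the paper's exact template.

```python
# Illustrative three-part prompt builder; INSTRUCTION text and formatting
# are assumptions, not the paper's exact template.

INSTRUCTION = (
    "You will see captions and actions observed in an egocentric video. "
    "Predict the future actions as comma-separated verb-noun pairs."
)

def format_example(captions, actions, future_actions=None):
    # One example: past (caption, action) lines, then the future actions Z.
    past = "; ".join(f"{c} ({v} {n})" for c, (v, n) in zip(captions, actions))
    block = f"Past: {past}\nFuture:"
    if future_actions is not None:  # training example -> append ground truth Z
        block += " " + ", ".join(f"{v} {n}" for v, n in future_actions)
    return block

def build_prompt(train_examples, query_captions, query_actions):
    """train_examples: list of (captions, actions, future_actions), each with
    N past actions; the query supplies N' > N past pairs and no future part."""
    parts = [INSTRUCTION]
    parts += [format_example(c, a, z) for c, a, z in train_examples]
    parts.append(format_example(query_captions, query_actions))
    return "\n\n".join(parts)
```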
past actions:
- backbone: EgoVLP, a vision-language model trained on the Ego4D dataset, for extracting video features
- head: a transformer that aggregates the input video features into a single feature vector, plus two classification heads to predict the verbs and the nouns (sketched below).
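A PyTorch sketch of such a head, assuming pooling via a learnable [CLS] token; the head count, layer count, and vocabulary sizes are placeholders, not the paper's configuration.

```python
# Sketch of the recognition head on top of EgoVLP features; the [CLS]-token
# pooling and all hyperparameters here are assumptions.
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    def __init__(self, dim, n_verbs, n_nouns, n_layers=2):
        super().__init__()
        # dim must be divisible by nhead; n_verbs/n_nouns are the label-space sizes.
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [CLS] token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.verb_head = nn.Linear(dim, n_verbs)  # verb classifier
        self.noun_head = nn.Linear(dim, n_nouns)  # noun classifier

    def forward(self, feats):  # feats: (B, T, dim) EgoVLP video features
        cls = self.cls.expand(feats.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, feats], dim=1))
        pooled = x[:, 0]  # single feature vector for the whole input
        return self.verb_head(pooled), self.noun_head(pooled)
```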
narrations:
- to provide more information about the visual input
- thus, for each past action, its middle frame is used to generate a caption starting with the prefix "a person is"
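The "a person is" prefix comes from the paper; which captioning model is used here (BLIP via Hugging Face) is an assumption for illustration.

```python
# Sketch of caption generation for each past action's middle frame.
# The "a person is" prefix is from the paper; using BLIP is an assumption.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def caption_middle_frame(frame: Image.Image) -> str:
    # Conditional captioning: the prefix steers the caption toward the person.
    inputs = processor(frame, text="a person is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)
```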
prompt selection:
- iteratively select a set of examples from the training set;
- the examples should be semantically close to the query prompt, but also diverse enough.
![](https://img.haomeiwen.com/i9933353/6c69ec834a5e38f4.png)
- S: semantic similarity between two descriptions
- specifically, MPNet is used to extract the textual embeddings, and cosine similarity is used to measure the similarity.
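A sketch of the similarity-plus-diversity selection as a greedy MMR-style heuristic; the exact selection rule and the weight `lam` are assumptions, while the MPNet encoder and cosine similarity follow the description above.

```python
# Greedy example selection: similar to the query, dissimilar to what is
# already chosen (an MMR-style heuristic; the rule and lam are assumptions).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # MPNet text embeddings

def select_examples(query: str, candidates: list[str], k: int = 4, lam: float = 0.5):
    q = encoder.encode(query, convert_to_tensor=True)
    c = encoder.encode(candidates, convert_to_tensor=True)
    sim_q = util.cos_sim(c, q).squeeze(1)  # S(candidate, query)
    sim_cc = util.cos_sim(c, c)            # S(candidate, candidate)
    selected = []
    for _ in range(k):
        best, best_score = None, float("-inf")
        for i in range(len(candidates)):
            if i in selected:
                continue
            # relevance minus redundancy w.r.t. already-chosen examples
            red = max((sim_cc[i][j].item() for j in selected), default=0.0)
            score = lam * sim_q[i].item() - (1 - lam) * red
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [candidates[i] for i in selected]
```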
2.3 LLM Inference
- extract the verb-noun pairs and append them to our prediction if both the verb and the noun fall in the Ego4D label space (i.e., a closed-set prediction).
- for predictions with fewer than 20 actions, pad them with the last action.
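A sketch of this post-processing step; the output parsing format is an assumption, while the closed-set filter and last-action padding follow the description above.

```python
# Parse the LLM output into exactly 20 (verb, noun) pairs; the comma/newline
# parsing format is an assumption.

def postprocess(llm_output: str, verb_set: set, noun_set: set, horizon: int = 20):
    preds = []
    for token in llm_output.replace("\n", ",").split(","):
        words = token.strip().split()
        if len(words) != 2:
            continue  # skip malformed fragments
        verb, noun = words
        # keep only pairs inside the Ego4D label space (closed-set prediction)
        if verb in verb_set and noun in noun_set:
            preds.append((verb, noun))
    # pad short predictions with the last action; truncate long ones
    while preds and len(preds) < horizon:
        preds.append(preds[-1])
    return preds[:horizon]
```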
3. Results
achieves 1st place in the CVPR 2023 Ego4D LTA challenge.
![](https://img.haomeiwen.com/i9933353/cfadd98da46e8b10.png)
![](https://img.haomeiwen.com/i9933353/a43a6055ac8a1bb8.png)