1. Introduction
- paper for the Ego4D challenge
- task: Long-Term Action Anticipation (LTA)
- given an input video with annotated action periods, LTA aims to predict possible future actions
- https://github.com/DanDoge/Palm
key idea:
- We argue that a shared feature is not enough to predict future actions, as it cannot model the complex dependencies among them.
- We hypothesize that leveraging commonsense knowledge embedded in large language models can help us discover the underlying structure and dependency of different activities.
Solution:
- input video --> image captioning model + action recognition model --> captions and recognized actions;
- use {captions, actions} to create a prompt --> large language model --> future action anticipation;
![](https://img.haomeiwen.com/i9933353/aa931b20e0bfb263.png)
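To make the pipeline concrete, here is a minimal sketch of the caption-then-prompt flow; all function and attribute names (`caption_model`, `action_model`, `seg.middle_frame`, etc.) are hypothetical stand-ins, not the repo's actual API.

```python
# Minimal sketch of the Palm pipeline; names are hypothetical stand-ins.

def anticipate_actions(video_clip, caption_model, action_model, llm):
    """Describe the observed clip in language, then let an LLM continue it."""
    # 1) Perceive: turn each annotated past action segment into text.
    captions = [caption_model(seg.middle_frame) for seg in video_clip.segments]
    actions = [action_model(seg) for seg in video_clip.segments]  # (verb, noun) pairs

    # 2) Prompt: interleave captions with the recognized actions.
    past = "\n".join(f"{c} -> {v} {n}" for c, (v, n) in zip(captions, actions))
    prompt = (
        "Predict the next 20 actions as verb-noun pairs.\n"
        f"Past actions:\n{past}\nFuture actions:"
    )

    # 3) Reason: the LLM completes the action sequence in the semantic space.
    return llm(prompt)
```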
2. Method
use natural language to describe past events and perform reasoning and prediction in the semantic space, leveraging the commonsense knowledge embedded in large language models.
2.1 Task
long-term action anticipation (LTA)
Given a video clip of roughly 5 minutes, with the temporal boundary of every action in the video annotated, the task is to predict a sequence of 20 future actions, each described by a (verb, noun) pair.
2.2 Prompt Design
formulate the action anticipation task as a sentence completion task
Prompting the LLMs this way lets them predict future actions from the given descriptions of past actions.
Prompt template:
![](https://img.haomeiwen.com/i9933353/3e706e759f526a92.png)
Here, 1) the instruction paragraph in the red box guides the LLMs; 2) the blue box contains training examples, where N is the number of past actions; for each past action, both the caption and the action are given to the LLM, and Z denotes the ground-truth future actions; 3) the green box is the prediction part, where N' past captions and actions are given, with N' > N.
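A rough sketch of how such a three-part prompt could be assembled; the instruction wording and the separator formatting are assumptions, not the paper's exact template.

```python
# Illustrative three-part prompt builder; INSTRUCTION text and formatting
# are assumptions, not the paper's exact template.

INSTRUCTION = (
    "You will see captions and actions observed in an egocentric video. "
    "Predict the future actions as comma-separated verb-noun pairs."
)

def format_example(captions, actions, future_actions=None):
    # One example: past (caption, action) lines, then the future actions Z.
    past = "; ".join(f"{c} ({v} {n})" for c, (v, n) in zip(captions, actions))
    block = f"Past: {past}\nFuture:"
    if future_actions is not None:  # training example -> append ground truth Z
        block += " " + ", ".join(f"{v} {n}" for v, n in future_actions)
    return block

def build_prompt(train_examples, query_captions, query_actions):
    """train_examples: list of (captions, actions, future_actions), each with
    N past actions; the query supplies N' > N past pairs and no future part."""
    parts = [INSTRUCTION]
    parts += [format_example(c, a, z) for c, a, z in train_examples]
    parts.append(format_example(query_captions, query_actions))
    return "\n\n".join(parts)
```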
past actions:
- backbone: EgoVLP, a vision-language model trained on the Ego4D dataset, for extracting video features
- head: a transformer that aggregates the input video features into a single feature vector, plus two classification heads to predict the verbs and the nouns (sketched below).
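A PyTorch sketch of such a head, assuming pooling via a learnable [CLS] token; the head count, layer count, and vocabulary sizes are placeholders, not the paper's configuration.

```python
# Sketch of the recognition head on top of EgoVLP features; the [CLS]-token
# pooling and all hyperparameters here are assumptions.
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    def __init__(self, dim, n_verbs, n_nouns, n_layers=2):
        super().__init__()
        # dim must be divisible by nhead; n_verbs/n_nouns are the label-space sizes.
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [CLS] token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.verb_head = nn.Linear(dim, n_verbs)  # verb classifier
        self.noun_head = nn.Linear(dim, n_nouns)  # noun classifier

    def forward(self, feats):  # feats: (B, T, dim) EgoVLP video features
        cls = self.cls.expand(feats.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, feats], dim=1))
        pooled = x[:, 0]  # single feature vector for the whole input
        return self.verb_head(pooled), self.noun_head(pooled)
```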
narrations:
- to provide more information about the visual input
- thus, for each past action, its middle frame is used to generate a caption starting with the prefix "a person is"
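The "a person is" prefix comes from the paper; which captioning model is used here (BLIP via Hugging Face) is an assumption for illustration.

```python
# Sketch of caption generation for each past action's middle frame.
# The "a person is" prefix is from the paper; using BLIP is an assumption.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def caption_middle_frame(frame: Image.Image) -> str:
    # Conditional captioning: the prefix steers the caption toward the person.
    inputs = processor(frame, text="a person is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)
```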
prompt selection:
- iteratively select a set of examples from the training set;
- the examples should be semantically close to the query prompt, but also diverse enough.
![](https://img.haomeiwen.com/i9933353/6c69ec834a5e38f4.png)
- S: semantic similarity between two descriptions
- specifically, MPNet is used to extract the textual embeddings, and cosine similarity is used to measure the similarity.
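A sketch of the similarity-plus-diversity selection as a greedy MMR-style heuristic; the exact selection rule and the weight `lam` are assumptions, while the MPNet encoder and cosine similarity follow the description above.

```python
# Greedy example selection: similar to the query, dissimilar to what is
# already chosen (an MMR-style heuristic; the rule and lam are assumptions).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # MPNet text embeddings

def select_examples(query: str, candidates: list[str], k: int = 4, lam: float = 0.5):
    q = encoder.encode(query, convert_to_tensor=True)
    c = encoder.encode(candidates, convert_to_tensor=True)
    sim_q = util.cos_sim(c, q).squeeze(1)  # S(candidate, query)
    sim_cc = util.cos_sim(c, c)            # S(candidate, candidate)
    selected = []
    for _ in range(k):
        best, best_score = None, float("-inf")
        for i in range(len(candidates)):
            if i in selected:
                continue
            # relevance minus redundancy w.r.t. already-chosen examples
            red = max((sim_cc[i][j].item() for j in selected), default=0.0)
            score = lam * sim_q[i].item() - (1 - lam) * red
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [candidates[i] for i in selected]
```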
2.3 LLM Inference
- extract the verb-noun pairs and append them to our prediction if both the verb and the noun fall in the Ego4D label space (i.e., a closed-set prediction).
- for predictions with fewer than 20 actions, pad them with the last action.
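A sketch of this post-processing step; the output parsing format is an assumption, while the closed-set filter and last-action padding follow the description above.

```python
# Parse the LLM output into exactly 20 (verb, noun) pairs; the comma/newline
# parsing format is an assumption.

def postprocess(llm_output: str, verb_set: set, noun_set: set, horizon: int = 20):
    preds = []
    for token in llm_output.replace("\n", ",").split(","):
        words = token.strip().split()
        if len(words) != 2:
            continue  # skip malformed fragments
        verb, noun = words
        # keep only pairs inside the Ego4D label space (closed-set prediction)
        if verb in verb_set and noun in noun_set:
            preds.append((verb, noun))
    # pad short predictions with the last action; truncate long ones
    while preds and len(preds) < horizon:
        preds.append(preds[-1])
    return preds[:horizon]
```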
3. Results
achieves 1st place in the CVPR 2023 Ego4D LTA challenge.
![](https://img.haomeiwen.com/i9933353/cfadd98da46e8b10.png)
![](https://img.haomeiwen.com/i9933353/a43a6055ac8a1bb8.png)