Paper | Palm: Predicting Actions

Author: 与阳光共进早餐 | Published on 2023-09-28 17:29

1. Introduction

  • paper for Ego4D challenge
  • task: Long-Term Action Anticipation (LTA) [long-term video action prediction]
  • given an input video with annotated action segments, LTA aims to predict the sequence of future actions
  • https://github.com/DanDoge/Palm

key idea:

  • We argue that a shared feature is not enough to predict future actions, as it cannot model the complex dependencies among activities.
  • We hypothesize that leveraging commonsense knowledge embedded in large language models can help us discover the underlying structure and dependency of different activities.

Solution:

  1. input video --> image captioning model + action recognition model --> captions and recognized actions;
  2. use {captions, actions} to create a prompt --> large language model --> future action anticipation (see the sketch after this list).
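
A minimal sketch of this two-stage pipeline, assuming callables for each component; all names here are placeholders rather than the repository's actual API, and `build_prompt` / `parse_prediction` are sketched in Section 2 below.

```python
def anticipate(clips, caption_model, action_model, llm,
               examples, verb_vocab, noun_vocab):
    """End-to-end LTA sketch: video -> text -> LLM -> future (verb, noun) pairs."""
    captions = [caption_model(c) for c in clips]        # stage 1a: describe each clip
    actions = [action_model(c) for c in clips]          # stage 1b: recognize (verb, noun)
    prompt = build_prompt(examples, captions, actions)  # stage 2: in-context prompt
    return parse_prediction(llm(prompt), verb_vocab, noun_vocab)
```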

2. Method

The method uses natural language to describe past events and performs reasoning and prediction in the semantic space, leveraging the commonsense knowledge embedded in large language models.

2.1 Task

long-term action anticipation (LTA)

Given a video clip of roughly 5 minutes, together with the temporal boundaries of every action in the clip, the task is to predict a sequence of 20 future actions, each described by a (verb, noun) pair.
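
For concreteness, a hypothetical example of the expected output format; the specific verbs and nouns below are illustrative, not taken from the dataset:

```python
# 20 future actions, each a (verb, noun) pair from the Ego4D taxonomy.
# The entries below are illustrative placeholders.
prediction = [
    ("take", "knife"),
    ("cut", "onion"),
    ("put", "knife"),
    # ... 17 more pairs, 20 in total
]
```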

2.2 Prompt Design

formulate the action anticipation task as a sentence completion task

The LLM is prompted so that it can predict future actions from descriptions of the given past actions.

The prompt template (shown as a figure in the original post, not reproduced here) has three parts: 1) an instruction paragraph (red box) that guides the LLM; 2) in-context training examples (blue box), where N is the number of past actions (for each past action, both the caption and the action are given to the LLM) and Z is the number of ground-truth future actions; 3) the prediction part (green box), which provides N' past captions and actions, with N' > N.
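
A sketch of how such a prompt might be assembled; the instruction wording and the "Past:" / "Future:" markers are assumptions, not the paper's exact template.

```python
def build_prompt(examples, query_captions, query_actions):
    # Instruction paragraph ("red box"): tells the LLM what to do.
    parts = ["Predict the next actions a person will perform, "
             "as (verb, noun) pairs, given past captions and actions.\n"]

    # In-context examples ("blue box"): each pairs N past action
    # descriptions with Z ground-truth future actions.
    for past, future in examples:
        parts.append("Past: " + "; ".join(past))
        parts.append("Future: " + "; ".join(future) + "\n")

    # Query ("green box"): N' > N past captions and actions; the future
    # part is left blank for the LLM to complete.
    past = [f"{cap} ({verb}, {noun})"
            for cap, (verb, noun) in zip(query_captions, query_actions)]
    parts.append("Past: " + "; ".join(past))
    parts.append("Future:")
    return "\n".join(parts)
```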

past actions:

  • backbone: EgoVLP, a vision-language model trained on the Ego4D dataset, for extracting video features
  • head: a transformer that aggregates the input video features into a single feature vector, plus two classification heads that predict the verb and the noun (a sketch follows this list).
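
A minimal PyTorch sketch of such a head, assuming a [CLS]-style token for aggregation; the feature dimension, layer counts, and verb/noun class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    def __init__(self, feat_dim=256, num_verbs=115, num_nouns=478):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, feat_dim))  # [CLS]-style token
        self.verb_head = nn.Linear(feat_dim, num_verbs)
        self.noun_head = nn.Linear(feat_dim, num_nouns)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        cls = self.cls.expand(feats.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, feats], dim=1))
        pooled = x[:, 0]                           # single aggregated feature vector
        return self.verb_head(pooled), self.noun_head(pooled)
```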

narrations:

  • to provide more information about the visual input
  • thus, for each past action, the middle frame is used to generate a caption that starts with the prefix "a person is" (see the sketch below)
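
A sketch of prefix-conditioned captioning for the middle frame. The post does not name the captioning model; BLIP is used here only as an illustrative stand-in that supports a text prefix.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative model choice (an assumption): BLIP supports conditional
# captioning, so the generated caption is forced to start with "a person is".
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def caption_middle_frame(frame: Image.Image) -> str:
    inputs = processor(frame, text="a person is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)
```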

prompt selection:

  • iteratively select a set of examples p_{i} from the training set D;
  • the selected examples should be semantically close to the query prompt q but also diverse enough.
  • S: the semantic similarity between two descriptions
  • specifically, MPNet is used to extract the textual embeddings, and cosine similarity is used to measure their closeness (a sketch follows this list).
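
A sketch of this selection step using the sentence-transformers MPNet checkpoint; the greedy "close to the query but not a near-duplicate of an already-picked example" rule is an assumption about how the diversity criterion might be implemented.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # MPNet text embeddings

def select_examples(query, candidates, k=4, dup_threshold=0.9):
    q = encoder.encode(query, convert_to_tensor=True)
    cand = encoder.encode(candidates, convert_to_tensor=True)
    # Rank candidates by cosine similarity S to the query prompt q.
    order = util.cos_sim(q, cand)[0].argsort(descending=True)
    picked = []
    for i in order.tolist():
        # Keep the set diverse: skip near-duplicates of picked examples.
        if all(util.cos_sim(cand[i], cand[j]).item() < dup_threshold
               for j in picked):
            picked.append(i)
        if len(picked) == k:
            break
    return [candidates[i] for i in picked]
```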

2.3 LLM Inference

  • extract the verb-noun pairs from the LLM output and append them to the prediction only if both the verb and the noun fall in the Ego4D label space (i.e., a closed-set prediction)
  • predictions with fewer than 20 actions are padded with the last action (see the sketch after this list)
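
A sketch of this closed-set decoding step; the regex is an assumption about the LLM's output format (e.g. "(take, knife)").

```python
import re

def parse_prediction(text, verb_vocab, noun_vocab, num_actions=20):
    actions = []
    # Keep only (verb, noun) pairs that fall in the Ego4D label space.
    for verb, noun in re.findall(r"\((\w+),\s*(\w+)\)", text):
        if verb in verb_vocab and noun in noun_vocab:
            actions.append((verb, noun))
    actions = actions[:num_actions]
    # Pad short predictions by repeating the last action.
    while actions and len(actions) < num_actions:
        actions.append(actions[-1])
    return actions
```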

3. Results

The method achieved 1st place in the Ego4D LTA challenge at CVPR 2023.
