Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos

Yuheng Shi, Xinxiao Wu, Hanxi Lin, Jiebo Luo

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Few-shot action recognition in videos is challenging because the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. The text proposals are collected either by filling in a handcrafted sentence template with an external action-related corpus or by extracting action-related phrases from the captions of Web instructional videos. Next, we feed these text proposals, along with video frames, to a pre-trained vision-language model to generate a matching score for each proposal on each frame; these scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of that of existing methods. Code is available at https://github.com/OldStone0124/Knowledge-Prompting-for-FSAR.
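
The pipeline described in the abstract (template filling, per-frame matching scores, temporal aggregation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the template wording, the function names, the toy two-dimensional embeddings standing in for the outputs of a pre-trained vision-language model, the use of cosine similarity as the matching score, and the plain mean over frames standing in for the learned temporal modeling network.

```python
import math

def fill_template(actions, template="a video of a person {}"):
    """Build text proposals by filling a sentence template (wording is hypothetical)."""
    return [template.format(action) for action in actions]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def matching_scores(frame_embs, text_embs):
    """Return a T x P matrix: the score of each of P proposals for each of T frames."""
    return [[cosine(f, t) for t in text_embs] for f in frame_embs]

def temporal_mean(scores):
    """Stand-in for temporal modeling: average each proposal's score over frames."""
    n_frames = len(scores)
    return [sum(column) / n_frames for column in zip(*scores)]

# Toy data: in practice the embeddings would come from a vision-language model.
proposals = fill_template(["pouring water", "cutting bread"])
text_embs = [[1.0, 0.0], [0.0, 1.0]]                # one embedding per proposal
frame_embs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # one embedding per frame

scores = matching_scores(frame_embs, text_embs)     # per-frame action semantics
video_semantics = temporal_mean(scores)             # video-level summary
```

In this toy setup each row of `scores` describes one frame in terms of the text proposals, so the sequence of rows traces how the action semantics evolve over time. A learned temporal model would replace `temporal_mean` to exploit that ordering, which a simple average discards.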

Original language: English
Pages (from-to): 1-11
Number of pages: 11
Journal: IEEE Transactions on Multimedia
Publication status: Accepted/In press - 2024

Keywords

  • Few-shot action recognition
  • Proposals
  • Semantics
  • Task analysis
  • Text recognition
  • Training
  • Videos
  • Visualization
  • action semantics
  • knowledge prompting
  • pre-trained vision-language model
