Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos

Yuheng Shi, Xinxiao Wu, Hanxi Lin, Jiebo Luo

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Few-shot action recognition in videos is challenging because the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. Text proposals are collected either by filling a handcrafted sentence template with entries from an external action-related corpus or by extracting action-related phrases from the captions of Web instruction videos. Next, we feed these text proposals, along with video frames, to a pre-trained vision-language model to generate matching scores of the proposals for each frame; these scores can be treated as action semantics with strong generalization ability. Finally, we design a lightweight temporal modeling network that captures the temporal evolution of the action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of that of existing methods. Code is available at https://github.com/OldStone0124/Knowledge-Prompting-for-FSAR.
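For concreteness, the pipeline described in the abstract can be sketched roughly as follows, assuming CLIP as the pre-trained vision-language model. The proposal texts and the temporal head (a single 1-D convolution) here are illustrative placeholders, not the authors' exact design; see the released code above for the actual implementation.

```python
# A minimal sketch of the knowledge-prompting pipeline, assuming OpenAI CLIP
# as the pre-trained vision-language model. Proposal texts, frame sampling,
# and the temporal head are hypothetical stand-ins for illustration only.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text proposals from the action knowledge base (hypothetical examples).
proposals = [
    "a person is chopping vegetables",
    "a person is swinging a golf club",
    "a person is pouring water into a cup",
]
text_tokens = clip.tokenize(proposals).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text_tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def frame_scores(frames):
    """Matching scores of all proposals for each frame.

    frames: (T, 3, 224, 224) tensor of preprocessed video frames.
    Returns: (T, P) score matrix, i.e. the per-frame 'action semantics'.
    """
    with torch.no_grad():
        img_feat = model.encode_image(frames.to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return img_feat @ text_feat.T  # cosine similarity, frame x proposal

class TemporalHead(nn.Module):
    """Lightweight temporal modeling over the score sequence."""
    def __init__(self, num_proposals, num_classes, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(num_proposals, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, scores):                  # scores: (B, T, P)
        x = self.conv(scores.transpose(1, 2))   # (B, hidden, T)
        x = x.relu().mean(dim=-1)               # temporal average pooling
        return self.fc(x)                       # (B, num_classes) logits
```

Only the small temporal head is trained in this sketch while CLIP stays frozen, which is consistent with the reported reduction of training cost to a fraction of that of existing methods.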

Original language: English
Pages (from-to): 1-11
Number of pages: 11
Journal: IEEE Transactions on Multimedia
DOI: 10.1109/TMM.2024.3361157
Publication status: Accepted/In press - 2024


Cite this

Shi, Y., Wu, X., Lin, H., & Luo, J. (Accepted/In press). Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos. IEEE Transactions on Multimedia, 1-11. https://doi.org/10.1109/TMM.2024.3361157