Abstract
Few-shot action recognition in videos is challenging because the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. Text proposals are collected by filling in a handcrafted sentence template with an external action-related corpus or by extracting action-related phrases from captions of Web instructional videos. Next, we feed these text proposals, along with video frames, to a pre-trained vision-language model to generate matching scores of the proposals for each frame; these scores can be treated as action semantics with strong generalization ability. Finally, we design a lightweight temporal modeling network that captures the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of that of existing methods. Code is available at https://github.com/OldStone0124/Knowledge-Prompting-for-FSAR.
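The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' released implementation (see the GitHub link above for that); it assumes CLIP via the OpenAI `clip` package as the pre-trained vision-language model, a tiny hand-written proposal list in place of the action knowledge base, and a simple temporal convolution as the lightweight temporal modeling network. All names and hyper-parameters here are illustrative.

```python
# Minimal sketch of the knowledge-prompting pipeline (illustrative only).
# Assumes PyTorch and the OpenAI `clip` package; proposals, hyper-parameters,
# and TemporalHead are placeholders, not the paper's actual components.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
vlm, preprocess = clip.load("ViT-B/32", device=device)  # frozen vision-language model

# Example text proposals standing in for the external action knowledge base.
proposals = [
    "a person is holding a knife",
    "a person is cutting vegetables",
    "a person is stirring a pot",
]
with torch.no_grad():
    text_feat = vlm.encode_text(clip.tokenize(proposals).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)  # (P, D)


def frame_scores(frames):
    """Match each preprocessed video frame against all text proposals.

    frames: (T, 3, 224, 224) tensor of frames already passed through `preprocess`.
    returns: (T, P) matrix of matching scores, treated as per-frame action semantics.
    """
    with torch.no_grad():
        img_feat = vlm.encode_image(frames.to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        return img_feat @ text_feat.t()


class TemporalHead(nn.Module):
    """Lightweight temporal model over the per-frame score sequence."""

    def __init__(self, num_proposals, num_classes, hidden=256):
        super().__init__()
        self.temporal = nn.Conv1d(num_proposals, hidden, kernel_size=3, padding=1)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, scores):  # scores: (B, T, P)
        x = self.temporal(scores.transpose(1, 2))  # (B, hidden, T)
        x = x.relu().mean(dim=-1)                  # temporal pooling
        return self.classifier(x)                  # (B, num_classes)
```

In this sketch only `TemporalHead` would be trained; the vision-language model stays frozen and is queried once per frame, which is consistent with the abstract's claim of a much lower training cost than methods that fine-tune the video backbone.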
| Original language | English |
| --- | --- |
| Pages (from-to) | 1-11 |
| Number of pages | 11 |
| Journal | IEEE Transactions on Multimedia |
| DOIs | |
| Publication status | Accepted/In press - 2024 |
Keywords
- Few-shot action recognition
- Proposals
- Semantics
- Task analysis
- Text recognition
- Training
- Videos
- Visualization
- action semantics
- knowledge prompting
- pre-trained vision-language model