Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos

Yuheng Shi; Xinxiao Wu; Hanxi Lin; Jiebo Luo

doi:10.1109/TMM.2024.3361157

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Few-shot action recognition in videos is challenging as the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. The collection of text proposals is done by filling in a handcraft sentence template with an external action-related corpus or by extracting action-related phrases from captions of Web instruction videos. Next, we feed these text proposals to a pre-trained vision-language model along with video frames to generate matching scores of the proposals for each frame, and the scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of the existing methods. Code is available at https://github.com/OldStone0124/Knowledge-Prompting-for-FSAR.
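
The pipeline described in the abstract (per-frame matching scores against text proposals, followed by temporal modeling and few-shot classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `matching_scores`, `temporal_descriptor`, and `classify` are hypothetical, the embeddings are random placeholders standing in for the outputs of a pre-trained vision-language model, and the mean-plus-trend descriptor with nearest-prototype classification is a deliberately simple stand-in for the lightweight temporal modeling network.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def matching_scores(frame_embs, proposal_embs):
    """Cosine similarity of every frame against every text proposal.

    Returns a T x P nested list (T frames, P proposals). In a real system
    both inputs would come from a vision-language model; here they are
    assumed to be plain lists of floats of equal dimension.
    """
    frames = [l2_normalize(f) for f in frame_embs]
    proposals = [l2_normalize(p) for p in proposal_embs]
    return [[sum(a * b for a, b in zip(f, p)) for p in proposals] for f in frames]


def temporal_descriptor(scores):
    """Toy stand-in for a temporal model (an assumption, not the paper's network).

    Concatenates the mean score per proposal with the mean frame-to-frame
    change per proposal, giving a vector of length 2 * P.
    """
    num_frames, num_props = len(scores), len(scores[0])
    mean = [sum(row[p] for row in scores) / num_frames for p in range(num_props)]
    if num_frames > 1:
        trend = [
            (scores[-1][p] - scores[0][p]) / (num_frames - 1)
            for p in range(num_props)
        ]
    else:
        trend = [0.0] * num_props
    return mean + trend


def classify(query_desc, prototypes):
    """Return the label whose prototype is nearest in squared Euclidean distance."""
    def dist(label):
        return sum((q - c) ** 2 for q, c in zip(query_desc, prototypes[label]))
    return min(prototypes, key=dist)


# Placeholder data: 4 frames and 3 text proposals in an 8-dimensional space.
rng = random.Random(0)
frame_embs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
proposal_embs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]

scores = matching_scores(frame_embs, proposal_embs)
descriptor = temporal_descriptor(scores)
print(len(scores), len(scores[0]), len(descriptor))
```

In a few-shot episode, one prototype per class would be built from the support videos' descriptors (for example by averaging them), and each query video would be assigned with `classify`. The point of the sketch is only that classification operates on proposal-matching scores rather than on raw visual features.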

Original language	English
Pages (from-to)	1-11
Number of pages	11
Journal	IEEE Transactions on Multimedia
DOIs	https://doi.org/10.1109/TMM.2024.3361157
Publication status	Accepted/In press - 2024

Keywords

Few-shot action recognition
Proposals
Semantics
Task analysis
Text recognition
Training
Videos
Visualization
action semantics
knowledge prompting
pre-trained vision-language model

Access to Document

10.1109/TMM.2024.3361157

Cite this

@article{461e7ca85f6b467387a3985b64a86527,
title = "Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos",
abstract = "Few-shot action recognition in videos is challenging as the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. The collection of text proposals is done by filling in a handcraft sentence template with an external action-related corpus or by extracting action-related phrases from captions of Web instruction videos. Next, we feed these text proposals to a pre-trained vision-language model along with video frames to generate matching scores of the proposals for each frame, and the scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of the existing methods. Code is available at https://github.com/OldStone0124/Knowledge-Prompting-for-FSAR.",
keywords = "Few-shot action recognition, Proposals, Semantics, Task analysis, Text recognition, Training, Videos, Visualization, action semantics, knowledge prompting, pre-trained vision-language model",
author = "Yuheng Shi and Xinxiao Wu and Hanxi Lin and Jiebo Luo",
note = "Publisher Copyright: IEEE",
year = "2024",
doi = "10.1109/TMM.2024.3361157",
language = "English",
pages = "1--11",
journal = "IEEE Transactions on Multimedia",
issn = "1520-9210",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - JOUR
T1 - Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos
AU - Shi, Yuheng
AU - Wu, Xinxiao
AU - Lin, Hanxi
AU - Luo, Jiebo
N1 - Publisher Copyright: IEEE
PY - 2024
Y1 - 2024

N2 - Few-shot action recognition in videos is challenging as the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. The collection of text proposals is done by filling in a handcraft sentence template with an external action-related corpus or by extracting action-related phrases from captions of Web instruction videos. Next, we feed these text proposals to a pre-trained vision-language model along with video frames to generate matching scores of the proposals for each frame, and the scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of the existing methods. Code is available at https://github.com/OldStone0124/Knowledge-Prompting-for-FSAR.

AB - Few-shot action recognition in videos is challenging as the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. The collection of text proposals is done by filling in a handcraft sentence template with an external action-related corpus or by extracting action-related phrases from captions of Web instruction videos. Next, we feed these text proposals to a pre-trained vision-language model along with video frames to generate matching scores of the proposals for each frame, and the scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of the existing methods. Code is available at https://github.com/OldStone0124/Knowledge-Prompting-for-FSAR.

KW - Few-shot action recognition
KW - Proposals
KW - Semantics
KW - Task analysis
KW - Text recognition
KW - Training
KW - Videos
KW - Visualization
KW - action semantics
KW - knowledge prompting
KW - pre-trained vision-language model
UR - http://www.scopus.com/inward/record.url?scp=85184311732&partnerID=8YFLogxK
U2 - 10.1109/TMM.2024.3361157
DO - 10.1109/TMM.2024.3361157
M3 - Article
AN - SCOPUS:85184311732
SN - 1520-9210
SP - 1
EP - 11
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -
