Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos

Yuheng Shi, Xinxiao Wu, Hanxi Lin, Jiebo Luo

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Few-shot action recognition in videos is challenging because the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. Text proposals are collected either by filling a handcrafted sentence template with entries from an external action-related corpus or by extracting action-related phrases from the captions of Web instruction videos. Next, we feed these text proposals, along with video frames, to a pre-trained vision-language model to generate matching scores of the proposals for each frame; these scores can be treated as action semantics with strong generalization ability. Finally, we design a lightweight temporal modeling network that captures the temporal evolution of the action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of that of existing methods. Code is available at https://github.com/OldStone0124/Knowledge-Prompting-for-FSAR.
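For concreteness, the pipeline described in the abstract can be sketched roughly as follows, assuming CLIP as the pre-trained vision-language model. The proposal texts and the temporal head (a single 1-D convolution) here are illustrative placeholders, not the authors' exact design; see the released code above for the actual implementation.

```python
# A minimal sketch of the knowledge-prompting pipeline, assuming OpenAI CLIP
# as the pre-trained vision-language model. Proposal texts, frame sampling,
# and the temporal head are hypothetical stand-ins for illustration only.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text proposals from the action knowledge base (hypothetical examples).
proposals = [
    "a person is chopping vegetables",
    "a person is swinging a golf club",
    "a person is pouring water into a cup",
]
text_tokens = clip.tokenize(proposals).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text_tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def frame_scores(frames):
    """Matching scores of all proposals for each frame.

    frames: (T, 3, 224, 224) tensor of preprocessed video frames.
    Returns: (T, P) score matrix, i.e. the per-frame 'action semantics'.
    """
    with torch.no_grad():
        img_feat = model.encode_image(frames.to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return img_feat @ text_feat.T  # cosine similarity, frame x proposal

class TemporalHead(nn.Module):
    """Lightweight temporal modeling over the score sequence."""
    def __init__(self, num_proposals, num_classes, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(num_proposals, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, scores):                  # scores: (B, T, P)
        x = self.conv(scores.transpose(1, 2))   # (B, hidden, T)
        x = x.relu().mean(dim=-1)               # temporal average pooling
        return self.fc(x)                       # (B, num_classes) logits
```

Only the small temporal head is trained in this sketch while CLIP stays frozen, which is consistent with the reported reduction of training cost to a fraction of that of existing methods.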

Original language: English
Pages (from-to): 1-11
Number of pages: 11
Journal: IEEE Transactions on Multimedia
DOI: 10.1109/TMM.2024.3361157
Publication status: Accepted/In press - 2024


Cite this

Shi, Y., Wu, X., Lin, H., & Luo, J. (Accepted/In press). Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos. IEEE Transactions on Multimedia, 1-11. https://doi.org/10.1109/TMM.2024.3361157