Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

Shuo Yang*, Yongqi Wang*, Xiaofeng Ji, Xinxiao Wu

*Corresponding authors of this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

1 Citation (Scopus)

Abstract

Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models like CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP well to open-vocabulary video visual relationship detection by prompt-tuning on both visual representation and language input. Specifically, we enhance the image encoder of CLIP by using spatio-temporal visual prompting to capture spatio-temporal contexts, thereby making it suitable for object-level relationship representation in videos. Furthermore, we propose vision-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating the recognition of novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, especially achieving a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset.
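To make the language-prompting idea concrete, the following is a minimal, hypothetical NumPy sketch (not the authors' implementation): learnable prompt vectors are prepended to a relationship-category embedding before pooling into a text feature, and subject-object pair features are scored against all categories by cosine similarity, so that unseen relation names can be added to the category list at test time. The dimensions, the mean-pooling text encoder, and all variable names are illustrative assumptions standing in for CLIP's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512          # CLIP-style joint embedding size (assumption)
N_PROMPT_TOKENS = 4      # number of learnable language-prompt vectors (assumption)

# Learnable language prompts shared across relationship categories.
# In the paper these are vision-guided; here they are plain free
# parameters, purely for illustration.
lang_prompts = rng.normal(size=(N_PROMPT_TOKENS, EMBED_DIM)) * 0.02

def encode_relation_text(category_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for a prompted text encoder: prepend the prompt tokens to
    the category-name embedding, mean-pool, and L2-normalize."""
    tokens = np.vstack([lang_prompts, category_embedding[None, :]])
    text_feat = tokens.mean(axis=0)
    return text_feat / np.linalg.norm(text_feat)

def score_relationships(pair_feature: np.ndarray,
                        category_embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between one subject-object pair feature and every
    prompted relationship-category text feature; the category list may
    include relations never seen during training (open vocabulary)."""
    pair_feature = pair_feature / np.linalg.norm(pair_feature)
    text_feats = np.stack([encode_relation_text(c) for c in category_embeddings])
    return text_feats @ pair_feature

# Toy usage: 6 relationship categories, one pair feature.
cats = rng.normal(size=(6, EMBED_DIM))
pair = rng.normal(size=EMBED_DIM)
scores = score_relationships(pair, cats)
print(scores.shape)  # (6,) -- one similarity score per relationship category
```

In training, `lang_prompts` (and the visual prompts on the image-encoder side) would be the only tuned parameters, keeping CLIP's pre-trained knowledge frozen.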

Original language: English
Title of host publication: Technical Tracks 14
Editors: Michael Wooldridge, Jennifer Dy, Sriraam Natarajan
Publisher: Association for the Advancement of Artificial Intelligence
Pages: 6513-6521
Number of pages: 9
Edition: 7
ISBN (Electronic): 1577358872, 9781577358879
DOI: https://doi.org/10.1609/aaai.v38i7.28472
Publication status: Published - 25 Mar 2024
Event: 38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada
Duration: 20 Feb 2024 - 27 Feb 2024

Publication series

Name: Proceedings of the AAAI Conference on Artificial Intelligence
Number: 7
Volume: 38
ISSN (Print): 2159-5399
ISSN (Electronic): 2374-3468

Conference

Conference: 38th AAAI Conference on Artificial Intelligence, AAAI 2024
Country/Territory: Canada
City: Vancouver
Period: 20/02/24 - 27/02/24


Cite this

Yang, S., Wang, Y., Ji, X., & Wu, X. (2024). Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection. In M. Wooldridge, J. Dy, & S. Natarajan (Eds.), Technical Tracks 14 (7 ed., pp. 6513-6521). (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 38, No. 7). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i7.28472