Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

Shuo Yang*, Yongqi Wang*, Xiaofeng Ji, Xinxiao Wu

*Corresponding authors of this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

1 Citation (Scopus)

Abstract

Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models like CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP well to open-vocabulary video visual relationship detection by prompt-tuning on both visual representation and language input. Specifically, we enhance the image encoder of CLIP by using spatio-temporal visual prompting to capture spatio-temporal contexts, thereby making it suitable for object-level relationship representation in videos. Furthermore, we propose vision-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating the recognition of novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, especially achieving a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset.
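To make the language-prompting idea concrete, the following is a minimal, hypothetical NumPy sketch (not the authors' implementation): learnable prompt vectors are prepended to a relationship-category embedding before pooling into a text feature, and subject-object pair features are scored against all categories by cosine similarity, so that unseen relation names can be added to the category list at test time. The dimensions, the mean-pooling text encoder, and all variable names are illustrative assumptions standing in for CLIP's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512          # CLIP-style joint embedding size (assumption)
N_PROMPT_TOKENS = 4      # number of learnable language-prompt vectors (assumption)

# Learnable language prompts shared across relationship categories.
# In the paper these are vision-guided; here they are plain free
# parameters, purely for illustration.
lang_prompts = rng.normal(size=(N_PROMPT_TOKENS, EMBED_DIM)) * 0.02

def encode_relation_text(category_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for a prompted text encoder: prepend the prompt tokens to
    the category-name embedding, mean-pool, and L2-normalize."""
    tokens = np.vstack([lang_prompts, category_embedding[None, :]])
    text_feat = tokens.mean(axis=0)
    return text_feat / np.linalg.norm(text_feat)

def score_relationships(pair_feature: np.ndarray,
                        category_embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between one subject-object pair feature and every
    prompted relationship-category text feature; the category list may
    include relations never seen during training (open vocabulary)."""
    pair_feature = pair_feature / np.linalg.norm(pair_feature)
    text_feats = np.stack([encode_relation_text(c) for c in category_embeddings])
    return text_feats @ pair_feature

# Toy usage: 6 relationship categories, one pair feature.
cats = rng.normal(size=(6, EMBED_DIM))
pair = rng.normal(size=EMBED_DIM)
scores = score_relationships(pair, cats)
print(scores.shape)  # (6,) -- one similarity score per relationship category
```

In training, `lang_prompts` (and the visual prompts on the image-encoder side) would be the only tuned parameters, keeping CLIP's pre-trained knowledge frozen.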

Original language: English
Title of host publication: Technical Tracks 14
Editors: Michael Wooldridge, Jennifer Dy, Sriraam Natarajan
Publisher: Association for the Advancement of Artificial Intelligence
Pages: 6513-6521
Number of pages: 9
Edition: 7
ISBN (Electronic): 1577358872, 9781577358879
DOI: https://doi.org/10.1609/aaai.v38i7.28472
Publication status: Published - 25 Mar 2024
Event: 38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada
Duration: 20 Feb 2024 - 27 Feb 2024

Publication series

Name: Proceedings of the AAAI Conference on Artificial Intelligence
Number: 7
Volume: 38
ISSN (Print): 2159-5399
ISSN (Electronic): 2374-3468

Conference

Conference: 38th AAAI Conference on Artificial Intelligence, AAAI 2024
Country/Territory: Canada
City: Vancouver
Period: 20/02/24 - 27/02/24


Cite this

Yang, S., Wang, Y., Ji, X., & Wu, X. (2024). Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection. In M. Wooldridge, J. Dy, & S. Natarajan (Eds.), Technical Tracks 14 (7 ed., pp. 6513-6521). (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 38, No. 7). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i7.28472