TY - GEN
T1 - Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection
AU - Yang, Shuo
AU - Wang, Yongqi
AU - Ji, Xiaofeng
AU - Wu, Xinxiao
N1 - Publisher Copyright:
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2024/3/25
Y1 - 2024/3/25
AB - Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models such as CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address these challenges, we propose a multi-modal prompting method that adapts CLIP to open-vocabulary video visual relationship detection by prompt-tuning both the visual representation and the language input. Specifically, we enhance the image encoder of CLIP with spatio-temporal visual prompting to capture spatio-temporal contexts, making it suitable for object-level relationship representation in videos. Furthermore, we propose vision-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating the recognition of novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, which notably achieves a gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset.
UR - http://www.scopus.com/inward/record.url?scp=85189560064&partnerID=8YFLogxK
DO - 10.1609/aaai.v38i7.28472
M3 - Conference contribution
AN - SCOPUS:85189560064
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 6513
EP - 6521
BT - Technical Tracks 14
A2 - Wooldridge, Michael
A2 - Dy, Jennifer
A2 - Natarajan, Sriraam
PB - Association for the Advancement of Artificial Intelligence
T2 - 38th AAAI Conference on Artificial Intelligence, AAAI 2024
Y2 - 20 February 2024 through 27 February 2024
ER -