TY - GEN
T1 - METOR
T2 - 34th Internationa Joint Conference on Artificial Intelligence, IJCAI 2025
AU - Wang, Yongqi
AU - Wu, Xinxiao
AU - Yang, Shuo
N1 - Publisher Copyright:
© 2025 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternatively enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance. Codes are at https://github.com/wangyongqi558/METOR.
AB - Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternatively enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance. Codes are at https://github.com/wangyongqi558/METOR.
UR - https://www.scopus.com/pages/publications/105021837668
U2 - 10.24963/ijcai.2025/223
DO - 10.24963/ijcai.2025/223
M3 - Conference contribution
AN - SCOPUS:105021837668
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 2000
EP - 2008
BT - Proceedings of the 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
A2 - Kwok, James
PB - International Joint Conferences on Artificial Intelligence
Y2 - 16 August 2025 through 22 August 2025
ER -