TY - GEN
T1 - External Memory Matters
T2 - 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
AU - Dang, Jisheng
AU - Zheng, Huicheng
AU - Wu, Xudong
AU - Jiao, Jingmei
AU - Wang, Bimei
AU - Yang, Jun
AU - Hu, Bin
AU - Lai, Jianhuang
AU - Chua, Tat Seng
N1 - Publisher Copyright:
© 2025 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Long video understanding with Large Language Models (LLMs) enables the description of objects that are not explicitly present in the training data. However, continuous changes in known objects and the emergence of new ones require up-to-date knowledge of objects and their dynamics for effective understanding of the open world. To alleviate this, we propose an efficient Retrieval-Enhanced Video Understanding method, dubbed REVU, which leverages external knowledge to enhance the performance of open-world learning. First, REVU introduces an extensible external text-object memory with minimal text-visual mapping, involving static and dynamic multimodal information to help LLM-based models align text and vision features. Second, REVU retrieves object information from external databases and dynamically integrates frame-specific data from videos, enabling effective knowledge aggregation to comprehend the open world. We conducted experiments on multiple benchmark datasets, and our model demonstrates strong adaptability to out-of-domain data without requiring additional fine-tuning or retraining. Experiments on benchmark video understanding datasets reveal that our model achieves state-of-the-art performance and robust generalization.
UR - https://www.scopus.com/pages/publications/105021805627
U2 - 10.24963/ijcai.2025/97
DO - 10.24963/ijcai.2025/97
M3 - Conference contribution
AN - SCOPUS:105021805627
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 864
EP - 872
BT - Proceedings of the 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
A2 - Kwok, James
PB - International Joint Conferences on Artificial Intelligence
Y2 - 16 August 2025 through 22 August 2025
ER -