TY - GEN
T1 - Frame-Wise Multimodal Retrieval in Video Corpus with Contrastive Learning
AU - Lu, Bo
AU - Liang, Guiyuan
AU - Zhao, Tianbao
AU - Liang, Xiaoyuan
AU - Yuan, Ye
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2026.
PY - 2026
Y1 - 2026
N2 - The rise of vast, unedited video content has made it crucial to find clips that accurately match text queries. Existing methods typically generate candidate clips to match text queries using sliding windows or uniform sampling. However, these methods are often inefficient and inaccurate because they fail to capture the interaction between query and video content at a fine-grained level, and they scale poorly to large video datasets. To address these issues, we propose a novel Frame-Wise Multimodal Retrieval framework with contrastive learning for video moment retrieval (FCVR). First, FCVR encodes text and video independently, each with its own unimodal encoder. Second, a frame-level contrastive learning module and a video-level contrastive learning module are designed to further improve the efficiency and precision of video moment retrieval. Specifically, an internal-frame prediction module evaluates the similarity between frames using a frame similarity score module, which significantly enhances the ability to locate video content related to text queries through fine-grained frame-level analysis. Extensive experiments demonstrate the superiority of FCVR over several state-of-the-art methods in terms of both accuracy and retrieval efficiency.
KW - Frame-wise Matching
KW - Moment Localization
KW - Multimodal Retrieval
KW - Video Corpus Moment Retrieval
KW - Video Moment
UR - https://www.scopus.com/pages/publications/105029846893
U2 - 10.1007/978-981-95-5722-6_16
DO - 10.1007/978-981-95-5722-6_16
M3 - Conference contribution
AN - SCOPUS:105029846893
SN - 9789819557219
T3 - Lecture Notes in Computer Science
SP - 194
EP - 203
BT - Web and Big Data - 9th International Joint Conference, APWeb-WAIM 2025, Proceedings
A2 - Li, Jiajia
A2 - Chbeir, Richard
A2 - Li, Lei
A2 - Zong, Chuanyu
A2 - Zhang, Yanfeng
A2 - Zhang, Mengxuan
PB - Springer Science and Business Media Deutschland GmbH
T2 - 9th Asia-Pacific Web and Web-Age Information Management Joint International Conference on Web and Big Data, APWeb-WAIM 2025
Y2 - 28 August 2025 through 30 August 2025
ER -