Frame-Wise Multimodal Retrieval in Video Corpus with Contrastive Learning

  • Bo Lu
  • Guiyuan Liang*
  • Tianbao Zhao
  • Xiaoyuan Liang
  • Ye Yuan
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The rise of vast, unedited video content has made it crucial to accurately find clips that match text queries. Existing methods typically create candidate clips to match against text queries using sliding windows or uniform sampling. However, these methods usually lead to inefficient and inaccurate retrieval because they fail to capture the interaction between query and video content at a fine-grained level, and they scale poorly to large video datasets. To address these issues, we propose a novel Frame-Wise Multimodal Retrieval framework with contrastive learning for video moment retrieval (FCVR). First, FCVR encodes text and video independently, each with its own unimodal encoder. Second, a frame-level contrastive learning module and a video-level contrastive learning module are designed to further improve the efficiency and precision of video moment retrieval. Specifically, an internal-frame prediction module evaluates the similarity between frames using a frame similarity score module, which significantly enhances the ability to locate video content related to text queries via fine-grained frame-level analysis. Extensive experiments demonstrate the superiority of FCVR over several state-of-the-art methods in terms of both accuracy and retrieval efficiency.
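
As a rough illustration of the frame-wise idea described in the abstract, the sketch below scores each frame against a query and applies a contrastive objective. It is not the authors' implementation: the abstract does not specify the loss or the scoring function, so cosine similarity, an InfoNCE-style loss, the temperature value, the threshold, and all function names (`frame_scores`, `info_nce`, `best_moment`) are assumptions chosen for illustration, and the embeddings are toy vectors rather than encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def frame_scores(query, frames):
    """Score every frame embedding against one query embedding (assumed: cosine)."""
    return [cosine(query, f) for f in frames]

def info_nce(scores, positive_idx, temperature=0.07):
    """InfoNCE-style loss (assumed): negative log-softmax of the positive frame."""
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_idx]

def best_moment(scores, threshold=0.5):
    """Return (start, end), inclusive, of the longest run of frames at or above threshold."""
    best, start = None, None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best

# Toy example: frames 1 and 2 point in roughly the same direction as the query.
query = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
scores = frame_scores(query, frames)
moment = best_moment(scores)
```

In this sketch the loss is small when the designated positive frame is the one most similar to the query and large otherwise, which is the general behaviour a frame-level contrastive objective is meant to encourage; scoring frames directly avoids enumerating sliding-window clips.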

Original language: English
Title of host publication: Web and Big Data - 9th International Joint Conference, APWeb-WAIM 2025, Proceedings
Editors: Jiajia Li, Richard Chbeir, Lei Li, Chuanyu Zong, Yanfeng Zhang, Mengxuan Zhang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 194-203
Number of pages: 10
ISBN (Print): 9789819557219
DOIs
Publication status: Published - 2026
Externally published: Yes
Event: 9th Asia-Pacific Web and Web-Age Information Management Joint International Conference on Web and Big Data, APWeb-WAIM 2025 - Shenyang, China
Duration: 28 Aug 2025 to 30 Aug 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 16116 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th Asia-Pacific Web and Web-Age Information Management Joint International Conference on Web and Big Data, APWeb-WAIM 2025
Country/Territory: China
City: Shenyang
Period: 28/08/25 to 30/08/25

Keywords

  • Frame-wise Matching
  • Moment Localization
  • Multimodal Retrieval
  • Video Corpus Moment Retrieval
  • Video Moment
