Improving Video Moment Retrieval by Auxiliary Moment-Query Pairs With Hyper-Interaction

Runhao Zeng, Yishen Zhuo, Jialiang Li, Yunjin Yang, Huisi Wu, Qi Chen*, Xiping Hu*, Victor C.M. Leung

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Most existing video moment retrieval (VMR) benchmark datasets face a common issue of sparse annotations: only a few moments in each video are annotated. We argue that videos contain a broader range of meaningful moments that, if leveraged, could significantly enhance performance. Existing methods typically follow a generate-then-select paradigm, focusing primarily on generating moment-query pairs while neglecting the crucial aspect of selection. In this paper, we propose a new method, HyperAux, to yield auxiliary moment-query pairs by modeling the multi-modal hyper-interaction between video and language. Specifically, given a set of candidate moment-query pairs from a video, we construct a hypergraph with multiple hyperedges, each corresponding to a moment-query pair. Unlike a traditional graph, in which each edge connects only two nodes (frames or queries), each hyperedge connects multiple nodes: all frames within a moment, semantically related frames outside the moment, and the input query. This design allows us to treat the frames within a moment as a whole, rather than modeling individual frame-query relationships separately. More importantly, encoding the relationships among all moment-query pairs within a video in a single large hypergraph facilitates selecting higher-quality pairs. On this hypergraph, we employ a hypergraph neural network (HGNN) to aggregate node information, update the hyperedges, and propagate video-language hyper-interactions to each connected node, resulting in context-aware node representations. This enables us to use node relevance to select high-quality moment-query pairs and refine the moments' boundaries. We also exploit the discrepancy in semantic matching within and outside moments to construct a loss function for training the HGNN without human annotations. Our auxiliary data enhances the performance of twelve VMR models under fully-supervised, weakly-supervised, and zero-shot settings across three widely used VMR datasets: ActivityNet Captions, Charades-STA, and QVHighlights. We will release the source code and models publicly.
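To make the hypergraph construction and aggregation described in the abstract concrete, the sketch below builds a node-hyperedge incidence matrix for candidate moment-query pairs and runs one degree-normalised aggregation step before scoring pair relevance. It is a minimal illustration only: the function names (build_incidence, hgnn_layer, score_pairs), the similarity threshold, and the feature layout are assumptions for exposition and do not reflect the authors' actual implementation.

```python
import numpy as np

def build_incidence(frame_feats, query_feats, pairs, sim_thresh=0.7):
    """Node-hyperedge incidence matrix for candidate moment-query pairs.

    Nodes are all frames followed by all queries; each hyperedge connects
    the frames inside a candidate moment, frames outside it whose cosine
    similarity to the paired query exceeds sim_thresh, and the query node.
    """
    T, Q, E = len(frame_feats), len(query_feats), len(pairs)
    H = np.zeros((T + Q, E))
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ q.T                                   # (T, Q) frame-query similarity
    for e, (start, end, q_idx) in enumerate(pairs):
        H[start:end + 1, e] = 1.0                   # frames inside the moment
        outside = np.ones(T, dtype=bool)
        outside[start:end + 1] = False              # semantically related frames outside
        H[np.where(outside & (sim[:, q_idx] > sim_thresh))[0], e] = 1.0
        H[T + q_idx, e] = 1.0                       # the paired query node
    return H

def hgnn_layer(X, H):
    """One node -> hyperedge -> node aggregation step with mean pooling."""
    edge_repr = (H.T @ X) / (H.sum(axis=0)[:, None] + 1e-8)   # hyperedge features
    return (H @ edge_repr) / (H.sum(axis=1)[:, None] + 1e-8)  # context-aware nodes

def score_pairs(X, pairs, T):
    """Cosine relevance between each updated query node and its moment frames."""
    scores = []
    for start, end, q_idx in pairs:
        m, qv = X[start:end + 1].mean(axis=0), X[T + q_idx]
        scores.append(float(m @ qv / (np.linalg.norm(m) * np.linalg.norm(qv) + 1e-8)))
    return scores

# Toy usage: 2 candidate pairs over 8 frames and 2 queries with random features.
rng = np.random.default_rng(0)
frames, queries = rng.normal(size=(8, 16)), rng.normal(size=(2, 16))
pairs = [(1, 3, 0), (4, 6, 1)]                      # (start_frame, end_frame, query_idx)
H = build_incidence(frames, queries, pairs)
X = hgnn_layer(np.vstack([frames, queries]), H)
print(score_pairs(X, pairs, T=len(frames)))         # relevance scores used for selection
```

Under these assumptions, the relevance scores would drive the generate-then-select step: higher-scoring pairs are kept as auxiliary training data, while boundary refinement and the annotation-free loss from the paper are not modelled here.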

Original language: English
Pages (from-to): 3940-3954
Number of pages: 15
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 5
DOIs
Publication status: Published - 2025
Externally published: Yes

Keywords

  • annotation generation
  • auxiliary moment-query pairs
  • hypergraph neural network
  • video moment retrieval
