大语言模型引导的视频检索数据迭代优化

Translated title of the contribution: Iterative optimization for video retrieval data using large language model guidance

Runhao Zeng, Jialiang Li, Yishen Zhuo, Haihan Duan, Qi Chen*, Xiping Hu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Objective In recent years, video-text cross-modal retrieval has garnered widespread attention from academia and industry owing to its significant application value in areas such as video recommendation, public safety, sports analysis, and personalized advertising. This task primarily involves video retrieval (VR) and video moment retrieval (VMR), which aim to identify, from a video library or a specific video, the videos or video moments that are semantically most similar to a given query text. The inherent heterogeneity between video and text, which belong to different modalities, makes direct feature matching highly challenging. Thus, the key challenge in video-text cross-modal retrieval lies in effectively aligning the two modalities in the feature space to enable precise computation of semantic relevance. Current methods primarily focus on enhancing cross-modal semantic matching through cross-modal interactions on existing datasets to improve retrieval performance. Although modeling has seen significant progress, issues inherent to the datasets themselves remain largely unexplored. In the context of video-text cross-modal retrieval, this study observes an ill-posed problem when training with existing datasets: a single query text may correspond to multiple videos or video moments, leading to non-unique retrieval results. These one-to-many samples frequently confuse the model during training, hinder the alignment of cross-modal feature representations, and degrade overall performance. For instance, if a query text describes both the target video and a nontarget video, retrieving the latter during training is penalized as incorrect, artificially increasing the distance between the query text and the nontarget video in the feature space despite their high semantic relevance. This paper defines such problematic one-to-many samples as hard samples and one-to-one samples as easy samples. To address this issue, this paper proposes an iterative optimization method for VR data guided by a large language model. Leveraging the built-in knowledge of large language models, the method augments one-to-many video-text pairs with fine-grained information and iteratively refines them into one-to-one mappings.

Method Initially, the dataset is divided into an easy sample set and a hard sample set based on video-text similarity. Specifically, the similarity between each query text and all videos is calculated. If the similarity between the query text and its target video is not the highest, the data pair is assigned to the hard sample set; otherwise, it is assigned to the easy sample set. For videos in the hard sample set, several frames are uniformly sampled and fed into an image-to-text generation model to produce frame-level descriptive texts. This step aims to capture fine-grained information, such as objects not mentioned in the query text, detailed appearances, and color attributes in the video. However, because multiple frames may contain similar scenes and objects, the extracted fine-grained textual descriptions are often redundant and noisy. To address this, an iterative optimization module based on video-text semantic association is introduced. This module combines the original query text with the fine-grained information extracted from the target video, integrates them into a carefully designed prompt template, and feeds the prompt to a large language model, which then generates a refined, fine-grained, and unique query text.
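A minimal sketch of the sample-partition step is given below, assuming query-text and video embeddings come from a pretrained dual encoder; the function and variable names (split_easy_hard, text_feats, video_feats) are illustrative and not from the paper:

```python
import torch
import torch.nn.functional as F

def split_easy_hard(text_feats, video_feats, target_idx):
    """Partition query-video pairs into easy and hard sample sets.

    text_feats  : (N, D) tensor, one embedding per query text
    video_feats : (M, D) tensor, embeddings of all candidate videos
    target_idx  : length-N list, index of the ground-truth video per query

    A pair is "hard" (one-to-many) when its ground-truth video is not the
    most similar video to the query; otherwise it is "easy".
    """
    text_feats = F.normalize(text_feats, dim=-1)
    video_feats = F.normalize(video_feats, dim=-1)
    sims = text_feats @ video_feats.T        # (N, M) cosine similarities
    best = sims.argmax(dim=-1)               # most similar video per query

    easy, hard = [], []
    for i, gt in enumerate(target_idx):
        (easy if best[i].item() == gt else hard).append(i)
    return easy, hard
```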
The quality of the optimization results depends heavily on the design of the prompt templates. The templates include the following key elements: 1) a clear task description; 2) relevant examples that satisfy the specified conditions; and 3) specific requirements, such as extracting content that co-occurs across multiple frames during summarization. The emphasis on co-occurring content is justified for two reasons: first, such content often carries critical, essential information; second, summarizing shared elements reduces the likelihood of introducing erroneous descriptions. High-quality outputs from large language models typically result from multiple rounds of interaction with the user, because these models can refine their responses based on user feedback. Inspired by this, the study automates the optimization process without requiring a predefined number of interaction rounds. To further optimize the fine-grained query text, an iterative condition based on video-text semantic association is designed. Specifically, the optimized query text and the corresponding video are encoded, and if the similarity of the extracted features in the feature space satisfies a predefined condition, the optimized query text is deemed satisfactory and the optimization process terminates. Otherwise, the current optimization result is used to update the prompt, and the query text is refined further; this continues until no query text in the dataset retains a one-to-many issue. Finally, the optimized data are used to train the video-text cross-modal retrieval model.

Result The effectiveness of the proposed method was validated on multiple mainstream video-text cross-modal retrieval datasets. In the VMR task, four neural network models trained on the Charades-STA dataset optimized with the proposed method showed an average improvement of 2.42% in the R@1, IoU = 0.5 metric, with a maximum improvement of 3.23%. At IoU = 0.7, the performance improvement reached up to 4.38%. On the QVHighlights dataset, the performance of MomentDETR and QDDETR improved by 5.48% and 1.35%, respectively, with an average improvement of 3% at IoU = 0.7. In the VR task, two methods achieved an average improvement of 1.4% in the R@1 metric on the MSR-VTT dataset, with a maximum improvement of 1.6%. These results demonstrate the proposed method's effectiveness and its generalizability across different datasets.

Conclusion The proposed iterative optimization method for VR data guided by a large language model effectively alleviates the one-to-many issue in datasets. Optimizing a dataset once can improve the retrieval performance of multiple methods. This approach offers a novel perspective for video-text cross-modal retrieval research and promotes advances in related technologies.
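A minimal sketch of the iterative refinement loop described in the Method is shown below, assuming llm(prompt) returns a string and encoder.encode_text(text) returns an embedding; the prompt wording, the stopping check, and the max_rounds safety cap are illustrative assumptions rather than the paper's exact implementation:

```python
import torch.nn.functional as F

def refine_query(query, frame_captions, video_feats, target_idx,
                 llm, encoder, max_rounds=5):
    """Iteratively rewrite a one-to-many query into a unique, fine-grained one.

    llm(prompt) -> str and encoder.encode_text(text) -> (D,) tensor are
    assumed interfaces; the prompt text and max_rounds cap are illustrative.
    """
    prompt = (
        "Task: merge the original query with the frame-level captions into "
        "one concise, unique description of the target video. Summarize "
        "only content that co-occurs across frames.\n"
        f"Original query: {query}\n"
        f"Frame captions: {frame_captions}\n"
    )
    refined = query
    for _ in range(max_rounds):                       # safety cap (assumption)
        refined = llm(prompt)
        text_feat = F.normalize(encoder.encode_text(refined), dim=-1)
        sims = text_feat @ F.normalize(video_feats, dim=-1).T
        if sims.argmax().item() == target_idx:        # target ranked first: stop
            break
        # condition not met: feed the current result back and refine again
        prompt += f"\nPrevious attempt (still ambiguous): {refined}\nRefine it further.\n"
    return refined
```

In this sketch the loop stops once the refined query ranks its target video first, which is one way to realize the "no longer one-to-many" condition; a similarity-threshold check on the query-video pair would be an equally plausible instantiation.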

Translated title of the contribution: Iterative optimization for video retrieval data using large language model guidance
Original language: Chinese (Traditional)
Pages (from-to): 1257-1271
Number of pages: 15
Journal: Journal of Image and Graphics
Volume: 30
Issue number: 5
Publication status: Published - May 2025
Externally published: Yes
