基于 CLIP 生成多事件表示的视频文本检索方法

Translated title of the contribution: CLIP Based Multi-Event Representation Generation for Video-Text Retrieval

Rongcheng Tu, Xianling Mao*, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Heyan Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Video-text retrieval has been widely used in many real-world applications and has attracted increasing research attention. Recently, many methods have been proposed to leverage the visual-language matching knowledge of pre-trained models to further improve retrieval performance. However, these methods ignore the fact that video and text data are composed of events. If the fine-grained similarities between events in a video and events in a text can be captured well, more accurate semantic similarities between texts and videos can be computed, thereby improving retrieval performance. Hence, in this paper, we propose a CLIP-based multi-event representation generation method for video-text retrieval, called CLIPMERG. Specifically, CLIPMERG first utilizes the video encoder and text encoder of the pre-trained CLIP model to transform the video and text inputs into video frame token sequences and word token sequences, respectively. Next, CLIPMERG uses a video (text) event generator to map the video frame (text word) token sequence into k video (text) event representations. Finally, CLIPMERG calculates the semantic similarities between videos and texts by capturing the fine-grained similarities between video event representations and text event representations. Extensive experimental results on three widely used benchmark datasets, MSR-VTT, DiDeMo, and LSMDC, show that our proposed CLIPMERG achieves better performance than state-of-the-art baselines on video-text retrieval tasks.
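The pipeline described in the abstract (CLIP encoders → event generators → event-level matching) can be illustrated with a minimal sketch. The abstract does not specify the event generator architecture or how event-level similarities are aggregated, so the PyTorch code below is only an illustration under assumptions: the event generator is modeled as k learnable query vectors that cross-attend over the CLIP token sequence, and the video-text score aggregates pairwise event cosine similarities with a max-then-mean rule. The names EventGenerator and fine_grained_similarity are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventGenerator(nn.Module):
    """Maps a token sequence (CLIP frame tokens or word tokens) to k event representations.

    Sketch only: we assume k learnable event queries that cross-attend over the tokens.
    """
    def __init__(self, dim: int = 512, num_events: int = 4, num_heads: int = 8):
        super().__init__()
        self.event_queries = nn.Parameter(torch.randn(num_events, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim)
        queries = self.event_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        events, _ = self.cross_attn(queries, tokens, tokens)
        return self.norm(events)  # (batch, k, dim)


def fine_grained_similarity(video_events: torch.Tensor,
                            text_events: torch.Tensor) -> torch.Tensor:
    """Event-level similarity between every video and every text.

    video_events: (n_videos, k, dim); text_events: (n_texts, k, dim).
    Assumed aggregation: each text event is matched to its best video event,
    then scores are averaged over text events (one plausible choice; the
    paper may aggregate differently).
    """
    v = F.normalize(video_events, dim=-1)
    t = F.normalize(text_events, dim=-1)
    # pairwise event-to-event cosine similarities: (n_videos, n_texts, k_v, k_t)
    sim = torch.einsum('vkd,tld->vtkl', v, t)
    return sim.max(dim=2).values.mean(dim=-1)  # (n_videos, n_texts)


if __name__ == "__main__":
    frame_tokens = torch.randn(2, 12, 512)  # e.g. 12 CLIP frame tokens per video
    word_tokens = torch.randn(3, 20, 512)   # e.g. 20 CLIP word tokens per caption
    video_gen, text_gen = EventGenerator(), EventGenerator()
    sims = fine_grained_similarity(video_gen(frame_tokens), text_gen(word_tokens))
    print(sims.shape)  # torch.Size([2, 3])
```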

Original language: Chinese (Simplified)
Pages (from-to): 2169-2179
Number of pages: 11
Journal: Jisuanji Yanjiu yu Fazhan/Computer Research and Development
Volume: 60
Issue number: 9
Publication status: Published - 2023
