CLIP Based Multi-Event Representation Generation for Video-Text Retrieval (基于 CLIP 生成多事件表示的视频文本检索方法)

Rongcheng Tu, Xianling Mao*, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Heyan Huang

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

Video-text retrieval has been widely used in many real-world applications and has attracted more and more research attention. Recently, many works have been proposed that leverage the visual-language matching knowledge of pre-trained models to further improve retrieval performance. However, these methods ignore the fact that video and text data are composed of events. If the fine-grained similarities between the events in a video and the events in a text can be captured well, it will help to calculate more accurate semantic similarities between texts and videos and thus improve retrieval performance. Hence, in this paper, we propose a CLIP-based multi-event representation generation method for video-text retrieval, called CLIPMERG. Specifically, CLIPMERG first utilizes the video encoder and text encoder of the pre-trained model CLIP to transform the video and text inputs into video frame token sequences and word token sequences, respectively. Next, CLIPMERG uses a video (text) event generator to map the video frame (text word) token sequence into k video (text) event representations. Finally, CLIPMERG calculates the semantic similarities between videos and texts by capturing the fine-grained similarities between video event representations and text event representations. Extensive experimental results on three widely used benchmark datasets, MSR-VTT, DiDeMo and LSMDC, show that our proposed CLIPMERG achieves better performance than state-of-the-art baselines on video-text retrieval tasks.
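The abstract describes the pipeline but not the internals of the event generator or the similarity aggregation. The following is a minimal sketch of one plausible reading, not the paper's actual implementation: it assumes the event generator uses k learnable query vectors that cross-attend over the CLIP token sequence, and that the fine-grained similarity is computed as cosine similarities between all video-event/text-event pairs, max-pooled over video events and averaged over text events. All module names, hyperparameters, and the pooling scheme here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventGenerator(nn.Module):
    """Maps a token sequence (e.g., from CLIP's encoder) to k event representations.

    Hypothetical design: k learnable event queries cross-attend over the
    token sequence. The paper's actual architecture may differ.
    """

    def __init__(self, dim: int, k: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> events: (batch, k, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        events, _ = self.attn(q, tokens, tokens)
        return self.norm(events)


def fine_grained_similarity(video_events: torch.Tensor,
                            text_events: torch.Tensor) -> torch.Tensor:
    """Aggregates pairwise event similarities into one video-text score.

    Assumption: cosine similarity for every (video event, text event) pair,
    max-pooled over video events, then averaged over text events.
    """
    v = F.normalize(video_events, dim=-1)       # (B, k, d)
    t = F.normalize(text_events, dim=-1)        # (B, k, d)
    sim = torch.einsum('bkd,bld->bkl', v, t)    # all pairwise similarities
    return sim.max(dim=1).values.mean(dim=-1)   # (B,) one score per pair
```

Under these assumptions, a forward pass would encode frames and words with the frozen or fine-tuned CLIP encoders, apply one EventGenerator per modality, and train with a contrastive loss over the resulting similarity scores; the choice of max-then-mean pooling is only one of several aggregation schemes consistent with the abstract's "fine-grained similarities" description.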

Translated title of the contribution: CLIP Based Multi-Event Representation Generation for Video-Text Retrieval
Original language: Traditional Chinese
Pages (from-to): 2169-2179
Number of pages: 11
Journal: Jisuanji Yanjiu yu Fazhan/Computer Research and Development
Volume: 60
Issue number: 9
DOI
Publication status: Published - 2023

Keywords

  • CLIP model
  • Transformer model
  • event representation
  • pre-training model
  • video-text retrieval
