CLIP Based Multi-Event Representation Generation for Video-Text Retrieval (基于 CLIP 生成多事件表示的视频文本检索方法)

Rongcheng Tu, Xianling Mao*, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Heyan Huang

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

Video-text retrieval has been widely used in many real-world applications and has attracted more and more research attention. Recently, many works have been proposed that leverage the visual-language matching knowledge of pre-trained models to further improve retrieval performance. However, these methods ignore the fact that video and text data are composed of events. If the fine-grained similarities between the events in a video and the events in a text can be captured well, it will help to calculate more accurate semantic similarities between texts and videos and thus improve retrieval performance. Hence, in this paper, we propose a CLIP-based multi-event representation generation method for video-text retrieval, called CLIPMERG. Specifically, CLIPMERG first utilizes the video encoder and text encoder of the pre-trained model CLIP to transform the video and text inputs into video frame token sequences and word token sequences, respectively. Next, CLIPMERG uses a video (text) event generator to map the video frame (text word) token sequence into k video (text) event representations. Finally, CLIPMERG calculates the semantic similarities between videos and texts by capturing the fine-grained similarities between video event representations and text event representations. Extensive experimental results on three widely used benchmark datasets, MSR-VTT, DiDeMo and LSMDC, show that our proposed CLIPMERG achieves better performance than state-of-the-art baselines on video-text retrieval tasks.
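The abstract describes the pipeline but not the internals of the event generator or the similarity aggregation. The following is a minimal sketch of one plausible reading, not the paper's actual implementation: it assumes the event generator uses k learnable query vectors that cross-attend over the CLIP token sequence, and that the fine-grained similarity is computed as cosine similarities between all video-event/text-event pairs, max-pooled over video events and averaged over text events. All module names, hyperparameters, and the pooling scheme here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventGenerator(nn.Module):
    """Maps a token sequence (e.g., from CLIP's encoder) to k event representations.

    Hypothetical design: k learnable event queries cross-attend over the
    token sequence. The paper's actual architecture may differ.
    """

    def __init__(self, dim: int, k: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> events: (batch, k, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        events, _ = self.attn(q, tokens, tokens)
        return self.norm(events)


def fine_grained_similarity(video_events: torch.Tensor,
                            text_events: torch.Tensor) -> torch.Tensor:
    """Aggregates pairwise event similarities into one video-text score.

    Assumption: cosine similarity for every (video event, text event) pair,
    max-pooled over video events, then averaged over text events.
    """
    v = F.normalize(video_events, dim=-1)       # (B, k, d)
    t = F.normalize(text_events, dim=-1)        # (B, k, d)
    sim = torch.einsum('bkd,bld->bkl', v, t)    # all pairwise similarities
    return sim.max(dim=1).values.mean(dim=-1)   # (B,) one score per pair
```

Under these assumptions, a forward pass would encode frames and words with the frozen or fine-tuned CLIP encoders, apply one EventGenerator per modality, and train with a contrastive loss over the resulting similarity scores; the choice of max-then-mean pooling is only one of several aggregation schemes consistent with the abstract's "fine-grained similarities" description.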

Translated title of the contribution: CLIP Based Multi-Event Representation Generation for Video-Text Retrieval
Original language: Traditional Chinese
Pages (from-to): 2169-2179
Number of pages: 11
Journal: Jisuanji Yanjiu yu Fazhan/Computer Research and Development
Volume: 60
Issue number: 9
DOI
Publication status: Published - 2023

Keywords

  • CLIP model
  • Transformer model
  • event representation
  • pre-training model
  • video-text retrieval
