TY - JOUR
T1 - Video-Text Retrieval Method Based on CLIP-Generated Multi-Event Representations
AU - Tu, Rongcheng
AU - Mao, Xianling
AU - Kong, Weijie
AU - Cai, Chengfei
AU - Zhao, Wenzhe
AU - Wang, Hongfa
AU - Huang, Heyan
N1 - Publisher Copyright:
© 2023 Science Press. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Video-text retrieval has been widely used in many real-world applications and has attracted more and more research attention. Recently, many works have been proposed that leverage the visual-language matching knowledge of pre-trained models to further improve retrieval performance. However, these methods ignore the fact that video and text data are composed of events. If the fine-grained similarities between events in a video and events in a text can be captured well, more accurate semantic similarities between texts and videos can be computed, which in turn improves retrieval performance. Hence, in this paper, we propose a CLIP-based multi-event representation generation method for video-text retrieval, called CLIPMERG. Specifically, CLIPMERG first utilizes the video encoder and text encoder of the pre-trained model CLIP to transform the video and text inputs into video frame token sequences and word token sequences, respectively. Next, CLIPMERG uses a video (text) event generator to map the video frame (text word) token sequence into k video (text) event representations. Finally, CLIPMERG calculates the semantic similarities between videos and texts by capturing the fine-grained similarities between video event representations and text event representations. Extensive experimental results on three widely used benchmark datasets, MSR-VTT, DiDeMo, and LSMDC, show that the proposed CLIPMERG achieves better performance than state-of-the-art baselines on video-text retrieval tasks.
AB - Video-text retrieval has been widely used in many real-world applications and has attracted more and more research attention. Recently, many works have been proposed that leverage the visual-language matching knowledge of pre-trained models to further improve retrieval performance. However, these methods ignore the fact that video and text data are composed of events. If the fine-grained similarities between events in a video and events in a text can be captured well, more accurate semantic similarities between texts and videos can be computed, which in turn improves retrieval performance. Hence, in this paper, we propose a CLIP-based multi-event representation generation method for video-text retrieval, called CLIPMERG. Specifically, CLIPMERG first utilizes the video encoder and text encoder of the pre-trained model CLIP to transform the video and text inputs into video frame token sequences and word token sequences, respectively. Next, CLIPMERG uses a video (text) event generator to map the video frame (text word) token sequence into k video (text) event representations. Finally, CLIPMERG calculates the semantic similarities between videos and texts by capturing the fine-grained similarities between video event representations and text event representations. Extensive experimental results on three widely used benchmark datasets, MSR-VTT, DiDeMo, and LSMDC, show that the proposed CLIPMERG achieves better performance than state-of-the-art baselines on video-text retrieval tasks.
KW - CLIP model
KW - Transformer model
KW - event representation
KW - pre-training model
KW - video-text retrieval
UR - http://www.scopus.com/inward/record.url?scp=85172727858&partnerID=8YFLogxK
U2 - 10.7544/issn1000-1239.202220440
DO - 10.7544/issn1000-1239.202220440
M3 - Article
AN - SCOPUS:85172727858
SN - 1000-1239
VL - 60
SP - 2169
EP - 2179
JO - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
JF - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
IS - 9
ER -
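
The abstract describes a pipeline in which CLIP frame/word token sequences are mapped to k event representations and compared at the event level. Below is a minimal, hypothetical sketch of that idea, assuming the event generator is a cross-attention module with k learnable event queries and that the fine-grained similarity is a max-over-video-events, mean-over-text-events aggregation; neither choice is confirmed by this record, and the actual CLIPMERG design may differ.

```python
# Hedged sketch of a multi-event representation generator and a
# fine-grained video-text similarity, loosely following the abstract.
# Assumptions (not stated in the record): learnable event queries with
# cross-attention, and max/mean aggregation over event-level similarities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventGenerator(nn.Module):
    """Maps a token sequence (CLIP frame or word tokens) to k event representations."""

    def __init__(self, dim: int, k: int, num_heads: int = 8):
        super().__init__()
        self.event_queries = nn.Parameter(torch.randn(k, dim))  # k learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> events: (batch, k, dim), L2-normalized
        batch = tokens.size(0)
        queries = self.event_queries.unsqueeze(0).expand(batch, -1, -1)
        events, _ = self.cross_attn(queries, tokens, tokens)
        return F.normalize(events, dim=-1)


def multi_event_similarity(video_events: torch.Tensor,
                           text_events: torch.Tensor) -> torch.Tensor:
    """Fine-grained similarity between sets of video and text events.

    video_events: (num_videos, k, dim); text_events: (num_texts, k, dim).
    For each text event, take its best-matching video event, then average
    over text events (one plausible aggregation, assumed here).
    """
    # Pairwise event-level cosine similarities: (num_videos, num_texts, k_v, k_t)
    sim = torch.einsum('vkd,tld->vtkl', video_events, text_events)
    # Max over video events, mean over text events -> (num_videos, num_texts)
    return sim.max(dim=2).values.mean(dim=2)


if __name__ == "__main__":
    dim, k = 512, 4
    video_gen, text_gen = EventGenerator(dim, k), EventGenerator(dim, k)
    frame_tokens = torch.randn(2, 12, dim)   # e.g. 12 CLIP frame tokens per video
    word_tokens = torch.randn(3, 20, dim)    # e.g. 20 CLIP word tokens per caption
    scores = multi_event_similarity(video_gen(frame_tokens), text_gen(word_tokens))
    print(scores.shape)  # torch.Size([2, 3])
```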