Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment

Sidan Zhu, Dixin Luo*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Multi-modal contrastive learning has gained significant attention in recent years due to the rapid growth of multi-modal data and the increasing application demands in practice, e.g., multi-modal pre-training, retrieval, and classification. Most existing multi-modal representation learning methods require well-aligned multi-modal data (e.g., image-text pairs). This setting, however, limits their applications because real-world multi-modal data are often partially aligned, consisting of a small portion of well-aligned data and a massive amount of unaligned data. In this study, we propose a novel optimal transport-based method to enhance multi-modal contrastive learning given partially aligned multi-modal data, which provides an effective strategy to leverage the information hidden in the unaligned multi-modal data. The proposed method imposes an optimal transport (OT) regularizer in the multi-modal contrastive learning framework, aligning the latent representations of different modalities with consistency guarantees. We implement the OT regularizer in two ways, based on a consistency-regularized loop of pairwise Wasserstein distances and a Wasserstein barycenter problem, respectively. We analyze the rationality of our OT regularizer and compare its two implementations in depth. Experiments show that combining our OT regularizer with state-of-the-art contrastive learning methods leads to better performance on the generalized zero-shot cross-modal retrieval and multi-modal classification tasks.
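
To make the idea of an OT regularizer concrete, the following is a minimal sketch of an entropic optimal transport cost between two batches of unaligned embeddings, computed with Sinkhorn iterations. It is not the authors' implementation: the abstract does not give the formulation, and neither the consistency-regularized loop of pairwise Wasserstein distances nor the Wasserstein barycenter variant is reproduced here. The function names (cosine_cost, sinkhorn_plan, ot_regularizer), the cosine cost, the uniform marginals, and the values of eps and n_iters are all assumptions chosen for illustration.

```python
import math


def cosine_cost(xs, ys):
    """Pairwise cost 1 - cos(x, y) between two lists of vectors (assumed cost)."""
    def unit(v):
        norm = math.sqrt(sum(a * a for a in v)) or 1.0
        return [a / norm for a in v]

    xs = [unit(x) for x in xs]
    ys = [unit(y) for y in ys]
    return [[1.0 - sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on the first modality's samples
    b = [1.0 / m] * m  # uniform mass on the second modality's samples
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_regularizer(xs, ys, eps=0.1, n_iters=200):
    """Transport cost <plan, cost>; smaller when the two batches are aligned."""
    cost = cosine_cost(xs, ys)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))


# Toy embeddings standing in for two modalities (hypothetical data).
image_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_emb_close = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.1]]
text_emb_far = [[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]]

reg_close = ot_regularizer(image_emb, text_emb_close)
reg_far = ot_regularizer(image_emb, text_emb_far)
print(reg_close < reg_far)
```

In a training loop, a term of this kind would typically be added to the contrastive loss with a weighting coefficient and evaluated on the unaligned samples, with the embeddings produced by the modality encoders in an automatic differentiation framework; how the paper weights and combines the terms is not stated in the abstract.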

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
Editors: Zhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 157-171
Number of pages: 15
ISBN (Print): 9789819787944
DOI
Publication status: Published - 2025
Event: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China
Duration: 18 Oct 2024 to 20 Oct 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15041 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Country/Territory: China
City: Urumqi
Period: 18/10/24 to 20/10/24
