Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment

Sidan Zhu, Dixin Luo*


Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Multi-modal contrastive learning has gained significant attention in recent years due to the rapid growth of multi-modal data and the increasing application demands in practice, e.g., multi-modal pre-training, retrieval, and classification. Most existing multi-modal representation learning methods require well-aligned multi-modal data (e.g., image-text pairs). This setting, however, limits their applications because real-world multi-modal data are often partially aligned, consisting of a small amount of well-aligned data and a massive amount of unaligned data. In this study, we propose a novel optimal transport-based method to enhance multi-modal contrastive learning given partially aligned multi-modal data, which provides an effective strategy to leverage the information hidden in the unaligned multi-modal data. The proposed method imposes an optimal transport (OT) regularizer in the multi-modal contrastive learning framework, aligning the latent representations of different modalities with consistency guarantees. We implement the OT regularizer in two ways, based on a consistency-regularized loop of pairwise Wasserstein distances and a Wasserstein barycenter problem, respectively. We analyze the rationality of our OT regularizer and compare its two implementations in depth. Experiments show that combining our OT regularizer with state-of-the-art contrastive learning methods leads to better performance on generalized zero-shot cross-modal retrieval and multi-modal classification tasks.

Title of host publication: Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
Editors: Zhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Publication status: Published - 2025
Event: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China
Duration: 18 Oct 2024 to 20 Oct 2024


Series name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15041 LNCS


Conference: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024




Zhu, S., & Luo, D. (2025). Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment. In Z. Lin, H. Zha, M.-M. Cheng, R. He, C.-L. Liu, K. Ubul, W. Silamu, & J. Zhou (Eds.), Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings (pp. 157-171). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 15041 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-97-8795-1_11