TY - GEN
T1 - Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment
AU - Zhu, Sidan
AU - Luo, Dixin
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Multi-modal contrastive learning has gained significant attention in recent years due to the rapid growth of multi-modal data and increasing practical application demands, e.g., multi-modal pre-training, retrieval, and classification. Most existing multi-modal representation learning methods require well-aligned multi-modal data (e.g., image-text pairs). This setting, however, limits their applications because real-world multi-modal data are often only partially aligned, consisting of a small portion of well-aligned data and a massive amount of unaligned data. In this study, we propose a novel optimal transport-based method to enhance multi-modal contrastive learning given partially-aligned multi-modal data, which provides an effective strategy to leverage the information hidden in the unaligned multi-modal data. The proposed method imposes an optimal transport (OT) regularizer on the multi-modal contrastive learning framework, aligning the latent representations of different modalities with consistency guarantees. We implement the OT regularizer in two ways, based on a consistency-regularized loop of pairwise Wasserstein distances and a Wasserstein barycenter problem, respectively. We analyze the rationale of our OT regularizer and compare its two implementations in depth. Experiments show that combining our OT regularizer with state-of-the-art contrastive learning methods leads to better performance in generalized zero-shot cross-modal retrieval and multi-modal classification tasks.
AB - Multi-modal contrastive learning has gained significant attention in recent years due to the rapid growth of multi-modal data and increasing practical application demands, e.g., multi-modal pre-training, retrieval, and classification. Most existing multi-modal representation learning methods require well-aligned multi-modal data (e.g., image-text pairs). This setting, however, limits their applications because real-world multi-modal data are often only partially aligned, consisting of a small portion of well-aligned data and a massive amount of unaligned data. In this study, we propose a novel optimal transport-based method to enhance multi-modal contrastive learning given partially-aligned multi-modal data, which provides an effective strategy to leverage the information hidden in the unaligned multi-modal data. The proposed method imposes an optimal transport (OT) regularizer on the multi-modal contrastive learning framework, aligning the latent representations of different modalities with consistency guarantees. We implement the OT regularizer in two ways, based on a consistency-regularized loop of pairwise Wasserstein distances and a Wasserstein barycenter problem, respectively. We analyze the rationale of our OT regularizer and compare its two implementations in depth. Experiments show that combining our OT regularizer with state-of-the-art contrastive learning methods leads to better performance in generalized zero-shot cross-modal retrieval and multi-modal classification tasks.
KW - Generalized Zero-shot Learning
KW - Modality Alignment
KW - Multi-modal Contrastive Learning
KW - Optimal Transport
UR - http://www.scopus.com/inward/record.url?scp=85209400201&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-8795-1_11
DO - 10.1007/978-981-97-8795-1_11
M3 - Conference contribution
AN - SCOPUS:85209400201
SN - 9789819787944
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 157
EP - 171
BT - Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
A2 - Lin, Zhouchen
A2 - Zha, Hongbin
A2 - Cheng, Ming-Ming
A2 - He, Ran
A2 - Liu, Cheng-Lin
A2 - Ubul, Kurban
A2 - Silamu, Wushouer
A2 - Zhou, Jie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Y2 - 18 October 2024 through 20 October 2024
ER -