Abstract
Multi-modal contrastive learning has gained significant attention in recent years, driven by the rapid growth of multi-modal data and increasing practical demand in applications such as multi-modal pre-training, retrieval, and classification. Most existing multi-modal representation learning methods require well-aligned multi-modal data (e.g., image-text pairs). This requirement limits their applicability, because real-world multi-modal data are often only partially aligned, consisting of a small set of well-aligned samples and a massive amount of unaligned ones. In this study, we propose a novel optimal transport-based method that enhances multi-modal contrastive learning on partially aligned multi-modal data, providing an effective strategy for leveraging the information hidden in the unaligned data. The proposed method imposes an optimal transport (OT) regularizer on the multi-modal contrastive learning framework, aligning the latent representations of different modalities with consistency guarantees. We implement the OT regularizer in two ways: one based on a consistency-regularized loop of pairwise Wasserstein distances, and the other based on a Wasserstein barycenter problem. We analyze the rationality of our OT regularizer and compare its two implementations in depth. Experiments show that combining our OT regularizer with state-of-the-art contrastive learning methods improves performance on generalized zero-shot cross-modal retrieval and multi-modal classification tasks.
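To make the basic building block concrete, the following is a minimal sketch of an entropic OT cost between two unpaired batches of latent codes, computed with Sinkhorn iterations. It is an illustration under stated assumptions, not the implementation used in this paper: the cosine cost, the uniform sample weights, the entropic regularization weight `eps`, and the names `sinkhorn_ot`, `image_latents`, and `text_latents` are all hypothetical choices. The consistency regularization over a loop of modalities and the Wasserstein barycenter variant are not shown.

```python
import numpy as np


def normalize(z):
    """L2-normalize each row, as is common for contrastive latent codes."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def sinkhorn_ot(x, y, eps=0.2, n_iters=500):
    """Entropic OT between two unpaired batches of latent codes.

    Assumptions (not taken from the paper): cosine cost, uniform sample
    weights, and a fixed number of Sinkhorn iterations.
    Returns the transport cost <plan, cost> and the transport plan.
    """
    x, y = normalize(x), normalize(y)
    # Cosine cost in [0, 2]; clip tiny negative values from rounding.
    cost = np.maximum(1.0 - x @ y.T, 0.0)
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform weights on the first batch
    b = np.full(m, 1.0 / m)  # uniform weights on the second batch
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost)), plan


# Two unaligned batches with different sizes: no pairing is required.
rng = np.random.default_rng(0)
image_latents = rng.normal(size=(32, 16))
text_latents = rng.normal(size=(48, 16))
ot_penalty, plan = sinkhorn_ot(image_latents, text_latents)
```

Because the cost depends only on the two sets of latent codes and not on any pairing between them, a term of this kind can be evaluated on unaligned samples. In a training setup it would typically be written in an automatic differentiation framework and added to the contrastive loss with a weighting coefficient.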
Supported in part by the National Natural Science Foundation of China (62102031), and the foundation of Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, P.R. China (AI202409).
Notes
1. To simplify the analysis, we ignore the \(\max (\cdot ,\cdot )\) operation in the original triplet loss here.
2. TCAF uses temporal features of the three datasets, while the other baselines use average features.
References
Agueh, M., Carlier, G.: Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43(2), 904–924 (2011)
Chen, L., Zhang, Y., Zhang, R., Tao, C., Gan, Z., Zhang, H., Li, B., Shen, D., Chen, C., Carin, L.: Improving sequence-to-sequence learning via optimal transport. arXiv preprint arXiv:1901.06283 (2019)
Chen, Z., Huang, Y., Chen, J., Geng, Y., Zhang, W., Fang, Y., Pan, J.Z., Song, W., Chen, H.: DUET: Cross-modal semantic grounding for contrastive zero-shot learning. arXiv preprint arXiv:2207.01328 (2022)
Chuang, C.Y., Robinson, J., Lin, Y.C., Torralba, A., Jegelka, S.: Debiased contrastive learning. Adv. Neural. Inf. Process. Syst. 33, 8765–8775 (2020)
Gao, J., Li, P., Laghari, A.A., Srivastava, G., Gadekallu, T.R., Abbas, S., Zhang, J.: Incomplete multiview clustering via semidiscrete optimal transport for multimedia data mining in IoT. ACM Trans. Multimedia Comput. Commun. Appl. (2023)
Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: ImageBind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190 (2023)
Gong, F., Nie, Y., Xu, H.: Gromov-Wasserstein multi-modal alignment and clustering. In: Proceedings of the 31st ACM International Conference on Information and Knowledge Management, pp. 603–613 (2022)
Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. JMLR Workshop and Conference Proceedings (2010)
Katageri, S., De, A., Devaguptapu, C., Prasad, V., Sharma, C., Kaul, M.: Synergizing contrastive learning and optimal transport for 3D point cloud domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2942–2951 (2024)
Li, Q., Hou, M., Lai, H., Yang, M.: Cross-modal distribution alignment embedding network for generalized zero-shot learning. Neural Netw. 148, 176–182 (2022)
Li, W., Ma, Z., Deng, L.J., Man, H., Fan, X.: Modality-fusion spiking transformer network for audio-visual zero-shot learning. In: 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 426–431. IEEE (2023)
Li, Y., Zhu, Q., He, H., Gu, Z., Zheng, C.: Moc: Multi-modal sentiment analysis via optimal transport and contrastive interactions. In: International Conference on Neural Information Processing, pp. 439–451. Springer (2023)
Luo, D., Wang, Y., Yue, A., Xu, H.: Weakly-supervised temporal action alignment driven by unbalanced spectral fused Gromov-Wasserstein distance. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 728–739 (2022)
Luo, D., Xu, H., Carin, L.: Differentiable hierarchical optimal transport for robust multi-view learning. IEEE Trans. Pattern Anal. Mach. Intell. (2022)
Mazumder, P., Singh, P., Parida, K.K., Namboodiri, V.P.: AVGZSLNet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3090–3099 (2021)
Mercea, O.B., Hummel, T., Koepke, A.S., Akata, Z.: Temporal and cross-modal attention for audio-visual zero-shot learning. In: European Conference on Computer Vision, pp. 488–505. Springer (2022)
Mercea, O.B., Riesch, L., Koepke, A., Akata, Z.: Audio-visual generalised zero-shot learning with cross-modal attention and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10553–10563 (2022)
Parida, K., Matiyali, N., Guha, T., Sharma, G.: Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3251–3260 (2020)
Peyré, G., Cuturi, M., et al.: Computational optimal transport: With applications to data science. Found. Trends® Mach. Learn. 11(5-6), 355–607 (2019)
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Robinson, J., Chuang, C.Y., Sra, S., Jegelka, S.: Contrastive learning with hard negative samples. In: International Conference on Learning Representations (2021)
Villani, C., et al.: Optimal Transport: Old and New, vol. 338. Springer (2009)
Wang, Z., Zhao, Y., Huang, H., Liu, J., Yin, A., Tang, L., Li, L., Wang, Y., Zhang, Z., Zhao, Z.: Connecting multi-modal contrastive representations. Adv. Neural. Inf. Process. Syst. 36 (2024)
Xu, H., Luo, D., Henao, R., Shah, S., Carin, L.: Learning autoencoders with relational regularization. In: International Conference on Machine Learning, pp. 10576–10586. PMLR (2020)
Zhang, R., Chen, C., Zhang, X., Bai, K., Carin, L.: Semantic matching for sequence-to-sequence learning. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 212–222 (2020)
Zheng, Q., Hong, J., Farazi, M.: A generative approach to audio-visual generalized zero-shot learning: combining contrastive and discriminative techniques. In: 2023 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2023)
Zhou, B., Parno, M.: Efficient and exact multimarginal optimal transport with pairwise costs. arXiv preprint arXiv:2208.03025 (2022)
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhu, S., Luo, D. (2025). Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_11
DOI: https://doi.org/10.1007/978-981-97-8795-1_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8794-4
Online ISBN: 978-981-97-8795-1
eBook Packages: Computer Science (R0)