Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment

Zhu, Sidan; Luo, Dixin

doi:10.1007/978-981-97-8795-1_11


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15041)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Multi-modal contrastive learning has gained significant attention in recent years due to the rapid growth of multi-modal data and increasing practical demand, e.g., for multi-modal pre-training, retrieval, and classification. Most existing multi-modal representation learning methods require well-aligned multi-modal data (e.g., image-text pairs). This requirement limits their applicability, because real-world multi-modal data are often only partially aligned, consisting of a small portion of well-aligned data and a massive amount of unaligned data. In this study, we propose a novel optimal transport-based method that enhances multi-modal contrastive learning on partially aligned multi-modal data and provides an effective strategy for leveraging the information hidden in the unaligned data. The proposed method imposes an optimal transport (OT) regularizer on the multi-modal contrastive learning framework, aligning the latent representations of different modalities with consistency guarantees. We implement the OT regularizer in two ways: one based on a consistency-regularized loop of pairwise Wasserstein distances, and the other based on a Wasserstein barycenter problem. We analyze the rationality of our OT regularizer and compare its two implementations in depth. Experiments show that combining our OT regularizer with state-of-the-art contrastive learning methods leads to better performance in generalized zero-shot cross-modal retrieval and multi-modal classification tasks.
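To make the idea of an OT regularizer between modalities concrete, the following is a minimal sketch, not the authors' implementation. It assumes two batches of unaligned latent embeddings, uniform sample weights, a squared Euclidean cost on L2-normalized embeddings, and entropic regularization solved by Sinkhorn iterations. The function names `sinkhorn_plan` and `ot_regularizer` and the parameters `eps` and `n_iters` are hypothetical, chosen for illustration only. How the paper combines pairwise terms into a consistency-regularized loop, or formulates the barycenter variant, is not shown here.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between two uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform weights on modality-1 samples
    b = np.full(m, 1.0 / m)          # uniform weights on modality-2 samples
    kernel = np.exp(-cost / eps)     # Gibbs kernel of the cost matrix
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)         # rescale rows toward marginal a
        v = b / (kernel.T @ u)       # rescale columns to match marginal b
    return u[:, None] * kernel * v[None, :]


def ot_regularizer(x, y, eps=0.1, n_iters=200):
    """Transport cost <plan, cost> between two sets of latent embeddings.

    Assumption: embeddings are L2-normalized so the squared Euclidean
    cost stays in [0, 4] and the Gibbs kernel does not underflow.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    plan = sinkhorn_plan(cost, eps=eps, n_iters=n_iters)
    return float((plan * cost).sum()), plan


# Toy usage: 8 "image" and 6 "text" embeddings with no known pairing.
rng = np.random.default_rng(0)
image_latents = rng.normal(size=(8, 16))
text_latents = rng.normal(size=(6, 16))
reg_value, plan = ot_regularizer(image_latents, text_latents)
print(reg_value, plan.shape)
```

In a training loop, a term like `reg_value` would typically be added to the contrastive loss with a weighting coefficient and computed in a differentiable framework; the NumPy version above only illustrates the computation on fixed embeddings.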

Supported in part by the National Natural Science Foundation of China (62102031), and the foundation of Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, P.R. China (AI202409).

Notes

1. To simplify the analysis, we ignore the $\max(\cdot,\cdot)$ operation in the original triplet loss.
2. TCAF uses temporal features of the three datasets, while the other baselines use averaged features.
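As an illustration of the simplification in Note 1, the block below writes out the commonly used form of the triplet loss and its version without the max operation. The symbols (anchor $a$, positive $p$, negative $n$, distance $d$, margin $m$) are assumptions taken from the standard formulation, not from the paper's own notation.

```latex
% Standard triplet loss (hinge form)
\mathcal{L}_{\mathrm{triplet}}(a, p, n) = \max\bigl(0,\; d(a, p) - d(a, n) + m\bigr)

% Simplified form with the max operation dropped, as described in Note 1
\tilde{\mathcal{L}}_{\mathrm{triplet}}(a, p, n) = d(a, p) - d(a, n) + m
```

Dropping the max makes the loss linear in the two distances, which is usually what makes such an analysis tractable.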

Author information

Authors and Affiliations

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Sidan Zhu & Dixin Luo
Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, China
Dixin Luo

Corresponding author

Correspondence to Dixin Luo.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Zhouchen Lin
Nankai University, Tianjin, China
Ming-Ming Cheng
Chinese Academy of Sciences, Beijing, China
Ran He
Xinjiang University, Ürümqi, Xinjiang, China
Kurban Ubul
Xinjiang University, Ürümqi, China
Wushouer Silamu
Peking University, Beijing, China
Hongbin Zha
Tsinghua University, Beijing, China
Jie Zhou
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 617 KB)

About this paper

Cite this paper

Zhu, S., Luo, D. (2025). Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_11

DOI: https://doi.org/10.1007/978-981-97-8795-1_11
Published: 03 November 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8794-4
Online ISBN: 978-981-97-8795-1
eBook Packages: Computer Science (R0)
