Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15041)

Abstract

Multi-modal contrastive learning has gained significant attention in recent years, driven by the rapid growth of multi-modal data and rising practical demand for applications such as multi-modal pre-training, retrieval, and classification. Most existing multi-modal representation learning methods require well-aligned multi-modal data (e.g., image-text pairs). This requirement limits their applicability, because real-world multi-modal data are often only partially aligned, consisting of a small portion of well-aligned data and a large amount of unaligned data. In this study, we propose a novel optimal transport-based method that enhances multi-modal contrastive learning on partially aligned multi-modal data, providing an effective strategy for leveraging the information hidden in the unaligned data. The proposed method imposes an optimal transport (OT) regularizer on the multi-modal contrastive learning framework, aligning the latent representations of different modalities with consistency guarantees. We implement the OT regularizer in two ways: one based on a consistency-regularized loop of pairwise Wasserstein distances, and the other on a Wasserstein barycenter problem. We analyze the rationale behind our OT regularizer and compare its two implementations in depth. Experiments show that combining our OT regularizer with state-of-the-art contrastive learning methods improves performance on generalized zero-shot cross-modal retrieval and multi-modal classification tasks.
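To make the idea of an OT regularizer over unaligned latent codes concrete, the following is a minimal NumPy sketch and not the implementation from the paper. It assumes unit-normalized embeddings, uniform sample weights, a squared Euclidean cost, and entropic (Sinkhorn) regularization as a stand-in for the Wasserstein distance. The names `normalize`, `sinkhorn_cost`, and `loop_ot_regularizer` are hypothetical, and summing pairwise costs around a cycle of modalities is only one plausible reading of a "loop of pairwise Wasserstein distances".

```python
import numpy as np


def normalize(z):
    """Project each embedding onto the unit sphere, as is common in contrastive learning."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def sinkhorn_cost(x, y, eps=0.1, n_iters=200):
    """Entropic OT between two point clouds with uniform weights.

    Returns the transport cost <plan, cost> and the transport plan.
    Assumes unit-norm inputs so that exp(-cost / eps) does not underflow.
    """
    # Pairwise squared Euclidean cost matrix, shape (n, m).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform weights on the samples of x
    b = np.full(m, 1.0 / m)  # uniform weights on the samples of y
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternate scaling so the plan matches the row and column marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


def loop_ot_regularizer(latents, eps=0.1):
    """Sum of entropic OT costs around a cycle of modalities (1->2, 2->3, ..., K->1)."""
    k = len(latents)
    total = 0.0
    for i in range(k):
        cost, _ = sinkhorn_cost(latents[i], latents[(i + 1) % k], eps=eps)
        total += cost
    return total


# Hypothetical unaligned batches: the sample counts differ across modalities.
rng = np.random.default_rng(0)
image_z = normalize(rng.normal(size=(16, 8)))
text_z = normalize(rng.normal(size=(12, 8)))
audio_z = normalize(rng.normal(size=(20, 8)))
reg = loop_ot_regularizer([image_z, text_z, audio_z])
print(f"OT regularizer value: {reg:.4f}")
```

Because the regularizer compares distributions of latent codes rather than matched pairs, the batches need not have the same size or any known correspondence. In training, such a term would typically be added to the contrastive loss with a weighting coefficient and differentiated with an autodiff framework; both of these choices are assumptions here.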

Supported in part by the National Natural Science Foundation of China (62102031), and the foundation of Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, P.R. China (AI202409).

Notes

  1. To simplify the analysis, here we ignore the \(\max (\cdot ,\cdot )\) operation in the original triplet loss.

  2. TCAF uses temporal features of the three datasets, while the other baselines use average features.

Author information

Corresponding author

Correspondence to Dixin Luo.

Electronic supplementary material

Supplementary material 1 (PDF, 617 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhu, S., Luo, D. (2025). Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_11

  • DOI: https://doi.org/10.1007/978-981-97-8795-1_11

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8794-4

  • Online ISBN: 978-981-97-8795-1

  • eBook Packages: Computer Science (R0)
