
Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

  • Research Article
  • Published in: Machine Intelligence Research

Abstract

Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation between each text-image pair, and are thus difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: Many recent works adopt the single-tower architecture with heavy detectors, which is inefficient at the inference stage because the costly computation must be repeated for each text-image pair. In this work, to overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture, which maps the text and image modalities into a unified feature space where they can be directly compared with each other, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that our CMCL can be readily generalized to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that our CMCL outperforms state-of-the-art methods while being much more efficient.
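Since the abstract describes the method only at a high level, the following PyTorch sketch is purely illustrative of the ideas named there: a two-tower design whose image and text embeddings live in one shared space, a multi-grid split that pools the image feature map at several granularities without a detector, and a symmetric InfoNCE-style contrastive loss on the global embeddings. All module names, feature dimensions, grid sizes, and the temperature value are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch of a two-tower contrastive setup in the spirit of CMCL.
# Module names, dimensions, grid sizes, and the temperature are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGridSplit(nn.Module):
    """Split a feature map into multi-scale grids and average-pool each cell (illustrative)."""
    def __init__(self, grid_sizes=(1, 2, 4)):
        super().__init__()
        self.grid_sizes = grid_sizes

    def forward(self, feat_map):                          # feat_map: (B, C, H, W)
        cells = []
        for g in self.grid_sizes:
            pooled = F.adaptive_avg_pool2d(feat_map, g)   # (B, C, g, g)
            cells.append(pooled.flatten(2).transpose(1, 2))  # (B, g*g, C)
        return torch.cat(cells, dim=1)                    # (B, sum(g*g), C)


class TwoTowerCMCL(nn.Module):
    """Two independent encoders projected into a shared embedding space."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        self.mgs = MultiGridSplit()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.temperature = temperature

    def encode_image(self, feat_map):
        grids = self.mgs(feat_map)          # fine-grained grid cells, no detector
        global_img = grids.mean(dim=1)      # aggregate cells into a global image feature
        return F.normalize(self.img_proj(global_img), dim=-1)

    def encode_text(self, txt_feat):        # txt_feat: (B, txt_dim), e.g. a [CLS] feature
        return F.normalize(self.txt_proj(txt_feat), dim=-1)

    def contrastive_loss(self, img_emb, txt_emb):
        # Symmetric InfoNCE: matched pairs on the diagonal are positives,
        # all other pairs in the batch serve as negatives.
        logits = img_emb @ txt_emb.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = TwoTowerCMCL()
    img_feat = torch.randn(8, 2048, 7, 7)   # e.g., a CNN/ViT feature map
    txt_feat = torch.randn(8, 768)          # e.g., a BERT sentence feature
    loss = model.contrastive_loss(model.encode_image(img_feat),
                                  model.encode_text(txt_feat))
    print(loss.item())
```

Because the two towers are independent, image and text embeddings can be pre-computed offline, so retrieval reduces to a dot-product lookup; this is the efficiency advantage over single-tower models, which must re-run a heavy joint encoder for every candidate text-image pair.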


Author information


Corresponding author

Correspondence to Zhiwu Lu.

Additional information

Haoyu Lu received the B. Sc. degree in applied mathematics from Renmin University of China, China in 2021. He is currently a Ph. D. degree candidate in artificial intelligence at the Gaoling School of Artificial Intelligence, Renmin University of China, China.

His research interests include cross-modal pre-training models and video representation learning.

Yuqi Huo received the B. Sc. and Ph. D. degrees in computer science from Renmin University of China, China in 2017 and 2022, respectively.

His research interests include video representation learning and cross-modal pre-training.

Mingyu Ding received the B. Eng. and M. Eng. degrees in computer science and technology from Renmin University of China, China in 2017 and 2020, respectively. He is currently a Ph. D. degree candidate in computer science and technology at the Department of Computer Science, The University of Hong Kong, China.

His research interests include visual perception, neural architecture design, and autonomous driving.

Nanyi Fei received the B. Sc. degree in computer science and technology from Renmin University of China, China in 2019. He is currently a Ph. D. degree candidate in computer science and technology at the School of Information, Renmin University of China, China.

His research interests include multimodal learning and computer vision.

Zhiwu Lu received the M. Sc. degree in applied mathematics from Peking University, China in 2005, and the Ph. D. degree in computer science from City University of Hong Kong, China in 2011. He is currently a full professor at the Gaoling School of Artificial Intelligence, Renmin University of China, China. He won the Best Paper Award at CGI 2014 and the IBM SUR Award in 2015.

His research interests lie in machine learning, computer vision, and multimodal pre-training.

Colored figures are available in the online version at https://link.springer.com/journal/11633


About this article


Cite this article

Lu, H., Huo, Y., Ding, M. et al. Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval. Mach. Intell. Res. 20, 569–582 (2023). https://doi.org/10.1007/s11633-022-1386-4

