Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Applied Intelligence

Abstract

Recent years have seen significant advances in multi-modal learning driven by Vision-Language Pre-training (VLP) models. However, most of them rely on coarse-grained global alignment to bridge the semantic gap and produce common representations, which leaves them unable to capture the intrinsic semantic correlations required for image-text retrieval and consequently degrades accuracy. Moreover, fine-tuning a VLP model for image-text retrieval is expensive due to its large number of parameters. In this paper, we propose a simple yet effective image-text retrieval method, termed Cross-Modality Interaction Reasoning for enhancing Vision-Language Pre-training (CMIR-VLP). Specifically, a Cross-Modality Interaction Reasoning (CMIR) module, designed to inject fine-grained image-text associations into semantic correlation learning, integrates patch cues into word reasoning through a multi-modal interaction encoder. In addition, we propose a cross-interaction process that associates each local text semantic with local visual information for fine-grained image-text alignment. Extensive experiments demonstrate that our method achieves gains of 52 and 97.5 over state-of-the-art non-pre-training methods on two widely used datasets, and that it also outperforms several mainstream fine-tuned VLP models. The related code repository is at https://github.com/PSYGIM/CMIR-VLP.
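
The abstract describes the patch-word interaction only at a high level. As an illustration, the PyTorch sketch below shows one plausible form of such a module: word tokens attend to image patches via cross-attention so that patch cues are injected into word reasoning, and local word-patch similarities are then aggregated into a single matching score. All names, dimensions, and the max-mean aggregation are assumptions made here for illustration, not the authors' CMIR implementation.

```python
# A minimal sketch (an assumption, not the authors' released code): word tokens
# query image patches via cross-attention so that patch cues are injected into
# word reasoning, then a word-patch similarity map is aggregated into one score.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalInteraction(nn.Module):
    """Hypothetical patch-word interaction encoder (illustrative only)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, words: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # words:   (B, Nw, dim) token features from a text encoder
        # patches: (B, Np, dim) patch features from an image encoder
        attended, _ = self.cross_attn(query=words, key=patches, value=patches)
        words = self.norm1(words + attended)        # inject patch cues into word reasoning
        return self.norm2(words + self.ffn(words))  # position-wise refinement


def fine_grained_similarity(words: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
    """Aggregate local word-patch cosine similarities into an image-text score."""
    w = F.normalize(words, dim=-1)
    p = F.normalize(patches, dim=-1)
    sim = torch.einsum("bwd,bpd->bwp", w, p)        # (B, Nw, Np) local alignments
    return sim.max(dim=-1).values.mean(dim=-1)      # best patch per word, averaged over words


if __name__ == "__main__":
    words = torch.randn(2, 12, 512)                 # 2 captions, 12 tokens each
    patches = torch.randn(2, 196, 512)              # 2 images, 14x14 patches each
    encoder = CrossModalInteraction()
    scores = fine_grained_similarity(encoder(words, patches), patches)
    print(scores.shape)                             # torch.Size([2])
```

The max-over-patches, mean-over-words aggregation mirrors the fine-grained matching style popularized by stacked cross attention [15]; the actual CMIR module and cross-interaction process may differ in detail.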


Data Availability

Data from MIRFlickr is available at https://press.liacs.nl/mirflickr/ (ref. [29]). MSCOCO data can be accessed at https://cocodataset.org/ (ref. [19]).

Code Availability

The code will be released upon acceptance of the paper; the related repository is at https://github.com/PSYGIM/CMIR-VLP.

References

  1. Chen H, Ding G, Liu X, Lin Z, Liu J, Han J (2020) Imram: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12655–12663

  2. Cheng Y, Zhu X, Qian J, Wen F, Liu P (2022) Cross-modal graph matching network for image-text retrieval. ACM Trans Multimed Comput Commun Appl 18(4):1–23

  3. Dey R, Salem F (2017) Gate-variants of gated recurrent unit (gru) neural networks. In: 2017 IEEE 60th international midwest symposium on circuits and systems, pp 1597–1600

  4. Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text retrieval. In: Proceedings of the AAAI conference on artificial intelligence, pp 1218–1226

  5. Faghri F, Fleet D, Kiros J, Fidler S (2018) Vse++: improving visual-semantic embeddings with hard negatives. In: Proceedings of the British machine vision conference

  6. Feng D, He X, Peng Y (2023) Mkvse: multimodal knowledge enhanced visual-semantic embedding for image-text retrieval. ACM Trans Multimed Comput Commun Appl 19(5):1–21

  7. Fu Z, Mao Z, Song Y, Zhang Y (2023) Learning semantic relationship among instances for image-text matching. In: Proceedings of the IEEE/CVF conference on computer vision pattern recognition, pp 15159–15168

  8. Ge X, Chen F, Jose J, Ji Z, Wu Z, Liu X (2021) Structured multi-modal feature embedding and alignment for image-sentence retrieval. In: Proceedings of the 29th ACM international conference on multimedia, pp 5185–5193

  9. Hu Z, Luo Y, Lin J, Yan Y, Chen J (2019) Multi-level visual-semantic alignments with relation-wise dual attention network for image and text matching. In: International joint conferences on artificial intelligence, pp 789–795

  10. Ji Z, Chen K, Wang H (2021) Step-wise hierarchical alignment network for image-text retrieval. In: Proceedings of the thirtieth international joint conference on artificial intelligence

  11. Karpathy A, Joulin A, Fei-Fei L (2014) Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems, vol 27

  12. Kim W, Son B, Kim I (2021) Vilt: vision-and-language transformer without convolution or region supervision. In: International conference on machine learning, pp 5583–5594

  13. Kim D, Kim N, Kwak S (2023) Improving cross-modal retrieval with set of diverse embeddings. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23422–23431

  14. Klein B, Lev G, Sadeh G, Wolf L (2015) Associating neural word embeddings with deep image representations using fisher vectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4437–4446

  15. Lee K, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text retrieval. In: Proceedings of the European conference on computer vision, pp 201–216

  16. Li J, Niu L, Zhang L (2022) Action-aware embedding enhancement for image-text retrieval. In: Proceedings of the AAAI conference on artificial intelligence, pp 1323–1331

  17. Li J, Li D, Xiong C, Hoi S (2022) Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning, pp 12888–12900

  18. Li K, Zhang Y, Li K, Li Y, Fu Y (2019) Visual semantic reasoning for image-text retrieval. In: Proceedings of the IEEE/CVF International conference on computer vision, pp 4654–4662

  19. Lin T, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C (2014) Microsoft coco: common objects in context. In: Computer vision–ECCV, pp 740–755

  20. Liu C, Mao Z, Liu A, Zhang T, Wang B, Zhang Y (2019) Focus your attention: a bidirectional focal attention network for image-text retrieval. In: Proceedings of the 27th ACM international conference on multimedia, pp 3–11

  21. Liu Y, Liu H, Wang H, Liu M (2022) Regularizing visual semantic embedding with contrastive learning for image-text retrieval. IEEE Signal Process Lett 29:1332–1336

  22. Li K, Zhang Y, Li K, Li Y, Fu Y (2022) Image-text embedding learning via visual and textual semantic reasoning. IEEE Trans Pattern Anal Mach Intell 45:641–656

  23. Long S, Han S, Wan X, Poon J (2022) Gradual: Graph-based dual-modal representation for image-text matching. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3459–3468

  24. Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, vol 32

  25. Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in neural information processing systems vol 27

  26. Nie L, Qu L, Meng D, Zhang M, Tian Q, Bimbo A (2022) Search-oriented micro-video captioning. In: Proceedings of the 30th ACM international conference on multimedia, pp 3234–3243

  27. Pan Z, Wu F, Zhang B (2023) Fine-grained image-text matching by cross-modal hard aligning network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19275–19284

  28. Peng L, Qian J, Wang C, Liu B, Dong Y (2023) Swin transformer-based supervised hashing. Applied Intelligence, pp 1–13

  29. Plummer B, Wang L, Cervantes C, Caicedo J, Hockenmaier J, Lazebnik S (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision, pp 2641–2649

  30. Radford A, Kim J, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning, pp 8748–8763

  31. Shu Z, Li L, Yu J, Zhang D, Yu Z, Wu X (2023) Online supervised collective matrix factorization hashing for cross-modal retrieval. Applied Intelligence, pp 14201–14218

  32. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  33. Wang H, Zhang Y, Ji Z, Pang Y, Ma L (2020) Consensus-aware visual-semantic embedding for image-text retrieval. In: Computer vision–ECCV, pp 18–34

  34. Wang J, Zhou P, Shou M, Yan S (2023) Position-guided text prompt for vision-language pre-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23242–23251

  35. Wang S, Wang R, Yao Z, Shan S, Chen X (2020) Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1508–1517

  36. Wang Z, Liu X, Li H, Sheng L, Yan J, Wang X, Shao J (2019) Camp: cross-modal adaptive message passing for text-image retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5764–5773

  37. Wu Y, Wang S, Song G, Huang Q (2019) Learning fragment self-attention embeddings for image-text retrieval. In: Proceedings of the 27th ACM international conference on multimedia, pp 2088–2096

  38. Wu H, Liu Y, Cai H, He S (2022) Learning transferable perturbations for image captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, pp 1–18

  39. Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6281–6290

  40. Yu R, Jin F, Qiao Z, Yuan Y, Wang G (2023) Multi-scale image-text matching network for scene and spatio-temporal images. Future Generation Computer Systems, pp 292–300

  41. Zhang K, Mao Z, Wang Q, Zhang Y (2022) Negative-aware attention framework for image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15661–15670

  42. Zhang Q, Lei Z, Zhang Z, Li S (2020) Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3536–3545

  43. Zhu J, Li Z, Zeng Y, Wei J, Ma H (2022) Image-text retrieval with fine-grained relational dependency and bidirectional attention-based generative networks. In: Proceedings of the 30th ACM international conference on multimedia, pp 395–403

Acknowledgements

This work is supported by the Natural Science Foundation of Shandong Province (Grant No. ZR2023MF031), the National Natural Science Foundation of China (No. 62102186, No. 62171209), and the Shandong Province Science and Technology-based Small and Medium-sized Enterprises Innovation Capability Enhancement Project (2023TSGC0877).

Author information

Authors and Affiliations

Authors

Contributions

Tao Yao: Visualization, Investigation, Funding acquisition. Shouyong Peng: Conceptualization, Methodology, Software, Implementation of the computer code and supporting algorithms, Writing - original draft. Lili Wang: Writing - review and editing. Ying Li: Software, Validation. Yujuan Sun: Writing - review and editing.

Corresponding author

Correspondence to Tao Yao.

Ethics declarations

Competing Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Ethics Approval

Not applicable.

Ethical Standards

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Ethical and Informed Consent for Data Used

Not applicable.

Ethical Standards

The authors declare that this manuscript is original, has not been published before, and is not currently being considered for publication elsewhere. The authors confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. The authors further confirm that the order of authors listed in the manuscript has been approved by all authors. The authors understand that the corresponding author is the sole contact for the editorial process and is responsible for communicating with the other authors about progress, submissions of revisions, and the final approval of proofs.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yao, T., Peng, S., Wang, L. et al. Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval. Appl Intell 54, 12230–12245 (2024). https://doi.org/10.1007/s10489-024-05823-1
