Abstract
Recent years have seen significant progress in multi-modal learning driven by Vision-Language Pre-training (VLP) models. However, most of them rely on coarse-grained global alignment to bridge the semantic gap when generating common representations, which makes them inadequate for capturing intrinsic semantic correlations in image-text retrieval and consequently degrades retrieval accuracy. Moreover, fine-tuning a VLP model for image-text retrieval is expensive due to its large number of parameters. In this paper, we propose a simple yet effective image-text retrieval method, termed Cross-Modality Interaction Reasoning for enhancing Vision-Language Pre-training (CMIR-VLP). Specifically, a Cross-Modality Interaction Reasoning (CMIR) module, designed to inject fine-grained image-text associations into semantic-correlation learning, integrates patch cues into word reasoning via a multi-modal interaction encoder. In addition, we propose a cross-interaction process that associates each local text semantic with local visual information for fine-grained image-text alignment. Extensive experiments demonstrate that our method outperforms state-of-the-art non-pre-training methods by 52 and 97.5 on two widely used datasets, and it also outperforms several mainstream fine-tuned VLP models. The related code repository is available at https://github.com/PSYGIM/CMIR-VLP.
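The CMIR module itself is specified in the body of the paper; purely as a hedged illustration of the two ingredients named above (injecting patch cues into word-level reasoning, and aligning local text semantics with local visual regions), the sketch below shows a generic bidirectional cross-attention encoder and a max-over-patches alignment score in PyTorch. Every class name, dimension, and the scoring rule are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalInteraction(nn.Module):
    """Illustrative sketch (not the authors' code): word features attend
    over image-patch features so each word gathers local visual context,
    and a symmetric pass lets patches gather word-level context."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention in both directions: text queries against image
        # keys/values, and vice versa.
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, words: torch.Tensor, patches: torch.Tensor):
        # words:   (B, Lt, D) token-level text features
        # patches: (B, Lv, D) patch-level image features
        t, _ = self.txt2img(words, patches, patches)  # patch cues into words
        v, _ = self.img2txt(patches, words, words)    # word cues into patches
        return self.norm_t(words + t), self.norm_v(patches + v)


def fine_grained_similarity(words: torch.Tensor, patches: torch.Tensor):
    # One common fine-grained scoring rule (an assumption here): each word
    # takes its best-matching patch by cosine similarity, then scores are
    # averaged over words to give one image-text score per pair.
    w = F.normalize(words, dim=-1)            # (B, Lt, D)
    p = F.normalize(patches, dim=-1)          # (B, Lv, D)
    sim = torch.bmm(w, p.transpose(1, 2))     # (B, Lt, Lv) word-patch cosines
    return sim.max(dim=2).values.mean(dim=1)  # (B,)


# Example: a batch of 4 pairs with 12 words, 49 patches, 512-dim features.
words = torch.randn(4, 12, 512)
patches = torch.randn(4, 49, 512)
enc = CrossModalInteraction()
w, p = enc(words, patches)
print(fine_grained_similarity(w, p).shape)  # torch.Size([4])
```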





Data Availability
Data from MIRFlickr is available at https://press.liacs.nl/mirflickr/ (ref. [29]). MSCOCO data can be accessed at https://cocodataset.org/ (ref. [19]).
Code Availability
The code is available in the related repository at https://github.com/PSYGIM/CMIR-VLP.
References
Chen H, Ding G, Liu X, Lin Z, Liu J, Han J (2020) Imram: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12655–12663
Cheng Y, Zhu X, Qian J, Wen F, Liu P (2022) Cross-modal graph matching network for image-text retrieval. ACM Trans Multimed Comput Commun Appl 18(4):1–23
Dey R, Salem F (2017) Gate-variants of gated recurrent unit (gru) neural networks. In: 2017 IEEE 60th international midwest symposium on circuits and systems, pp 1597–1600
Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text retrieval. In: Proceedings of the AAAI conference on artificial intelligence, pp 1218–1226
Faghri F, Fleet D, Kiros J, Fidler S (2018) Vse++: improving visual-semantic embeddings with hard negatives. In: Proceedings of the British machine vision conference
Feng D, He X, Peng Y (2023) Mkvse: multimodal knowledge enhanced visual-semantic embedding for image-text retrieval. ACM Trans Multimed Comput Commun Appl 19(5):1–21
Fu Z, Mao Z, Song Y, Zhang Y (2023) Learning semantic relationship among instances for image-text matching. In: Proceedings of the IEEE/CVF conference on computer vision pattern recognition, pp 15159–15168
Ge X, Chen F, Jose J, Ji Z, Wu Z, Liu X (2021) Structured multi-modal feature embedding and alignment for image-sentence retrieval. In: Proceedings of the 29th ACM international conference on multimedia, pp 5185–5193
Hu Z, Luo Y, Lin J, Yan Y, Chen J (2019) Multi-level visual-semantic alignments with relation-wise dual attention network for image and text matching. In: International joint conferences on artificial intelligence, pp 789–795
Ji Z, Chen K, Wang H (2021) Step-wise hierarchical alignment network for image-text retrieval. In: Proceedings of the thirtieth international joint conference on artificial intelligence
Karpathy A, Joulin A, Fei-Fei L (2014) Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems, vol 27
Kim W, Son B, Kim I (2021) Vilt: vision-and-language transformer without convolution or region supervision. In: International conference on machine learning, pp 5583–5594
Kim D, Kim N, Kwak S (2023) Improving cross-modal retrieval with set of diverse embeddings. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23422–23431
Klein B, Lev G, Sadeh G, Wolf L (2015) Associating neural word embeddings with deep image representations using fisher vectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4437–4446
Lee K, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text retrieval. In: Proceedings of the European conference on computer vision, pp 201–216
Li J, Niu L, Zhang L (2022) Action-aware embedding enhancement for image-text retrieval. In: Proceedings of the AAAI conference on artificial intelligence, pp 1323–1331
Li J, Li D, Xiong C, Hoi S (2022) Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning, pp 12888–12900
Li K, Zhang Y, Li K, Li Y, Fu Y (2019) Visual semantic reasoning for image-text retrieval. In: Proceedings of the IEEE/CVF International conference on computer vision, pp 4654–4662
Lin T, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C (2014) Microsoft coco: common objects in context. In: Computer vision–ECCV, pp 740–755
Liu C, Mao Z, Liu A, Zhang T, Wang B, Zhang Y (2019) Focus your attention: a bidirectional focal attention network for image-text retrieval. In: Proceedings of the 27th ACM international conference on multimedia, pp 3–11
Liu Y, Liu H, Wang H, Liu M (2022) Regularizing visual semantic embedding with contrastive learning for image-text retrieval. IEEE Signal Process Lett 29:1332–1336
Li K, Zhang Y, Li K, Li Y, Fu Y (2022) Image-text embedding learning via visual and textual semantic reasoning. IEEE Trans Pattern Anal Mach Intell 45:641–656
Long S, Han S, Wan X, Poon J (2022) Gradual: Graph-based dual-modal representation for image-text matching. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3459–3468
Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, vol 32
Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in neural information processing systems vol 27
Nie L, Qu L, Meng D, Zhang M, Tian Q, Bimbo A (2022) Search-oriented micro-video captioning. In: Proceedings of the 30th ACM international conference on multimedia, pp 3234–3243
Pan Z, Wu F, Zhang B (2023) Fine-grained image-text matching by cross-modal hard aligning network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19275–19284
Peng L, Qian J, Wang C, Liu B, Dong Y (2023) Swin transformer-based supervised hashing. Appl Intell, pp 1–13
Plummer B, Wang L, Cervantes C, Caicedo J, Hockenmaier J, Lazebnik S (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision, pp 2641–2649
Radford A, Kim J, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning, pp 8748–8763
Shu Z, Li L, Yu J, Zhang D, Yu Z, Wu X (2023) Online supervised collective matrix factorization hashing for cross-modal retrieval. Appl Intell, pp 14201–14218
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Wang H, Zhang Y, Ji Z, Pang Y, Ma L (2020) Consensus-aware visual-semantic embedding for image-text retrieval. In: Computer vision–ECCV, pp 18–34
Wang J, Zhou P, Shou M, Yan S (2023) Position-guided text prompt for vision-language pre-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23242–23251
Wang S, Wang R, Yao Z, Shan S, Chen X (2020) Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1508–1517
Wang Z, Liu X, Li H, Sheng L, Yan J, Wang X, Shao J (2019) Camp: cross-modal adaptive message passing for text-image retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5764–5773
Wu Y, Wang S, Song G, Huang Q (2019) Learning fragment self-attention embeddings for image-text retrieval. In: Proceedings of the 27th ACM international conference on multimedia, pp 2088–2096
Wu H, Liu Y, Cai H, He S (2022) Learning transferable perturbations for image captioning. ACM Trans Multimed Comput Commun Appl, pp 1–18
Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6281–6290
Yu R, Jin F, Qiao Z, Yuan Y, Wang G (2023) Multi-scale image-text matching network for scene and spatio-temporal images. Future Gener Comput Syst, pp 292–300
Zhang K, Mao Z, Wang Q, Zhang Y (2022) Negative-aware attention framework for image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15661–15670
Zhang Q, Lei Z, Zhang Z, Li S (2020) Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3536–3545
Zhu J, Li Z, Zeng Y, Wei J, Ma H (2022) Image-text retrieval with fine-grained relational dependency and bidirectional attention-based generative networks. In: Proceedings of the 30th ACM international conference on multimedia, pp 395–403
Acknowledgements
This work is supported by the Natural Science Foundation of Shandong Province (Grant No. ZR2023MF031), the National Natural Science Foundation of China (Grant Nos. 62102186 and 62171209), and the Shandong Province Science and Technology-based Minor Enterprises Innovation Capability Enhancement Project (2023TSGC0877).
Author information
Authors and Affiliations
Contributions
Tao Yao: Visualization, Investigation, Funding acquisition. Shouyong Peng: Conceptualization, Methodology, Software, Implementation of the computer code and supporting algorithms, Writing - Original Draft. Lili Wang: Writing - Review and Editing. Ying Li: Software, Validation. Yujuan Sun: Writing - Review and Editing.
Corresponding author
Ethics declarations
Competing Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Ethics Approval
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Ethical and Informed Consent for Data Used
Not applicable.
Ethical Standards
The authors declare that this manuscript is original, has not been published before, and is not currently being considered for publication elsewhere. The authors confirm that the manuscript has been read and approved by all named authors and that no other persons satisfied the criteria for authorship but are not listed. The authors further confirm that the order of authors listed in the manuscript has been approved by all authors. The authors understand that the corresponding author is the sole contact for the editorial process and is responsible for communicating with the other authors about progress, submissions of revisions, and final approval of proofs.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yao, T., Peng, S., Wang, L. et al. Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval. Appl Intell 54, 12230–12245 (2024). https://doi.org/10.1007/s10489-024-05823-1