Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Applied Intelligence

Abstract

Recent years have seen significant advances in multi-modal learning driven by Vision-Language Pre-training (VLP) models. However, most of them rely on coarse-grained global alignment to bridge the semantic gap and produce common representations, which leaves them unable to capture the intrinsic semantic correlations required for image-text retrieval and consequently degrades accuracy. Moreover, fine-tuning a VLP model for image-text retrieval is expensive due to its large number of parameters. In this paper, we propose a simple yet effective image-text retrieval method, termed Cross-Modality Interaction Reasoning for enhancing Vision-Language Pre-training (CMIR-VLP). Specifically, a Cross-Modality Interaction Reasoning (CMIR) module, designed to inject fine-grained image-text associations into semantic correlation learning, integrates patch cues into word reasoning through a multi-modal interaction encoder. In addition, we propose a cross-interaction process that associates each local text semantic with local visual information for fine-grained image-text alignment. Extensive experiments demonstrate that our method achieves gains of 52 and 97.5 over state-of-the-art non-pre-training methods on two widely used datasets, and that it also outperforms several mainstream fine-tuned VLP models. The related code repository is at https://github.com/PSYGIM/CMIR-VLP.
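
The abstract describes the patch-word interaction only at a high level. As an illustration, the PyTorch sketch below shows one plausible form of such a module: word tokens attend to image patches via cross-attention so that patch cues are injected into word reasoning, and local word-patch similarities are then aggregated into a single matching score. All names, dimensions, and the max-mean aggregation are assumptions made here for illustration, not the authors' CMIR implementation.

```python
# A minimal sketch (an assumption, not the authors' released code): word tokens
# query image patches via cross-attention so that patch cues are injected into
# word reasoning, then a word-patch similarity map is aggregated into one score.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalInteraction(nn.Module):
    """Hypothetical patch-word interaction encoder (illustrative only)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, words: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # words:   (B, Nw, dim) token features from a text encoder
        # patches: (B, Np, dim) patch features from an image encoder
        attended, _ = self.cross_attn(query=words, key=patches, value=patches)
        words = self.norm1(words + attended)        # inject patch cues into word reasoning
        return self.norm2(words + self.ffn(words))  # position-wise refinement


def fine_grained_similarity(words: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
    """Aggregate local word-patch cosine similarities into an image-text score."""
    w = F.normalize(words, dim=-1)
    p = F.normalize(patches, dim=-1)
    sim = torch.einsum("bwd,bpd->bwp", w, p)        # (B, Nw, Np) local alignments
    return sim.max(dim=-1).values.mean(dim=-1)      # best patch per word, averaged over words


if __name__ == "__main__":
    words = torch.randn(2, 12, 512)                 # 2 captions, 12 tokens each
    patches = torch.randn(2, 196, 512)              # 2 images, 14x14 patches each
    encoder = CrossModalInteraction()
    scores = fine_grained_similarity(encoder(words, patches), patches)
    print(scores.shape)                             # torch.Size([2])
```

The max-over-patches, mean-over-words aggregation mirrors the fine-grained matching style popularized by stacked cross attention [15]; the actual CMIR module and cross-interaction process may differ in detail.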


Data Availability

Data from MIRFlickr is available at https://press.liacs.nl/mirflickr/ (ref. [29]). MSCOCO data can be accessed at https://cocodataset.org/ (ref. [19]).

Code Availability

The code will be released upon acceptance of the paper; the related repository is at https://github.com/PSYGIM/CMIR-VLP.

References

  1. Chen H, Ding G, Liu X, Lin Z, Liu J, Han J (2020) Imram: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12655–12663

  2. Cheng Y, Zhu X, Qian J, Wen F, Liu P (2022) Cross-modal graph matching network for image-text retrieval. ACM Trans Multimed Comput Commun Appl 18(4):1–23

  3. Dey R, Salem F (2017) Gate-variants of gated recurrent unit (gru) neural networks. In: 2017 IEEE 60th international midwest symposium on circuits and systems, pp 1597–1600

  4. Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text retrieval. In: Proceedings of the AAAI conference on artificial intelligence, pp 1218–1226

  5. Faghri F, Fleet D, Kiros J, Fidler S (2018) Vse++: improving visual-semantic embeddings with hard negatives. In: Proceedings of the British machine vision conference

  6. Feng D, He X, Peng Y (2023) Mkvse: multimodal knowledge enhanced visual-semantic embedding for image-text retrieval. ACM Trans Multimed Comput Commun Appl 19(5):1–21

  7. Fu Z, Mao Z, Song Y, Zhang Y (2023) Learning semantic relationship among instances for image-text matching. In: Proceedings of the IEEE/CVF conference on computer vision pattern recognition, pp 15159–15168

  8. Ge X, Chen F, Jose J, Ji Z, Wu Z, Liu X (2021) Structured multi-modal feature embedding and alignment for image-sentence retrieval. In: Proceedings of the 29th ACM international conference on multimedia, pp 5185–5193

  9. Hu Z, Luo Y, Lin J, Yan Y, Chen J (2019) Multi-level visual-semantic alignments with relation-wise dual attention network for image and text matching. In: International joint conferences on artificial intelligence, pp 789–795

  10. Ji Z, Chen K, Wang H (2021) Step-wise hierarchical alignment network for image-text retrieval. In: Proceedings of the thirtieth international joint conference on artificial intelligence

  11. Karpathy A, Joulin A, Fei-Fei L (2014) Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems, vol 27

  12. Kim W, Son B, Kim I (2021) Vilt: vision-and-language transformer without convolution or region supervision. In: International conference on machine learning, pp 5583–5594

  13. Kim D, Kim N, Kwak S (2023) Improving cross-modal retrieval with set of diverse embeddings. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23422–23431

  14. Klein B, Lev G, Sadeh G, Wolf L (2015) Associating neural word embeddings with deep image representations using fisher vectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4437–4446

  15. Lee K, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text retrieval. In: Proceedings of the European conference on computer vision, pp 201–216

  16. Li J, Niu L, Zhang L (2022) Action-aware embedding enhancement for image-text retrieval. In: Proceedings of the AAAI conference on artificial intelligence, pp 1323–1331

  17. Li J, Li D, Xiong C, Hoi S (2022) Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning, pp 12888–12900

  18. Li K, Zhang Y, Li K, Li Y, Fu Y (2019) Visual semantic reasoning for image-text retrieval. In: Proceedings of the IEEE/CVF International conference on computer vision, pp 4654–4662

  19. Lin T, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C (2014) Microsoft coco: common objects in context. In: Computer vision–ECCV, pp 740–755

  20. Liu C, Mao Z, Liu A, Zhang T, Wang B, Zhang Y (2019) Focus your attention: a bidirectional focal attention network for image-text retrieval. In: Proceedings of the 27th ACM international conference on multimedia, pp 3–11

  21. Liu Y, Liu H, Wang H, Liu M (2022) Regularizing visual semantic embedding with contrastive learning for image-text retrieval. IEEE Signal Process Lett 29:1332–1336

  22. Li K, Zhang Y, Li K, Li Y, Fu Y (2022) Image-text embedding learning via visual and textual semantic reasoning. IEEE Trans Pattern Anal Mach Intell 45:641–656

  23. Long S, Han S, Wan X, Poon J (2022) Gradual: Graph-based dual-modal representation for image-text matching. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3459–3468

  24. Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, vol 32

  25. Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in neural information processing systems vol 27

  26. Nie L, Qu L, Meng D, Zhang M, Tian Q, Bimbo A (2022) Search-oriented micro-video captioning. In: Proceedings of the 30th ACM international conference on multimedia, pp 3234–3243

  27. Pan Z, Wu F, Zhang B (2023) Fine-grained image-text matching by cross-modal hard aligning network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19275–19284

  28. Peng L, Qian J, Wang C, Liu B, Dong Y (2023) Swin transformer-based supervised hashing. Applied Intelligence, pp 1–13

  29. Plummer B, Wang L, Cervantes C, Caicedo J, Hockenmaier J, Lazebnik S (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision, pp 2641–2649

  30. Radford A, Kim J, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning, pp 8748–8763

  31. Shu Z, Li L, Yu J, Zhang D, Yu Z, Wu X (2023) Online supervised collective matrix factorization hashing for cross-modal retrieval. Applied Intelligence, pp 14201–14218

  32. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  33. Wang H, Zhang Y, Ji Z, Pang Y, Ma L (2020) Consensus-aware visual-semantic embedding for image-text retrieval. In: Computer vision–ECCV, pp 18–34

  34. Wang J, Zhou P, Shou M, Yan S (2023) Position-guided text prompt for vision-language pre-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23242–23251

  35. Wang S, Wang R, Yao Z, Shan S, Chen X (2020) Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1508–1517

  36. Wang Z, Liu X, Li H, Sheng L, Yan J, Wang X, Shao J (2019) Camp: cross-modal adaptive message passing for text-image retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5764–5773

  37. Wu Y, Wang S, Song G, Huang Q (2019) Learning fragment self-attention embeddings for image-text retrieval. In: Proceedings of the 27th ACM international conference on multimedia, pp 2088–2096

  38. Wu H, Liu Y, Cai H, He S (2022) Learning transferable perturbations for image captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, pp 1–18

  39. Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6281–6290

  40. Yu R, Jin F, Qiao Z, Yuan Y, Wang G (2023) Multi-scale image-text matching network for scene and spatio-temporal images. Future Generation Computer Systems, pp 292–300

  41. Zhang K, Mao Z, Wang Q, Zhang Y (2022) Negative-aware attention framework for image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15661–15670

  42. Zhang Q, Lei Z, Zhang Z, Li S (2020) Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3536–3545

  43. Zhu J, Li Z, Zeng Y, Wei J, Ma H (2022) Image-text retrieval with fine-grained relational dependency and bidirectional attention-based generative networks. In: Proceedings of the 30th ACM international conference on multimedia, pp 395–403

Acknowledgements

This work is supported by the Natural Science Foundation of Shandong Province (Grant No. ZR2023MF031), the National Natural Science Foundation of China (No. 62102186, No. 62171209), and the Shandong Province Science and Technology-based Small and Medium-sized Enterprises Innovation Capability Enhancement Project (2023TSGC0877).

Author information

Authors and Affiliations

Authors

Contributions

Tao Yao: Visualization, Investigation, Funding acquisition. Shouyong Peng: Conceptualization, Methodology, Software, Implementation of the computer code and supporting algorithms, Writing - original draft. Lili Wang: Writing - review and editing. Ying Li: Software, Validation. Yujuan Sun: Writing - review and editing.

Corresponding author

Correspondence to Tao Yao.

Ethics declarations

Competing Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Ethics Approval

Not applicable.

Ethical Standards

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Ethical and Informed Consent for Data Used

Not applicable.

Ethical Standards

The authors declare that this manuscript is original, has not been published before, and is not currently being considered for publication elsewhere. The authors confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. The authors further confirm that the order of authors listed in the manuscript has been approved by all authors. The authors understand that the corresponding author is the sole contact for the editorial process and is responsible for communicating with the other authors about progress, submissions of revisions, and the final approval of proofs.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yao, T., Peng, S., Wang, L. et al. Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval. Appl Intell 54, 12230–12245 (2024). https://doi.org/10.1007/s10489-024-05823-1
