Abstract
Image–text matching has become a research hotspot in multimodal matching. Existing image–text matching models suffer from feature redundancy when modeling the image modality: redundant information dilutes the weights of important regions and introduces irrelevant, distracting content. In addition, because these models lack syntactic information, they struggle to analyze the syntactic structure of the text, which weakens the dependencies captured between words. To address these problems, syntactic-guided optimization of image–text matching for intra-modal modeling (SGIM) is proposed. Image features are filtered from multiple views according to the differences in information richness across image regions, and the importance weight of each region is adjusted accordingly; the multi-view information is then fused to obtain an enhanced image representation. For the text, a syntactic dependency enhancement method based on the contextual relationships between words is proposed, and the attention distribution over the entire sentence is adjusted to avoid losing long-range textual context. Experimental results on the publicly available Flickr30K, MSCOCO 1K, and MSCOCO 5K datasets show that SGIM improves recall sum (Rsum) by at least 4.8%, 1.7%, and 2.1%, respectively, compared with MAG, MSR, MMCA, SGRAF, and ReSG.
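To make the two intra-modal mechanisms summarized above concrete, the sketch below shows one plausible PyTorch realization: a multi-view filter that rescores and fuses image-region features, and a sentence-level attention whose logits are biased by syntactic dependency edges. This is a minimal illustration, not the authors' released code; the module names, the number of views, the residual fusion, and the learned scalar syntax bias are all assumptions made for exposition.

```python
import torch
import torch.nn as nn


class MultiViewRegionFilter(nn.Module):
    """Reweights image-region features from several learned 'views' and
    fuses the views into one enhanced representation (hypothetical design)."""

    def __init__(self, dim: int, n_views: int = 4):
        super().__init__()
        # One scoring head per view estimates each region's information richness.
        self.scorers = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_views))
        self.fuse = nn.Linear(n_views * dim, dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, dim), e.g. pre-extracted detector features
        views = []
        for scorer in self.scorers:
            weights = torch.softmax(scorer(regions), dim=1)  # importance per region
            views.append(weights * regions)                  # down-weight redundant regions
        fused = self.fuse(torch.cat(views, dim=-1))          # fuse multi-view information
        return fused + regions                               # residual keeps the original signal


class SyntaxBiasedAttention(nn.Module):
    """Self-attention whose logits are shifted toward syntactic-dependency
    edges; the softmax still spans the whole sentence, so long-range
    context is preserved (hypothetical design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.bias = nn.Parameter(torch.tensor(1.0))  # learned strength of the syntax prior

    def forward(self, words: torch.Tensor, dep_adj: torch.Tensor) -> torch.Tensor:
        # words: (batch, n_words, dim); dep_adj: (batch, n_words, n_words),
        # dep_adj[b, i, j] = 1 when words i and j share a dependency edge.
        scale = words.size(-1) ** 0.5
        logits = self.q(words) @ self.k(words).transpose(-2, -1) / scale
        logits = logits + self.bias * dep_adj        # favor syntactically related words
        attn = torch.softmax(logits, dim=-1)         # attention over the entire sentence
        return attn @ self.v(words)
```

In such a design, the dependency adjacency dep_adj would come from an off-the-shelf parser (for example, spaCy), and biasing the attention logits rather than masking them keeps the softmax over the whole sentence, which is one way the long-range context mentioned in the abstract could be retained.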
Data availability
The datasets were derived from public resources and are made available with the article.
References
Yi Y, Tian Y, He C, Fan Y, Hu X, Xu Y (2023) DBT: multimodal emotion recognition based on dual-branch transformer. J Supercomput 79(8):8611–8633
Shi X, Yu Z, Wang X, Li Y, Niu Y (2023) Text-image matching for multi-model machine translation. J Supercomput 79(16):17810–17823
Kayani M, Ghafoor A, Riaz MM (2023) Multi-modal text recognition and encryption in scanned document images. J Supercomput 79(7):7916–7936
Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text matching. Proc AAAI Conf Artif Intell 35:1218–1226
Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664
Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539
Ma L, Lu Z, Shang L, Li H (2015) Multimodal convolutional neural networks for matching image and sentence. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2623–2631
Liu L, Gou T (2021) Cross-modal retrieval combining deep canonical correlation analysis and adversarial learning. Comput Sci 48:200–207
Ge X, Chen F, Xu S, Tao F, Jose JM (2023) Cross-modal semantic enhanced interaction for image-sentence retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 1022–1031
Wang Z, Xu X, Wei J, Xie N, Shao J, Yang Y (2023) Quaternion representation learning for cross-modal matching. Knowl-Based Syst 270:110505
Xu G, Hu M, Wang X, Yang J, Li N, Zhang Q (2023) Location attention knowledge embedding model for image-text matching. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, pp 408–421
Cheng Y, Zhu X, Qian J, Wen F, Liu P (2022) Cross-modal graph matching network for image-text retrieval. ACM Trans Multimed Comput Commun Appl (TOMM) 18(4):1–23
Li K, Zhang Y, Li K, Li Y, Fu Y (2022) Image-text embedding learning via visual and textual semantic reasoning. IEEE Trans Pattern Anal Mach Intell 45(1):641–656
Li Y, Yao T, Zhang L, Sun Y, Fu H (2024) Image-text matching algorithm based on multi-level semantic alignment. J Beijing Univ Aeronaut Astronaut 50:551–558. https://doi.org/10.13700/j.bh.1001-5965.2022.0385
Zhang H, Mao Z, Zhang K, Zhang Y (2022) Show your faith: cross-modal confidence-aware network for image-text matching. Proc AAAI Conf Artif Intell 36:3262–3270
Qin X, Li L, Pang G (2024) Multi-scale motivated neural network for image-text matching. Multimed Tools Appl 83(2):4383–4407
Zhang K, Hu B, Zhang H, Li Z, Mao Z (2023) Enhanced semantic similarity learning framework for image-text matching. IEEE Trans Circuits Syst Video Technol 34:2973–2988
Yao T, Li Y, Li Y, Zhu Y, Wang G, Yue J (2023) Cross-modal semantically augmented network for image-text matching. ACM Trans Multimed Comput Commun Appl 20(4):1–18
Jiang D, Ye M (2023) Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2787–2797
Chen C, Zhang B, Cao L, Shen J, Gunter T, Jose AM, Toshev A, Shlens J, Pang R, Yang Y (2023) STAIR: learning sparse text and image representation in grounded tokens. arXiv preprint arXiv:2301.13081
Liu X, He Y, Cheung Y-M, Xu X, Wang N (2024) Learning relationship-enhanced semantic graph for fine-grained image-text matching. IEEE Trans Cybern 54(2):948–961
Shang H, Zhao G, Shi J, Qian X (2023) Multi-view text imagination network based on latent alignment for image-text matching. IEEE Intell Syst 38(3):41–50
Fu Z, Mao Z, Song Y, Zhang Y (2023) Learning semantic relationship among instances for image-text matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 15159–15168
Long S, Han SC, Wan X, Poon J (2022) GraDual: graph-based dual-modal representation for image-text matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3459–3468
Pei J, Zhong K, Yu Z, Wang L, Lakshmanna K (2023) Scene graph semantic inference for image and text matching. ACM Trans Asian Low-Resour Lang Inf Process 22(5):1–23
Dong X, Zhang H, Zhu L, Nie L, Liu L (2022) Hierarchical feature aggregation based on transformer for image-text matching. IEEE Trans Circuits Syst Video Technol 32(9):6437–6447
Lee K-H, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 201–216
Salini Y, Eswaraiah P, Brahmam MV, Sirisha U (2023) Word embedding for text classification: efficient CNN and Bi-GRU fusion multi-attention mechanism. EAI Endorsed Trans Scalable Inf Syst 10(6):66
Zhou J, Zhao H (2019) Head-driven phrase structure grammar parsing on Penn Treebank. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, pp 2396–2408
Gong D, Chen H, Chen S, Bao Y, Ding G (2021) Matching with agreement for cross-modal image-text retrieval. CAAI Trans Intell Syst 16:1143–1150
Sun H, Qin X, Liu X (2023) Image-text matching using multi-subspace joint representation. Multimed Syst 29(3):1057–1071
Wei X, Zhang T, Li Y, Zhang Y, Wu F (2020) Multi-modality cross attention network for image and sentence matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10941–10950
Acknowledgements
The authors thank the anonymous reviewers and editors for their insightful comments and suggestions, which have greatly improved the quality of this paper.
Author information
Authors and Affiliations
Contributions
Di Wu and Le Zhang wrote the main manuscript, and Yao Chen realized the experimental part. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to participate
Informed consent was obtained from all individual participants included in the study.
Consent for publication
Publication consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, D., Zhang, L. & Chen, Y. Syntactic-guided optimization of image–text matching for intra-modal modeling. J Supercomput 81, 367 (2025). https://doi.org/10.1007/s11227-024-06840-0
Accepted:
Published:
DOI: https://doi.org/10.1007/s11227-024-06840-0