Syntactic-guided optimization of image–text matching for intra-modal modeling

The Journal of Supercomputing

Abstract

Image–text matching has become a research hotspot in multimodal matching. Existing image–text matching models suffer from feature redundancy when modeling the image modality: redundant regions dilute the weights of important regions and introduce irrelevant or distracting information. In addition, the lack of explicit syntactic information makes it difficult for the model to analyze the syntactic structure of the text, which weakens the dependencies captured between words. To address these problems, syntactic-guided optimization of image–text matching for intra-modal modeling (SGIM) is proposed. Image features are filtered from multiple views according to the differences in information richness across image regions, the importance weight of each region is adjusted, and the multi-view information is fused to obtain an enhanced image representation. A syntactic dependency enhancement method is also proposed based on the contextual relationships between words in the text; to avoid losing long-range textual context, it adjusts the attention distribution over the entire sentence. Experimental results show that SGIM improves the recall sum (Rsum) by at least 4.8%, 1.7%, and 2.1% over MAG, MSR, MMCA, SGRAF, and ReSG on the publicly available Flickr30K, MSCOCO 1K, and MSCOCO 5K datasets, respectively.
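As a rough, hypothetical illustration of the two mechanisms the abstract describes, the PyTorch sketch below shows (a) multi-view filtering and reweighting of image region features fused into an enhanced representation, and (b) an attention distribution over the whole sentence biased by syntactic dependencies. All module names, shapes, scoring and fusion formulas, and the dependency-bias term are assumptions made for this sketch; the actual SGIM architecture is defined in the paper itself.

```python
# A hypothetical sketch of the abstract's two mechanisms, NOT the paper's actual
# SGIM implementation: module names, shapes, and formulas are assumptions.
import torch
import torch.nn as nn


class MultiViewRegionFilter(nn.Module):
    """Reweights image region features from several views and fuses them."""

    def __init__(self, dim: int, num_views: int = 3):
        super().__init__()
        # One scoring head per view; each view rates how informative a region is.
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_views)])
        self.fuse = nn.Linear(dim * num_views, dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, dim) region features from an object detector.
        views = []
        for scorer in self.scorers:
            weights = torch.softmax(scorer(regions), dim=1)  # (B, R, 1) per-region weights
            views.append(weights * regions)                  # down-weight redundant regions
        fused = self.fuse(torch.cat(views, dim=-1))          # fuse multi-view information
        return fused + regions                               # residual keeps the original signal


def syntax_biased_attention(words: torch.Tensor, dep_adj: torch.Tensor,
                            bias: float = 1.0) -> torch.Tensor:
    """Self-attention over word features, biased toward syntactic dependents.

    words:   (batch, num_words, dim) word features, e.g. from a Bi-GRU.
    dep_adj: (batch, num_words, num_words) 0/1 dependency adjacency from a parser.
    """
    scores = words @ words.transpose(1, 2) / words.size(-1) ** 0.5  # (B, W, W)
    scores = scores + bias * dep_adj            # boost syntactically linked word pairs
    attn = torch.softmax(scores, dim=-1)        # distribution over the whole sentence,
    return attn @ words                         # so long-range context is not lost


if __name__ == "__main__":
    img = torch.randn(2, 36, 512)               # e.g. 36 detected regions per image
    txt = torch.randn(2, 12, 512)               # e.g. 12 words per caption
    adj = torch.eye(12).expand(2, 12, 12)       # placeholder dependency matrix
    print(MultiViewRegionFilter(512)(img).shape)    # torch.Size([2, 36, 512])
    print(syntax_biased_attention(txt, adj).shape)  # torch.Size([2, 12, 512])
```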


Data availability

The datasets were derived from public resources and are made available with this article.

References

  1. Yi Y, Tian Y, He C, Fan Y, Hu X, Xu Y (2023) Dbt: multimodal emotion recognition based on dual-branch transformer. J Supercomput 79(8):8611–8633

  2. Shi X, Yu Z, Wang X, Li Y, Niu Y (2023) Text-image matching for multi-model machine translation. J Supercomput 79(16):17810–17823

  3. Kayani M, Ghafoor A, Riaz MM (2023) Multi-modal text recognition and encryption in scanned document images. J Supercomput 79(7):7916–7936

  4. Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text matching. Proc AAAI Conf Artif Intell 35:1218–1226

  5. Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664

  6. Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539

  7. Ma L, Lu Z, Shang L, Li H (2015) Multimodal convolutional neural networks for matching image and sentence. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2623–2631

  8. Liu L, Gou T (2021) Cross-modal retrieval combining deep canonical correlation analysis and adversarial learning. Comput Sci 48:200–207

  9. Ge X, Chen F, Xu S, Tao F, Jose JM (2023) Cross-modal semantic enhanced interaction for image-sentence retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 1022–1031

  10. Wang Z, Xu X, Wei J, Xie N, Shao J, Yang Y (2023) Quaternion representation learning for cross-modal matching. Knowl-Based Syst 270:110505

  11. Xu G, Hu M, Wang X, Yang J, Li N, Zhang Q (2023) Location attention knowledge embedding model for image-text matching. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pp 408–421. Springer

  12. Cheng Y, Zhu X, Qian J, Wen F, Liu P (2022) Cross-modal graph matching network for image-text retrieval. ACM Trans Multimed Comput Commun Appl (TOMM) 18(4):1–23

  13. Li K, Zhang Y, Li K, Li Y, Fu Y (2022) Image-text embedding learning via visual and textual semantic reasoning. IEEE Trans Pattern Anal Mach Intell 45(1):641–656

  14. Li Y, Yao T, Zhang L, Sun Y, Fu H (2024) Image-text matching algorithm based on multi-level semantic alignment. J Beijing Univ Aeronaut Astronaut 50:551–558. https://doi.org/10.13700/j.bh.1001-5965.2022.0385

  15. Zhang H, Mao Z, Zhang K, Zhang Y (2022) Show your faith: cross-modal confidence-aware network for image-text matching. Proc AAAI Conf Artif Intell 36:3262–3270

  16. Qin X, Li L, Pang G (2024) Multi-scale motivated neural network for image-text matching. Multimed Tools Appl 83(2):4383–4407

  17. Zhang K, Hu B, Zhang H, Li Z, Mao Z (2023) Enhanced semantic similarity learning framework for image-text matching. IEEE Trans Circuits Syst Video Technol 34:2973–2988

  18. Yao T, Li Y, Li Y, Zhu Y, Wang G, Yue J (2023) Cross-modal semantically augmented network for image-text matching. ACM Trans Multimed Comput Commun Appl 20(4):1–18

  19. Jiang D, Ye M (2023) Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2787–2797

  20. Chen C, Zhang B, Cao L, Shen J, Gunter T, Jose AM, Toshev A, Shlens J, Pang R, Yang Y (2023) Stair: Learning sparse text and image representation in grounded tokens. arXiv preprint arXiv:2301.13081

  21. Liu X, He Y, Cheung Y-M, Xu X, Wang N (2022) Learning relationship-enhanced semantic graph for fine-grained image-text matching. IEEE Trans Cybern 54(2):948–961

  22. Shang H, Zhao G, Shi J, Qian X (2023) Multi-view text imagination network based on latent alignment for image-text matching. IEEE Intell Syst 38(3):41–50

  23. Fu Z, Mao Z, Song Y, Zhang Y (2023) Learning semantic relationship among instances for image-text matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 15159–15168

  24. Long S, Han SC, Wan X, Poon J (2022) Gradual: Graph-based dual-modal representation for image-text matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3459–3468

  25. Pei J, Zhong K, Yu Z, Wang L, Lakshmanna K (2023) Scene graph semantic inference for image and text matching. ACM Trans Asian Low-Resour Lang Inf Process 22(5):1–23

  26. Dong X, Zhang H, Zhu L, Nie L, Liu L (2022) Hierarchical feature aggregation based on transformer for image-text matching. IEEE Trans Circuits Syst Video Technol 32(9):6437–6447

  27. Lee K-H, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 201–216

  28. Salini Y, Eswaraiah P, Brahmam MV, Sirisha U (2023) Word embedding for text classification: efficient CNN and Bi-GRU fusion multi-attention mechanism. EAI Endorsed Trans Scalable Inf Syst 10(6):66

  29. Zhou J, Zhao H (2019) Head-driven phrase structure grammar parsing on Penn Treebank. In: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, Volume 1: Long Papers, pp 2396–2408

  30. Gong D, Chen H, Chen S, Bao Y, Ding G (2021) Matching with agreement for cross-modal image-text retrieval. CAAI Trans Intell Syst 16:1143–1150

  31. Sun H, Qin X, Liu X (2023) Image-text matching using multi-subspace joint representation. Multimed Syst 29(3):1057–1071

  32. Wei X, Zhang T, Li Y, Zhang Y, Wu F (2020) Multi-modality cross attention network for image and sentence matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10941–10950

  33. Liu X, He Y, Cheung Y, Xu X, Wang N (2024) Learning relationship-enhanced semantic graph for fine-grained image-text matching. IEEE Trans Cybern 54(2):948–961

Acknowledgements

The authors are grateful to the anonymous reviewers and editors for their insightful comments and suggestions, which have greatly improved the quality of this paper.

Author information

Contributions

Di Wu and Le Zhang wrote the main manuscript, and Yao Chen carried out the experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Di Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Consent for publication

Publication consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, D., Zhang, L. & Chen, Y. Syntactic-guided optimization of image–text matching for intra-modal modeling. J Supercomput 81, 367 (2025). https://doi.org/10.1007/s11227-024-06840-0
