Abstract
Image–text matching has become a research hotspot in multimodal matching. Existing image–text matching models suffer from feature redundancy when modeling the image modality: redundant information dilutes the weights of important regions and introduces irrelevant, distracting content. In addition, because these models lack syntactic information, they struggle to analyze the syntactic structure of the text, which weakens the dependencies captured between words. To address these problems, syntactic-guided optimization of image–text matching for intra-modal modeling (SGIM) is proposed. Image features are filtered from multiple views according to the differences in information richness across image regions, and the importance weight of each region is adjusted accordingly; the multi-view information is then fused to obtain an enhanced image representation. For the text, a syntactic dependency enhancement method based on the contextual relationships between words is proposed, and the attention distribution over the entire sentence is adjusted to avoid losing long-range textual context. Experimental results on the publicly available Flickr30K, MSCOCO 1K, and MSCOCO 5K datasets show that SGIM improves recall sum (Rsum) by at least 4.8%, 1.7%, and 2.1%, respectively, compared with MAG, MSR, MMCA, SGRAF, and ReSG.
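To make the two intra-modal mechanisms summarized above concrete, the sketch below shows one plausible PyTorch realization: a multi-view filter that rescores and fuses image-region features, and a sentence-level attention whose logits are biased by syntactic dependency edges. This is a minimal illustration, not the authors' released code; the module names, the number of views, the residual fusion, and the learned scalar syntax bias are all assumptions made for exposition.

```python
import torch
import torch.nn as nn


class MultiViewRegionFilter(nn.Module):
    """Reweights image-region features from several learned 'views' and
    fuses the views into one enhanced representation (hypothetical design)."""

    def __init__(self, dim: int, n_views: int = 4):
        super().__init__()
        # One scoring head per view estimates each region's information richness.
        self.scorers = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_views))
        self.fuse = nn.Linear(n_views * dim, dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, dim), e.g. pre-extracted detector features
        views = []
        for scorer in self.scorers:
            weights = torch.softmax(scorer(regions), dim=1)  # importance per region
            views.append(weights * regions)                  # down-weight redundant regions
        fused = self.fuse(torch.cat(views, dim=-1))          # fuse multi-view information
        return fused + regions                               # residual keeps the original signal


class SyntaxBiasedAttention(nn.Module):
    """Self-attention whose logits are shifted toward syntactic-dependency
    edges; the softmax still spans the whole sentence, so long-range
    context is preserved (hypothetical design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.bias = nn.Parameter(torch.tensor(1.0))  # learned strength of the syntax prior

    def forward(self, words: torch.Tensor, dep_adj: torch.Tensor) -> torch.Tensor:
        # words: (batch, n_words, dim); dep_adj: (batch, n_words, n_words),
        # dep_adj[b, i, j] = 1 when words i and j share a dependency edge.
        scale = words.size(-1) ** 0.5
        logits = self.q(words) @ self.k(words).transpose(-2, -1) / scale
        logits = logits + self.bias * dep_adj        # favor syntactically related words
        attn = torch.softmax(logits, dim=-1)         # attention over the entire sentence
        return attn @ self.v(words)
```

In such a design, the dependency adjacency dep_adj would come from an off-the-shelf parser (for example, spaCy), and biasing the attention logits rather than masking them keeps the softmax over the whole sentence, which is one way the long-range context mentioned in the abstract could be retained.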
Data availability
The datasets were derived from public resources and are made available with the article.
References
Yi Y, Tian Y, He C, Fan Y, Hu X, Xu Y (2023) DBT: multimodal emotion recognition based on dual-branch transformer. J Supercomput 79(8):8611–8633
Shi X, Yu Z, Wang X, Li Y, Niu Y (2023) Text-image matching for multi-model machine translation. J Supercomput 79(16):17810–17823
Kayani M, Ghafoor A, Riaz MM (2023) Multi-modal text recognition and encryption in scanned document images. J Supercomput 79(7):7916–7936
Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text matching. Proc AAAI Conf Artif Intell 35:1218–1226
Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664
Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539
Ma L, Lu Z, Shang L, Li H (2015) Multimodal convolutional neural networks for matching image and sentence. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2623–2631
Liu L, Gou T (2021) Cross-modal retrieval combining deep canonical correlation analysis and adversarial learning. Comput Sci 48:200–207
Ge X, Chen F, Xu S, Tao F, Jose JM (2023) Cross-modal semantic enhanced interaction for image-sentence retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 1022–1031
Wang Z, Xu X, Wei J, Xie N, Shao J, Yang Y (2023) Quaternion representation learning for cross-modal matching. Knowl-Based Syst 270:110505
Xu G, Hu M, Wang X, Yang J, Li N, Zhang Q (2023) Location attention knowledge embedding model for image-text matching. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, pp 408–421
Cheng Y, Zhu X, Qian J, Wen F, Liu P (2022) Cross-modal graph matching network for image-text retrieval. ACM Trans Multimed Comput Commun Appl (TOMM) 18(4):1–23
Li K, Zhang Y, Li K, Li Y, Fu Y (2022) Image-text embedding learning via visual and textual semantic reasoning. IEEE Trans Pattern Anal Mach Intell 45(1):641–656
Li Y, Yao T, Zhang L, Sun Y, Fu H (2024) Image-text matching algorithm based on multi-level semantic alignment. J Beijing Univ Aeronaut Astronaut 50:551–558. https://doi.org/10.13700/j.bh.1001-5965.2022.0385
Zhang H, Mao Z, Zhang K, Zhang Y (2022) Show your faith: cross-modal confidence-aware network for image-text matching. Proc AAAI Conf Artif Intell 36:3262–3270
Qin X, Li L, Pang G (2024) Multi-scale motivated neural network for image-text matching. Multimed Tools Appl 83(2):4383–4407
Zhang K, Hu B, Zhang H, Li Z, Mao Z (2023) Enhanced semantic similarity learning framework for image-text matching. IEEE Trans Circuits Syst Video Technol 34:2973–2988
Yao T, Li Y, Li Y, Zhu Y, Wang G, Yue J (2023) Cross-modal semantically augmented network for image-text matching. ACM Trans Multimed Comput Commun Appl 20(4):1–18
Jiang D, Ye M (2023) Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2787–2797
Chen C, Zhang B, Cao L, Shen J, Gunter T, Jose AM, Toshev A, Shlens J, Pang R, Yang Y (2023) STAIR: learning sparse text and image representation in grounded tokens. arXiv preprint arXiv:2301.13081
Liu X, He Y, Cheung Y-M, Xu X, Wang N (2024) Learning relationship-enhanced semantic graph for fine-grained image-text matching. IEEE Trans Cybern 54(2):948–961
Shang H, Zhao G, Shi J, Qian X (2023) Multi-view text imagination network based on latent alignment for image-text matching. IEEE Intell Syst 38(3):41–50
Fu Z, Mao Z, Song Y, Zhang Y (2023) Learning semantic relationship among instances for image-text matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 15159–15168
Long S, Han SC, Wan X, Poon J (2022) GraDual: graph-based dual-modal representation for image-text matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3459–3468
Pei J, Zhong K, Yu Z, Wang L, Lakshmanna K (2023) Scene graph semantic inference for image and text matching. ACM Trans Asian Low-Resour Lang Inf Process 22(5):1–23
Dong X, Zhang H, Zhu L, Nie L, Liu L (2022) Hierarchical feature aggregation based on transformer for image-text matching. IEEE Trans Circuits Syst Video Technol 32(9):6437–6447
Lee K-H, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 201–216
Salini Y, Eswaraiah P, Brahmam MV, Sirisha U (2023) Word embedding for text classification: efficient CNN and Bi-GRU fusion multi-attention mechanism. EAI Endorsed Trans Scalable Inf Syst 10(6):66
Zhou J, Zhao H (2019) Head-driven phrase structure grammar parsing on Penn Treebank. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, pp 2396–2408
Gong D, Chen H, Chen S, Bao Y, Ding G (2021) Matching with agreement for cross-modal image-text retrieval. CAAI Trans Intell Syst 16:1143–1150
Sun H, Qin X, Liu X (2023) Image-text matching using multi-subspace joint representation. Multimed Syst 29(3):1057–1071
Wei X, Zhang T, Li Y, Zhang Y, Wu F (2020) Multi-modality cross attention network for image and sentence matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10941–10950
Acknowledgements
The authors thank the anonymous reviewers and editors for their insightful comments and suggestions, which have greatly improved the quality of this paper.
Author information
Authors and Affiliations
Contributions
Di Wu and Le Zhang wrote the main manuscript, and Yao Chen realized the experimental part. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to participate
Informed consent was obtained from all individual participants included in the study.
Consent for publication
Publication consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, D., Zhang, L. & Chen, Y. Syntactic-guided optimization of image–text matching for intra-modal modeling. J Supercomput 81, 367 (2025). https://doi.org/10.1007/s11227-024-06840-0
Accepted:
Published:
DOI: https://doi.org/10.1007/s11227-024-06840-0