
SAM: cross-modal semantic alignments module for image-text retrieval

Multimedia Tools and Applications

Abstract

Cross-modal image-text retrieval has gained increasing attention due to its ability to combine computer vision with natural language processing. In previous work, image and text features were extracted, concatenated, and fed into a transformer-based retrieval network. However, such approaches align the image and text modalities only implicitly, since the self-attention mechanism computes attention coefficients over all input features. In this paper, we propose the cross-modal Semantic Alignments Module (SAM), which establishes an explicit alignment by enhancing the inter-modal relationship. First, visual and textual representations are extracted from an image-text pair. Second, we construct a bipartite graph in which the image regions and the words of the sentence are nodes and the relationships between them are edges. Our proposed SAM then lets the model compute attention coefficients based on the edges of the graph, which explicitly aligns the two modalities. Finally, a binary classifier determines whether the given image-text pair is aligned. We report extensive experiments on the MS-COCO and Flickr30K test sets, showing that SAM captures a joint representation of the two modalities and can be applied to existing retrieval networks.
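The mechanism outlined in the abstract (attention coefficients restricted to the edges of a region-word bipartite graph, followed by a binary alignment classifier) can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the class name, feature dimension, and the random toy graph below are assumptions made purely for illustration; only the edge-masking idea reflects the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BipartiteCrossModalAttention(nn.Module):
    """Toy edge-masked cross-modal attention: each word attends only to the
    image regions it is connected to in the bipartite graph."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects word (text) nodes to queries
        self.k = nn.Linear(dim, dim)   # projects region (visual) nodes to keys
        self.v = nn.Linear(dim, dim)   # projects region nodes to values
        self.scale = dim ** -0.5

    def forward(self, words, regions, edge_mask):
        # words:     (n_words, dim)      textual node features
        # regions:   (n_regions, dim)    visual node features
        # edge_mask: (n_words, n_regions) bool, True where a graph edge exists
        scores = self.q(words) @ self.k(regions).T * self.scale
        scores = scores.masked_fill(~edge_mask, float("-inf"))  # drop non-edges
        attn = F.softmax(scores, dim=-1)     # coefficients computed only over edges
        attn = torch.nan_to_num(attn)        # words with no edges get zero output
        return attn @ self.v(regions)        # word features enriched by aligned regions

# Toy usage: 4 words, 6 regions, a random bipartite graph, and a binary
# alignment head on the pooled joint representation.
words, regions = torch.randn(4, 256), torch.randn(6, 256)
edge_mask = torch.rand(4, 6) > 0.5
sam = BipartiteCrossModalAttention(256)
joint = sam(words, regions, edge_mask)                  # (4, 256)
alignment_head = nn.Linear(256, 1)
p_aligned = torch.sigmoid(alignment_head(joint.mean(dim=0)))
```

By contrast, plain self-attention over the concatenated region and word features would correspond to an all-ones mask, which is the implicit alignment the abstract argues against.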


Data availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.


Acknowledgements

This research was supported in part by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) under Grant No. 2021-0-01341 (Artificial Intelligence Graduate School Program (Chung-Ang University), contribution rate: 33%), by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) under Grant No. NRF-2022R1C1C1008534 (contribution rate: 33%), and by the Culture Technology R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2021 (Project Name: A Specialist Training of Content R&D based on Virtual Production, Project Number: R2021040044, contribution rate: 34%).

Author information

Corresponding author

Correspondence to Youngbin Kim.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Park, P., Jang, S., Cho, Y. et al. SAM: cross-modal semantic alignments module for image-text retrieval. Multimed Tools Appl 83, 12363–12377 (2024). https://doi.org/10.1007/s11042-023-15798-9

