
Cross-modal alignment with graph reasoning for image-text retrieval

Multimedia Tools and Applications

Abstract

The image-text retrieval task has received considerable attention in artificial intelligence research. It remains challenging because image and text are heterogeneous cross-modal data. The key issue in image-text retrieval is how to learn a common feature space while preserving the semantic correspondence between image and text. Existing works fail to obtain fine cross-modal feature representations because the semantic relations between local features are not effectively exploited and noisy information is not suppressed. To address these issues, we propose a Cross-modal Alignment with Graph Reasoning (CAGR) model, which learns refined cross-modal features in the common feature space and then performs fine-grained cross-modal alignment. Specifically, we introduce a graph reasoning module that explores the semantic connections among local elements in each modality and measures their importance with a self-attention mechanism. Through multi-step reasoning, the visual and textual semantic graphs are learned effectively and refined visual and textual features are obtained. Finally, to measure the similarity between an image and a text, a novel alignment approach named cross-modal attentional fine-grained alignment computes a similarity score between the two sets of features. Our model achieves competitive performance compared with state-of-the-art methods on the Flickr30K and MS-COCO datasets, and extensive experiments demonstrate its effectiveness.
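The abstract names two computational components: multi-step graph reasoning with self-attention over local features, and cross-modal attentional fine-grained alignment between the two refined feature sets. The PyTorch sketch below illustrates one plausible reading of these components; every name and choice in it (the GraphReasoning module, the fine_grained_alignment function, the residual update rule, the softmax temperature, the feature dimensions) is an illustrative assumption, not the paper's exact CAGR formulation.

```python
# A minimal sketch of the two components named in the abstract, assuming a
# SCAN-style attention alignment; the paper's exact formulation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphReasoning(nn.Module):
    """Multi-step reasoning over a semantic graph of local features
    (image regions or words); self-attention weights node importance."""

    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_nodes, dim) local features of one modality.
        for _ in range(self.steps):
            # Build the affinity graph with scaled dot-product self-attention.
            scores = self.query(x) @ self.key(x).transpose(1, 2)
            graph = F.softmax(scores / x.size(-1) ** 0.5, dim=-1)
            # Propagate neighbour information and refine the node features.
            x = x + F.relu(self.update(graph @ x))
        return x


def fine_grained_alignment(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """Attend each word to the image regions, then average the word-level
    cosine similarities into one image-text score.
    regions: (n_regions, dim); words: (n_words, dim) for a single pair."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    attn = F.softmax(words @ regions.t() / 0.1, dim=-1)  # (n_words, n_regions)
    context = attn @ regions  # attended region context per word
    return F.cosine_similarity(words, context, dim=-1).mean()


if __name__ == "__main__":
    # Separate reasoning modules for the visual and textual semantic graphs.
    v_graph, t_graph = GraphReasoning(256), GraphReasoning(256)
    regions, words = torch.randn(1, 36, 256), torch.randn(1, 12, 256)
    score = fine_grained_alignment(v_graph(regions)[0], t_graph(words)[0])
    print(f"similarity: {score.item():.4f}")
```

The 36 regions mirror the common bottom-up attention setup for image-text matching; in a full system the refined features would be trained end to end with a ranking loss over matched and mismatched image-text pairs.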





Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2021ZD0111900) and the Natural Science Foundation of China (U21B2038, U1811463, U19B2039).

Author information

Corresponding author

Correspondence to Yongli Hu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cui, Z., Hu, Y., Sun, Y. et al. Cross-modal alignment with graph reasoning for image-text retrieval. Multimed Tools Appl 81, 23615–23632 (2022). https://doi.org/10.1007/s11042-022-12444-8

