
Heterogeneous memory enhanced graph reasoning network for cross-modal retrieval

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Cross-modal retrieval (CMR) aims to retrieve instances of one modality that are relevant to a query from another modality, and has drawn much attention because of its importance in bridging vision and language. A key to the success of CMR is learning discriminative and robust representations for both visual and textual instances so as to reduce the heterogeneity gap between modalities. In this paper, we address this challenging issue by proposing a heterogeneous memory enhanced graph reasoning network, named HMGR, to model the semantic correlations between vision and language. On the one hand, we design a novel dual-path network architecture that generates relationship-enhanced global representations by performing modality-specific graph reasoning on the local features extracted for each instance. In this way, the topological interdependencies among intra-instance local fragments, both visual and textual, are fully mined to reach a deeper semantic understanding of the relationships between them. On the other hand, we exploit inter-instance semantically correlated knowledge to enhance the discriminability of the final learned representations by introducing a joint heterogeneous memory network that iteratively stores both visual and textual instance-level information. By interacting with this long-term contextual multimodal knowledge, a shared latent feature space that mitigates the heterogeneity gap across modalities can be learned. Extensive experiments on three benchmark datasets, covering both image-text and video-text retrieval, demonstrate the effectiveness of the proposed method.
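The paper itself provides the exact layer definitions and hyperparameters; as a rough illustration only, the PyTorch-style sketch below mirrors the two ideas the abstract describes: modality-specific graph reasoning over intra-instance local fragments, and a joint memory whose stored slots inject inter-instance contextual knowledge into each global representation. All module names, dimensions, and the slot count here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch under assumed shapes: local fragments are region/word features of
# dimension `dim`; module names and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphReasoning(nn.Module):
    """Modality-specific graph reasoning over intra-instance local fragments.

    Builds a fully connected similarity graph over the fragments, propagates
    information along its edges (one GCN-style step), then pools the
    relationship-enhanced fragments into a global representation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, fragments: torch.Tensor) -> torch.Tensor:
        # fragments: (batch, n_fragments, dim), e.g. region or word features
        adj = torch.softmax(
            self.query(fragments) @ self.key(fragments).transpose(1, 2)
            / fragments.size(-1) ** 0.5,
            dim=-1,
        )                                                        # (batch, n, n) edge weights
        enhanced = F.relu(self.update(adj @ fragments)) + fragments  # message passing + residual
        return enhanced.mean(dim=1)                              # pooled global vector


class HeterogeneousMemory(nn.Module):
    """Joint memory shared by both modalities.

    A global representation attends over the memory slots, and the retrieved
    long-term contextual knowledge is fused back into the representation.
    """

    def __init__(self, dim: int, n_slots: int = 512):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, dim) * 0.02)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) global representation from either modality
        attn = torch.softmax(query @ self.slots.t(), dim=-1)     # (batch, n_slots)
        context = attn @ self.slots                              # retrieved knowledge
        return self.fuse(torch.cat([query, context], dim=-1))    # memory-enhanced vector


if __name__ == "__main__":
    dim = 1024
    visual_path, textual_path = GraphReasoning(dim), GraphReasoning(dim)
    memory = HeterogeneousMemory(dim)
    regions = torch.randn(8, 36, dim)    # e.g. bottom-up region features (assumed shape)
    words = torch.randn(8, 20, dim)      # e.g. word features projected to dim (assumed shape)
    img = memory(visual_path(regions))
    txt = memory(textual_path(words))
    sim = F.cosine_similarity(img, txt)  # similarity in the shared latent space
    print(sim.shape)                     # torch.Size([8])
```

In such a design, the two paths never share graph-reasoning weights (the reasoning is modality-specific), while the memory is shared so that both modalities interact with the same pool of long-term contextual knowledge before being compared in the latent space.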



Acknowledgements

This work was supported by the Natural Science Foundation of Tianjin (Grant No. 19JCYBJC16000) and the National Natural Science Foundation of China (Grant No. 61771329).

Author information


Correspondence to Zhong Ji or Yuqing He.


About this article


Cite this article

Ji, Z., Chen, K., He, Y. et al. Heterogeneous memory enhanced graph reasoning network for cross-modal retrieval. Sci. China Inf. Sci. 65, 172104 (2022). https://doi.org/10.1007/s11432-021-3367-y

