
Heterogeneous memory enhanced graph reasoning network for cross-modal retrieval

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Cross-modal retrieval (CMR) aims to retrieve instances of one modality that are relevant to a query from another modality, and has drawn much attention because of its importance in bridging vision and language. A key to the success of CMR is learning discriminative and robust representations for both visual and textual instances so as to reduce the heterogeneity gap between modalities. In this paper, we address this challenging issue by proposing a heterogeneous memory enhanced graph reasoning network, named HMGR, to model the semantic correlations between vision and language. On the one hand, we design a novel dual-path network architecture that generates relationship-enhanced global representations by performing modality-specific graph reasoning on the local features extracted for each instance. In this way, the topological interdependencies among intra-instance local fragments, both visual and textual, are fully mined to reach a deeper semantic understanding of the relationships between them. On the other hand, we exploit inter-instance semantically correlated knowledge to enhance the discriminability of the final learned representations by introducing a joint heterogeneous memory network that iteratively stores both visual and textual instance-level information. By interacting with this long-term contextual multimodal knowledge, a shared latent feature space that mitigates the heterogeneity gap across modalities can be learned. Extensive experiments on three benchmark datasets, covering both image-text and video-text retrieval, demonstrate the effectiveness of the proposed method.
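The paper itself provides the exact layer definitions and hyperparameters; as a rough illustration only, the PyTorch-style sketch below mirrors the two ideas the abstract describes: modality-specific graph reasoning over intra-instance local fragments, and a joint memory whose stored slots inject inter-instance contextual knowledge into each global representation. All module names, dimensions, and the slot count here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch under assumed shapes: local fragments are region/word features of
# dimension `dim`; module names and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphReasoning(nn.Module):
    """Modality-specific graph reasoning over intra-instance local fragments.

    Builds a fully connected similarity graph over the fragments, propagates
    information along its edges (one GCN-style step), then pools the
    relationship-enhanced fragments into a global representation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, fragments: torch.Tensor) -> torch.Tensor:
        # fragments: (batch, n_fragments, dim), e.g. region or word features
        adj = torch.softmax(
            self.query(fragments) @ self.key(fragments).transpose(1, 2)
            / fragments.size(-1) ** 0.5,
            dim=-1,
        )                                                        # (batch, n, n) edge weights
        enhanced = F.relu(self.update(adj @ fragments)) + fragments  # message passing + residual
        return enhanced.mean(dim=1)                              # pooled global vector


class HeterogeneousMemory(nn.Module):
    """Joint memory shared by both modalities.

    A global representation attends over the memory slots, and the retrieved
    long-term contextual knowledge is fused back into the representation.
    """

    def __init__(self, dim: int, n_slots: int = 512):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, dim) * 0.02)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) global representation from either modality
        attn = torch.softmax(query @ self.slots.t(), dim=-1)     # (batch, n_slots)
        context = attn @ self.slots                              # retrieved knowledge
        return self.fuse(torch.cat([query, context], dim=-1))    # memory-enhanced vector


if __name__ == "__main__":
    dim = 1024
    visual_path, textual_path = GraphReasoning(dim), GraphReasoning(dim)
    memory = HeterogeneousMemory(dim)
    regions = torch.randn(8, 36, dim)    # e.g. bottom-up region features (assumed shape)
    words = torch.randn(8, 20, dim)      # e.g. word features projected to dim (assumed shape)
    img = memory(visual_path(regions))
    txt = memory(textual_path(words))
    sim = F.cosine_similarity(img, txt)  # similarity in the shared latent space
    print(sim.shape)                     # torch.Size([8])
```

In such a design, the two paths never share graph-reasoning weights (the reasoning is modality-specific), while the memory is shared so that both modalities interact with the same pool of long-term contextual knowledge before being compared in the latent space.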



Acknowledgements

This work was supported by the Natural Science Foundation of Tianjin (Grant No. 19JCYBJC16000) and the National Natural Science Foundation of China (Grant No. 61771329).

Author information


Correspondence to Zhong Ji or Yuqing He.


About this article


Cite this article

Ji, Z., Chen, K., He, Y. et al. Heterogeneous memory enhanced graph reasoning network for cross-modal retrieval. Sci. China Inf. Sci. 65, 172104 (2022). https://doi.org/10.1007/s11432-021-3367-y

