Abstract
Cross-modal image-text retrieval aims to retrieve images given query texts and vice versa, a challenging task due to the inherent heterogeneity gap between computer vision and natural language processing. Most previous methods mine intra-modal and inter-modal interactions independently, which can lead to a fragmented understanding of the visual-linguistic modalities. In contrast, this paper addresses the challenge with a unified multi-modal Co-Occurrence transformer Reasoning Network, dubbed COREN, which comprehensively discovers the semantic correlations between the two modalities. Specifically, we employ a unified multi-modal transformer encoder and decompose the reasoning of intra-modal and inter-modal co-occurrence relationships into a two-stage learning architecture. In the first stage, the multi-modal transformer serves as a shared siamese encoder for both the visual and textual branches to reason about intra-modal co-occurrence relationships. This yields modality-specific contextualized representations for each input image and text instance and equips the model with the ability to represent and reason over both visual and textual entities. In the second stage, we stack the visual and textual features together and feed them jointly into the same multi-modal transformer encoder to reason about inter-modal co-occurrence relationships between the two modalities. Additionally, we propose a novel Adaptive Similarity Aggregation (ASA) module that computes a more accurate cross-modal similarity measurement from the generated contextualized representations. Experimental results on benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
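To make the two-stage design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a single shared transformer encoder first contextualizes each modality separately (intra-modal reasoning), then re-encodes the stacked visual and textual features jointly (inter-modal reasoning), and a learned gate illustrates the spirit of adaptive similarity aggregation. All class names, dimensions, and the gating scheme are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CORENSketch(nn.Module):
    """Illustrative sketch of two-stage co-occurrence reasoning with a
    shared (siamese) transformer encoder. Names and hyperparameters are
    hypothetical, not the paper's actual configuration."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        # One shared encoder serves both modalities and both stages.
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Hypothetical gate used by the ASA-style aggregation below.
        self.asa_gate = nn.Linear(dim, 1)

    def forward(self, regions, words):
        # Stage 1: intra-modal reasoning, one modality at a time.
        v = self.shared_encoder(regions)  # (B, Nv, dim) region features
        t = self.shared_encoder(words)    # (B, Nt, dim) word features
        # Stage 2: stack both modalities and reason jointly with the
        # SAME encoder to capture inter-modal co-occurrences.
        joint = self.shared_encoder(torch.cat([v, t], dim=1))
        return joint[:, :v.size(1)], joint[:, v.size(1):]

    def asa_similarity(self, v, t):
        # ASA-style aggregation (illustrative): take each region's best
        # cosine match over words, then sum with learned gate weights.
        sim = torch.einsum('bvd,btd->bvt',
                           F.normalize(v, dim=-1), F.normalize(t, dim=-1))
        per_region = sim.max(dim=2).values
        weights = torch.softmax(self.asa_gate(v).squeeze(-1), dim=1)
        return (weights * per_region).sum(dim=1)  # (B,) image-text score
```

In practice, `regions` would come from a pretrained region detector and `words` from a token embedding layer; the sketch omits those front ends and any training loss.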
Data availability
Enquiries about data availability should be directed to the authors.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62176178.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Y., Ji, Z., Chen, K. et al. COREN: Multi-Modal Co-Occurrence Transformer Reasoning Network for Image-Text Retrieval. Neural Process Lett 55, 5959–5978 (2023). https://doi.org/10.1007/s11063-022-11121-z