Abstract
Cross-modal image-text retrieval aims to retrieve images given query texts and vice versa, a challenging task due to the inherent heterogeneity gap between computer vision and natural language processing. Most previous methods mine intra-modal and inter-modal interactions independently, which can lead to a fragmented understanding of the visual-linguistic modalities. In contrast, this paper addresses the challenge with a unified multi-modal Co-Occurrence transformer Reasoning Network, dubbed COREN, which comprehensively discovers the semantic correlations between the two modalities. Specifically, we employ a unified multi-modal transformer encoder and decompose the reasoning of intra-modal and inter-modal co-occurrence relationships into a two-stage learning architecture. In the first stage, the multi-modal transformer serves as a shared siamese encoder for both the visual and textual branches to reason about intra-modal co-occurrence relationships. This yields modality-specific contextualized representations for each input image and text instance and equips the model with the ability to represent and reason over both visual and textual entities. In the second stage, we stack the visual and textual features together and feed them jointly into the same multi-modal transformer encoder to reason about inter-modal co-occurrence relationships between the two modalities. Additionally, we propose a novel Adaptive Similarity Aggregation (ASA) module that computes a more accurate cross-modal similarity measurement from the generated contextualized representations. Experimental results on benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
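To make the two-stage design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a single shared transformer encoder first contextualizes each modality separately (intra-modal reasoning), then re-encodes the stacked visual and textual features jointly (inter-modal reasoning), and a learned gate illustrates the spirit of adaptive similarity aggregation. All class names, dimensions, and the gating scheme are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CORENSketch(nn.Module):
    """Illustrative sketch of two-stage co-occurrence reasoning with a
    shared (siamese) transformer encoder. Names and hyperparameters are
    hypothetical, not the paper's actual configuration."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        # One shared encoder serves both modalities and both stages.
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Hypothetical gate used by the ASA-style aggregation below.
        self.asa_gate = nn.Linear(dim, 1)

    def forward(self, regions, words):
        # Stage 1: intra-modal reasoning, one modality at a time.
        v = self.shared_encoder(regions)  # (B, Nv, dim) region features
        t = self.shared_encoder(words)    # (B, Nt, dim) word features
        # Stage 2: stack both modalities and reason jointly with the
        # SAME encoder to capture inter-modal co-occurrences.
        joint = self.shared_encoder(torch.cat([v, t], dim=1))
        return joint[:, :v.size(1)], joint[:, v.size(1):]

    def asa_similarity(self, v, t):
        # ASA-style aggregation (illustrative): take each region's best
        # cosine match over words, then sum with learned gate weights.
        sim = torch.einsum('bvd,btd->bvt',
                           F.normalize(v, dim=-1), F.normalize(t, dim=-1))
        per_region = sim.max(dim=2).values
        weights = torch.softmax(self.asa_gate(v).squeeze(-1), dim=1)
        return (weights * per_region).sum(dim=1)  # (B,) image-text score
```

In practice, `regions` would come from a pretrained region detector and `words` from a token embedding layer; the sketch omits those front ends and any training loss.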
Data availability
Enquiries about data availability should be directed to the authors.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62176178.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Y., Ji, Z., Chen, K. et al. COREN: Multi-Modal Co-Occurrence Transformer Reasoning Network for Image-Text Retrieval. Neural Process Lett 55, 5959–5978 (2023). https://doi.org/10.1007/s11063-022-11121-z