COREN: Multi-Modal Co-Occurrence Transformer Reasoning Network for Image-Text Retrieval

Neural Processing Letters

Abstract

Cross-modal image-text retrieval aims to retrieve images according to a given query text and vice versa, a challenging task due to the inherent heterogeneity gap between computer vision and natural language processing. Most previous methods mine intra-modal and inter-modal interactions independently, which may lead to a fragmented understanding of the visual-linguistic modalities. In contrast, in this paper we address this challenge by proposing a unified multi-modal Co-Occurrence transformer Reasoning Network, dubbed COREN, to comprehensively discover the semantic correlations between the two modalities. Specifically, we resort to a unified multi-modal transformer encoder that decomposes the reasoning of intra-modal and inter-modal co-occurrence relationships into a two-stage learning architecture. In the first learning stage, we use the multi-modal transformer as a shared siamese encoder for both the visual and textual branches to reason about intra-modal co-occurrence relationships. In this way, we obtain modality-specific contextualized representations for each input image and text instance, and the model is equipped with the ability to represent and reason about both visual and textual entities. In the second learning stage, we stack the visual and textual features together and jointly feed them into the same multi-modal transformer encoder to reason about the inter-modal co-occurrence relationships between the two modalities. Additionally, we propose a novel Adaptive Similarity Aggregation (ASA) module that achieves a more accurate cross-modal similarity measurement based on the generated contextualized representations. Experimental results on benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
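
To make the two-stage scheme concrete, below is a minimal PyTorch sketch reconstructed from the abstract alone: a single shared transformer encoder first reasons over each modality separately (stage one), then over the stacked visual-textual sequence (stage two), and a small gating layer stands in for the Adaptive Similarity Aggregation module. All class names, dimensions, and the exact aggregation form are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of the two-stage co-occurrence reasoning described in the
# abstract. Class names, dimensions, and the ASA weighting below are illustrative
# assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CORENSketch(nn.Module):
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # A single shared (siamese) transformer reused for both modalities and both stages.
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Stand-in for Adaptive Similarity Aggregation: a gate that adaptively weights
        # the cross-modal similarity computed from the contextualized representations.
        self.asa_gate = nn.Linear(dim * 2, 1)

    def forward(self, img_regions, txt_tokens):
        # Stage 1: intra-modal co-occurrence reasoning, one modality at a time.
        v = self.shared_encoder(img_regions)   # (B, num_regions, dim)
        t = self.shared_encoder(txt_tokens)    # (B, num_tokens, dim)

        # Stage 2: inter-modal reasoning over the stacked visual + textual sequence.
        joint = self.shared_encoder(torch.cat([v, t], dim=1))
        v_ctx, t_ctx = joint[:, : v.size(1)], joint[:, v.size(1):]

        # Adaptive similarity: gate the cosine similarity of the pooled representations.
        v_pool = F.normalize(v_ctx.mean(dim=1), dim=-1)
        t_pool = F.normalize(t_ctx.mean(dim=1), dim=-1)
        weight = torch.sigmoid(self.asa_gate(torch.cat([v_pool, t_pool], dim=-1)))
        return weight.squeeze(-1) * (v_pool * t_pool).sum(dim=-1)   # (B,) similarity scores


# Toy usage: 36 region features and 20 token embeddings per image-text pair.
if __name__ == "__main__":
    model = CORENSketch()
    scores = model(torch.randn(2, 36, 512), torch.randn(2, 20, 512))
    print(scores.shape)  # torch.Size([2])
```

Reusing the same encoder weights in both stages mirrors the shared siamese design the abstract emphasizes; the pooled-vector gate here is only a stand-in for whatever finer-grained aggregation ASA actually performs over the contextualized representations.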

Data availability

Enquiries about data availability should be directed to the authors.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62176178.

Author information

Corresponding author

Correspondence to Zhong Ji.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Y., Ji, Z., Chen, K. et al. COREN: Multi-Modal Co-Occurrence Transformer Reasoning Network for Image-Text Retrieval. Neural Process Lett 55, 5959–5978 (2023). https://doi.org/10.1007/s11063-022-11121-z
