Abstract
Image–text matching is important for understanding both vision and language. Existing methods utilize the cross-attention mechanism to explore deep semantic information. However, most of these methods need to perform two types of alignment, which is extremely time-consuming. In addition, current methods do not consider the digital information within the image or text, which may reduce retrieval performance. In this paper, we propose a multi-level network based on the transformer encoder for fine-grained image–text matching. First, we use the transformer encoder to extract intra-modality relations within the image and text and perform alignment through an efficient aggregation method, which makes the alignment more efficient while fully exploiting intra-modality information. Second, we capture the discriminative digital information within the image and text to make the representations more distinguishable. Finally, we utilize the global information of the image and text as complementary information to enhance the representations. Experimental results show that our method achieves significant improvements in retrieval performance and runtime compared with state-of-the-art algorithms. The source code is available at https://github.com/CQULab/MNTE.
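To make the first step of the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of intra-modality relation modeling with a transformer encoder, followed by aggregation into a single embedding per modality for alignment. The class and function names, dimensions, and the mean-pooling aggregation are illustrative assumptions rather than the authors' released code; the actual implementation is available at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: model intra-modality relations among fragments
# (image regions or word tokens) with a transformer encoder, then aggregate
# them into one normalized embedding per modality.
class IntraModalityEncoder(nn.Module):
    def __init__(self, dim=1024, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, fragments):
        # fragments: (batch, num_fragments, dim)
        related = self.encoder(fragments)                 # intra-modality relations
        return F.normalize(related.mean(dim=1), dim=-1)   # aggregate to one vector


def similarity(image_fragments, text_fragments, img_enc, txt_enc):
    # Alignment via cosine similarity of the aggregated embeddings,
    # instead of cross-attending every region-word pair.
    img = img_enc(image_fragments)   # (batch, dim)
    txt = txt_enc(text_fragments)    # (batch, dim)
    return img @ txt.t()             # (batch, batch) similarity matrix
```

Aggregating each modality into a single vector before computing similarity avoids pairwise cross-attention between every region and word, which is the runtime concern the abstract raises about cross-attention-based methods.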
Data availability
The datasets used during the current study were derived from the following public domain resources: http://shannon.cs.illinois.edu/DenotationGraph/ and https://cocodataset.org/.
Notes
Our proposed method is implemented in the PyTorch framework on an NVIDIA GeForce GTX 1080Ti GPU.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 62262006), the National Natural Science Foundation of China by Mingliang Zhou (No. 62176027), Zhejiang Lab (No. 2021KE0AB01), the Open Fund of the Key Laboratory of Monitoring, Evaluation and Early Warning of Territorial Spatial Planning Implementation, Ministry of Natural Resources (No. LMEE-KF2021008), the Technology Innovation and Application Development Key Project of Chongqing (No. cstc2021jscx-gksbX0058), and the Guangxi Key Laboratory of Trusted Software (No. kx202006).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, L., Feng, Y., Zhou, M. et al. Multi-level network based on transformer encoder for fine-grained image–text matching. Multimedia Systems 29, 1981–1994 (2023). https://doi.org/10.1007/s00530-023-01079-w