Abstract
Deep multimodal learning has attracted increasing attention in artificial intelligence because it bridges vision and language. Most existing works focus only on specific multimodal tasks, which limits their ability to generalize to other tasks. Furthermore, these works learn only coarse-grained interactions at the object level in images and the word level in text, while neglecting fine-grained interactions at the relation level and attribute level. In this paper, to alleviate these issues, we propose a Semantic-aware Multi-Branch Interaction (SeMBI) network for various multimodal learning tasks. SeMBI mainly consists of three modules: a Multi-Branch Visual Semantics (MBVS) module, a Multi-Branch Textual Semantics (MBTS) module and a Multi-Branch Cross-modal Alignment (MBCA) module. MBVS enhances the visual features and performs reasoning through three parallel branches: a latent relationship branch, an explicit relationship branch and an attribute branch. MBTS learns relation-level and attribute-level language context through a textual relationship branch and a textual attribute branch, respectively. The enhanced visual features are then passed into MBCA to learn fine-grained cross-modal correspondence under the guidance of the relation-level and attribute-level language context. We demonstrate the generalizability and effectiveness of the proposed SeMBI by applying it to three deep multimodal learning tasks: Visual Question Answering (VQA), Referring Expression Comprehension (REC) and Cross-Modal Retrieval (CMR). Extensive experiments on five common benchmark datasets show superior performance compared with state-of-the-art methods.
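As a rough illustration of the data flow the abstract describes, the sketch below stubs each SeMBI module as a placeholder layer. All names, dimensions and layer choices (linear projections for the branches, cross-attention for the alignment) are our own assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SeMBI(nn.Module):
    """Minimal sketch of the SeMBI data flow (placeholder internals)."""

    def __init__(self, d=512, heads=8):
        super().__init__()
        # MBVS: latent relationship, explicit relationship and attribute
        # branches, stubbed here as linear projections over region features.
        self.latent_rel = nn.Linear(d, d)
        self.explicit_rel = nn.Linear(d, d)
        self.visual_attr = nn.Linear(d, d)
        # MBTS: textual relationship and textual attribute branches.
        self.text_rel = nn.Linear(d, d)
        self.text_attr = nn.Linear(d, d)
        # MBCA: fine-grained alignment, stubbed as cross-attention in which
        # the enhanced visual features attend to each level of language context.
        self.align_rel = nn.MultiheadAttention(d, heads, batch_first=True)
        self.align_attr = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, v, t):
        # v: (batch, regions, d) visual features; t: (batch, words, d) text features.
        v_enh = self.latent_rel(v) + self.explicit_rel(v) + self.visual_attr(v)
        t_rel, t_attr = self.text_rel(t), self.text_attr(t)
        a_rel, _ = self.align_rel(v_enh, t_rel, t_rel)
        a_attr, _ = self.align_attr(v_enh, t_attr, t_attr)
        # A task head (VQA classifier, REC grounding head, CMR similarity)
        # would consume the aligned features from here.
        return a_rel + a_attr
```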







Data availability
The datasets analysed during the current study are available from the corresponding author on reasonable request.
Notes
If there is no relationship between words \(t_i\) and \(t_j\), the relational embedding between them is set to the zero vector.
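For concreteness, a minimal sketch of this zero-padding convention, assuming the relations are given as a pair-indexed mapping (the function name and signature are hypothetical, introduced only for illustration):

```python
import torch

def build_relation_embeddings(num_words, relations, dim):
    """Assemble pairwise relational embeddings for a sentence.

    `relations` is a hypothetical mapping from word-index pairs (i, j)
    to a `dim`-dimensional embedding for the relation between them.
    """
    rel = torch.zeros(num_words, num_words, dim)  # unrelated pairs stay all-zero
    for (i, j), emb in relations.items():
        rel[i, j] = emb
    return rel
```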
Acknowledgements
This paper was supported by the National Key R&D Program of China (2019YFC1521204).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Pan, H., Huang, J. Semantic-aware multi-branch interaction network for deep multimodal learning. Neural Comput & Applic 35, 7529–7545 (2023). https://doi.org/10.1007/s00521-022-08048-w