
Cross-modality co-attention networks for visual question answering

  • Methodologies and Application

Abstract

Visual question answering (VQA) is an emerging task that combines natural language processing and computer vision. Selecting informative multi-modal features is at the core of VQA. In multi-modal learning, attention networks provide an effective way to selectively utilize the given visual information. However, most previous VQA models focus on the relationship between visual and language features and ignore the internal relationships within each modality. To address this issue: (1) we propose the cross-modality co-attention networks (CMCN) framework, which learns both intra-modality and cross-modality relationships. (2) The cross-modality co-attention (CMC) module is the core of the framework and is composed of self-attention blocks and guided-attention blocks. The self-attention block learns intra-modality relations, while the guided-attention block models cross-modal interactions between an image and a question. Cascading multiple CMC modules not only improves the fusion of visual and language representations, but also captures more representative image and text information. (3) We carry out a thorough experimental verification of the proposed model. Evaluations on the VQA 2.0 dataset confirm that CMCN has significant performance advantages over existing methods.
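To make the cascaded self-attention and guided-attention design concrete, the sketch below shows, in PyTorch, how such CMC modules could be composed and stacked. It is an illustrative reconstruction from the abstract only, not the authors' implementation: the class names, hidden size, number of attention heads, and stacking depth are assumptions.

# Minimal sketch of a cross-modality co-attention (CMC) module, assuming a
# Transformer-style implementation; all names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class SelfAttentionBlock(nn.Module):
    """Models intra-modality relations (word-to-word or region-to-region)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x)          # query, key, value all from one modality
        return self.norm(x + out)            # residual connection + layer norm


class GuidedAttentionBlock(nn.Module):
    """Models cross-modal interactions: one modality attends to the other."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, guide):
        out, _ = self.attn(x, guide, guide)  # keys/values come from the other modality
        return self.norm(x + out)


class CMCModule(nn.Module):
    """One CMC module: self-attention per modality, then question-guided attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_sa = SelfAttentionBlock(dim)
        self.v_sa = SelfAttentionBlock(dim)
        self.v_ga = GuidedAttentionBlock(dim)

    def forward(self, v, q):
        q = self.q_sa(q)       # intra-modality relations among question words
        v = self.v_sa(v)       # intra-modality relations among image regions
        v = self.v_ga(v, q)    # cross-modal interaction guided by the question
        return v, q


# Cascading several CMC modules, as described in the abstract.
if __name__ == "__main__":
    dim, layers = 512, 6
    cmc_stack = nn.ModuleList([CMCModule(dim) for _ in range(layers)])
    v = torch.randn(2, 36, dim)   # e.g. 36 region features per image
    q = torch.randn(2, 14, dim)   # e.g. 14 word features per question
    for block in cmc_stack:
        v, q = block(v, q)
    print(v.shape, q.shape)

The fused image and question representations produced by the last module would then feed a standard answer classifier; that head is omitted here because the abstract does not describe it.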



Funding

This study was funded by the National Natural Science Foundation of China under Grants 61672338 and 61873160.

Author information


Corresponding author

Correspondence to Kuan Ching Li.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Han, D., Zhou, S., Li, K.C. et al. Cross-modality co-attention networks for visual question answering. Soft Comput 25, 5411–5421 (2021). https://doi.org/10.1007/s00500-020-05539-7


  • DOI: https://doi.org/10.1007/s00500-020-05539-7
