Abstract
Fact-based Visual Question Answering (FVQA) aims to answer questions about images with the support of external facts. It requires a fine-grained and simultaneous understanding of visual content, textual questions, and factual knowledge. Most existing FVQA methods treat the external facts merely as a library of candidate answers, which weakens the role these facts can play, and they under-exploit the information contained in images, questions, and external knowledge. Moreover, they rely only on appearance features of images and disregard position information, so they fail on many complex questions because important visual cues are missing. To address these limitations, we propose a novel Hierarchical Attention Network (HANet) for FVQA. HANet treats FVQA as a triple-modal interaction task and exploits self-attention and multiple attention interactions to make full use of information from all three modalities. Specifically, we introduce three attention modules, a Self-Attention Layer, a Triple-modal Attention Layer, and a Bi-Attention Layer, to sufficiently extract useful information from images, questions, and facts. Furthermore, we incorporate positional encoding into the image embeddings to further improve performance. Our method achieves state-of-the-art performance on the FVQA dataset, with a top-3 accuracy of \(85.98\%\) and a top-1 accuracy of \(71.68\%\).
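To make the two ingredients highlighted in the abstract concrete, below is a minimal PyTorch sketch of (a) scaled dot-product self-attention over a set of feature vectors and (b) one common way to inject region position information into image embeddings. The class and function names are hypothetical and the geometric encoding (normalized box coordinates and area) is an assumption for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Scaled dot-product self-attention over a set of features;
    a generic stand-in for the paper's Self-Attention Layer."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, n_items, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)    # (batch, n_items, n_items)
        return attn @ v                         # attended features, same shape as x

def add_region_position(appearance, boxes, image_wh):
    """Append normalized bounding-box geometry to region appearance features,
    one plausible form of positional encoding for image embeddings."""
    w, h = image_wh
    x1, y1, x2, y2 = boxes.unbind(-1)           # boxes: (batch, n_regions, 4)
    pos = torch.stack([x1 / w, y1 / h, x2 / w, y2 / h,
                       (x2 - x1) * (y2 - y1) / (w * h)], dim=-1)
    return torch.cat([appearance, pos], dim=-1) # (batch, n_regions, dim + 5)
```

In a HANet-style pipeline, such a self-attention layer would be applied to the image, question, and fact representations before cross-modal layers (the Triple-modal and Bi-Attention Layers) fuse the three modalities.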
Acknowledgements
This work was supported by the National Key R&D Program of China (2019YFE0105400).
Ethics declarations
Conflicts of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yao, H., Luo, Y., Zhang, Z. et al. Hierarchical Attention Networks for Fact-based Visual Question Answering. Multimed Tools Appl 83, 17281–17298 (2024). https://doi.org/10.1007/s11042-023-16151-w