
Hierarchical Attention Networks for Fact-based Visual Question Answering

Published in: Multimedia Tools and Applications

Abstract

Fact-based Visual Question Answering (FVQA) aims to answer questions about images with the help of external facts. It requires a fine-grained and simultaneous understanding of visual content, textual questions, and factual knowledge. We propose a novel Hierarchical Attention Network (HANet) for FVQA to address the limitations of existing methods. Most existing FVQA methods treat external facts only as a library of answers, which weakens the role of the facts and ignores the interaction of information from images, questions, and external knowledge. Additionally, they utilize only the appearance features of images and disregard position information, so the model fails to answer many complex questions because important visual information is missing. Our proposed model treats FVQA as a triple-modal interaction task and exploits self-attention and multiple attention interactions to make full use of information from all three modalities. Specifically, we introduce three attention modules: a Self-Attention Layer, a Triple-modal Attention Layer, and a Bi-Attention Layer to sufficiently extract useful information from images, questions, and facts. Furthermore, we introduce positional encoding into image embedding acquisition to further improve the performance of the model. Our proposed method achieves state-of-the-art performance on the FVQA dataset, with top-3 accuracy of \(85.98\%\) and top-1 accuracy of \(71.68\%\).
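To make the architecture described above more concrete, the following is a minimal PyTorch sketch of how intra-modal self-attention, question-guided cross-attention, and positional encoding of image regions could be wired together. It is an illustration under assumptions, not the authors' released implementation: the module names, the 512-dimensional features, the mean-pooling fusion, and the linear scoring head are all hypothetical, and the Triple-modal Attention Layer is approximated here by two question-guided cross-attention steps.

```python
import torch
import torch.nn as nn


class SelfAttentionLayer(nn.Module):
    """Intra-modal self-attention (assumed Transformer-style multi-head attention)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x)   # queries, keys, values all come from one modality
        return self.norm(x + out)     # residual connection + layer norm


class BiAttentionLayer(nn.Module):
    """Cross-attention: one modality queries another (the key/value modality)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)


class HANetSketch(nn.Module):
    """Toy pipeline: positional encoding on image regions, per-modality self-attention,
    question-guided cross-attention over image and fact features, late-fusion scoring.
    All design choices here are assumptions made for illustration."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.self_img = SelfAttentionLayer(dim)
        self.self_qst = SelfAttentionLayer(dim)
        self.self_fct = SelfAttentionLayer(dim)
        self.img_by_qst = BiAttentionLayer(dim)
        self.fct_by_qst = BiAttentionLayer(dim)
        self.score = nn.Linear(3 * dim, 1)   # hypothetical head: scores one candidate fact

    def forward(self, img, qst, fct, img_pos):
        # img: (B, regions, dim), qst: (B, words, dim), fct: (B, tokens, dim)
        img = img + img_pos                  # add positional encoding to region features
        img = self.self_img(img)
        qst = self.self_qst(qst)
        fct = self.self_fct(fct)
        img_ctx = self.img_by_qst(qst, img).mean(dim=1)
        fct_ctx = self.fct_by_qst(qst, fct).mean(dim=1)
        qst_ctx = qst.mean(dim=1)
        return self.score(torch.cat([img_ctx, qst_ctx, fct_ctx], dim=-1))
```

In such a setup, the model would score each retrieved candidate fact and rank answers by these scores; fact retrieval and training details are beyond this sketch.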



Acknowledgements

This work was supported by the National Key R&D Program of China (2019YFE0105400).

Author information


Corresponding author

Correspondence to Zhi Zhang.

Ethics declarations

Conflicts of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yao, H., Luo, Y., Zhang, Z. et al. Hierarchical Attention Networks for Fact-based Visual Question Answering. Multimed Tools Appl 83, 17281–17298 (2024). https://doi.org/10.1007/s11042-023-16151-w



  • DOI: https://doi.org/10.1007/s11042-023-16151-w
