Abstract
Fact-based Visual Question Answering (FVQA) aims to answer questions about images with the support of external facts. It requires a fine-grained and simultaneous understanding of visual content, textual questions, and factual knowledge. Most existing FVQA methods treat the external facts merely as a library of candidate answers, which weakens the role these facts can play, and they under-exploit the information contained in images, questions, and external knowledge. Moreover, they rely only on appearance features of images and disregard position information, so they fail on many complex questions because important visual cues are missing. To address these limitations, we propose a novel Hierarchical Attention Network (HANet) for FVQA. HANet treats FVQA as a triple-modal interaction task and exploits self-attention and multiple attention interactions to make full use of information from all three modalities. Specifically, we introduce three attention modules, a Self-Attention Layer, a Triple-modal Attention Layer, and a Bi-Attention Layer, to sufficiently extract useful information from images, questions, and facts. Furthermore, we incorporate positional encoding into the image embeddings to further improve performance. Our method achieves state-of-the-art performance on the FVQA dataset, with a top-3 accuracy of \(85.98\%\) and a top-1 accuracy of \(71.68\%\).
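To make the two ingredients highlighted in the abstract concrete, below is a minimal PyTorch sketch of (a) scaled dot-product self-attention over a set of feature vectors and (b) one common way to inject region position information into image embeddings. The class and function names are hypothetical and the geometric encoding (normalized box coordinates and area) is an assumption for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Scaled dot-product self-attention over a set of features;
    a generic stand-in for the paper's Self-Attention Layer."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, n_items, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)    # (batch, n_items, n_items)
        return attn @ v                         # attended features, same shape as x

def add_region_position(appearance, boxes, image_wh):
    """Append normalized bounding-box geometry to region appearance features,
    one plausible form of positional encoding for image embeddings."""
    w, h = image_wh
    x1, y1, x2, y2 = boxes.unbind(-1)           # boxes: (batch, n_regions, 4)
    pos = torch.stack([x1 / w, y1 / h, x2 / w, y2 / h,
                       (x2 - x1) * (y2 - y1) / (w * h)], dim=-1)
    return torch.cat([appearance, pos], dim=-1) # (batch, n_regions, dim + 5)
```

In a HANet-style pipeline, such a self-attention layer would be applied to the image, question, and fact representations before cross-modal layers (the Triple-modal and Bi-Attention Layers) fuse the three modalities.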
Acknowledgements
This work was supported by the National Key R&D Program of China (2019YFE0105400).
Ethics declarations
Conflicts of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yao, H., Luo, Y., Zhang, Z. et al. Hierarchical Attention Networks for Fact-based Visual Question Answering. Multimed Tools Appl 83, 17281–17298 (2024). https://doi.org/10.1007/s11042-023-16151-w