Inner Knowledge-based Img2Doc Scheme for Visual Question Answering


Abstract

Visual Question Answering (VQA) is a research topic of significant interest at the intersection of computer vision and natural language understanding. Recent research indicates that attributes and knowledge can effectively improve performance on both image captioning and VQA. In this article, an inner knowledge-based Img2Doc algorithm for VQA is presented, where inner knowledge is characterized as the inner attribute relationships within visual images. In addition to using an attribute network for inner knowledge-based image representation, the proposed scheme employs a question-guided Doc2Vec method for question answering. The attribute network generates inner knowledge-based features for visual images, while the novel question-guided Doc2Vec method converts natural language text into vector features. These question vectors are then combined with the visual image features and fed into a classifier to produce an answer. In this way, our model reduces the VQA problem to textual question answering. Experimental results demonstrate that the proposed method achieves superior performance on multiple benchmark datasets.
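To make the pipeline concrete, below is a minimal sketch of how the three components described in the abstract (attribute-based image features, question-guided Doc2Vec question vectors, and an answer classifier) might be wired together. The PyTorch framing, module names, and feature dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Img2Doc fusion-and-classification stage.
# Assumes image features come from a CNN backbone and question vectors
# from a question-guided Doc2Vec encoder; all sizes are placeholders.
import torch
import torch.nn as nn

class Img2DocVQA(nn.Module):
    def __init__(self, img_dim=2048, num_attributes=256,
                 doc_dim=300, num_answers=1000):
        super().__init__()
        # Attribute network: maps CNN image features to an
        # inner-knowledge (attribute) representation.
        self.attribute_net = nn.Sequential(
            nn.Linear(img_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_attributes), nn.Sigmoid(),
        )
        # Classifier over the concatenated attribute and question vectors.
        self.classifier = nn.Sequential(
            nn.Linear(num_attributes + doc_dim, 512), nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, image_feat, question_vec):
        # image_feat: (B, img_dim) CNN features
        # question_vec: (B, doc_dim) question-guided Doc2Vec vector
        attributes = self.attribute_net(image_feat)
        fused = torch.cat([attributes, question_vec], dim=-1)
        return self.classifier(fused)  # answer logits


# Example usage with random tensors standing in for real features.
model = Img2DocVQA()
logits = model(torch.randn(4, 2048), torch.randn(4, 300))
answer_ids = logits.argmax(dim=-1)
```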


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
August 2022, 478 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505208


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 March 2022
          • Accepted: 1 September 2021
          • Received: 1 May 2021

