Abstract
Visual Question Answering (VQA) is a research topic of significant interest at the intersection of computer vision and natural language understanding. Recent research indicates that attributes and knowledge can effectively improve performance on both image captioning and VQA. In this article, an inner knowledge-based Img2Doc algorithm for VQA is presented, where inner knowledge is characterized as the relationships among attributes within a visual image. In addition to an attribute network for inner knowledge-based image representation, the VQA scheme employs a question-guided Doc2Vec method for question answering. The attribute network generates inner knowledge-based features for visual images, while the novel question-guided Doc2Vec method converts natural-language text into vector features. These text vectors are then combined with the visual image features and fed into a classifier that predicts an answer. Under this model, the VQA problem is reduced to textual question answering. Experimental results demonstrate that the proposed method achieves superior performance on multiple benchmark datasets.
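The fusion step described above — combining attribute-based image features with a Doc2Vec question vector and passing the result to an answer classifier — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, the random placeholder features, and the single linear softmax layer are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
ATTR_DIM = 256   # inner knowledge-based attribute features of the image
DOC_DIM = 128    # question vector from the question-guided Doc2Vec step
N_ANSWERS = 10   # candidate answers treated as classification targets

def answer_probs(attr_feat, q_vec, W, b):
    """Fuse image and question features by concatenation, then score
    each candidate answer with a linear layer followed by softmax."""
    fused = np.concatenate([attr_feat, q_vec])
    logits = W @ fused + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Placeholder features standing in for the attribute network and Doc2Vec outputs.
attr_feat = rng.standard_normal(ATTR_DIM)
q_vec = rng.standard_normal(DOC_DIM)

# Untrained classifier parameters; in practice these would be learned.
W = rng.standard_normal((N_ANSWERS, ATTR_DIM + DOC_DIM))
b = np.zeros(N_ANSWERS)

probs = answer_probs(attr_feat, q_vec, W, b)
answer = int(np.argmax(probs))  # index of the predicted answer
```

The concatenate-then-classify fusion shown here is one common choice; the paper's actual combination of the two feature streams may differ in detail.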
Index Terms
- Inner Knowledge-based Img2Doc Scheme for Visual Question Answering