Inner Knowledge-based Img2Doc Scheme for Visual Question Answering


Abstract

Visual Question Answering (VQA) is a research topic of significant interest at the intersection of computer vision and natural language understanding. Recent research indicates that attributes and knowledge can effectively improve performance on both image captioning and VQA. In this article, an inner knowledge-based Img2Doc algorithm for VQA is presented, where inner knowledge is characterized as the inner attribute relationships within visual images. In addition to using an attribute network for inner knowledge-based image representation, the proposed scheme employs a question-guided Doc2Vec method for question answering. The attribute network generates inner knowledge-based features for visual images, while the novel question-guided Doc2Vec method converts natural language text into vector features. These question vectors are then combined with the visual image features and fed into a classifier to produce an answer. In this way, our model reduces the VQA problem to textual question answering. Experimental results demonstrate that the proposed method achieves superior performance on multiple benchmark datasets.
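To make the pipeline concrete, below is a minimal sketch of how the three components described in the abstract (attribute-based image features, question-guided Doc2Vec question vectors, and an answer classifier) might be wired together. The PyTorch framing, module names, and feature dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Img2Doc fusion-and-classification stage.
# Assumes image features come from a CNN backbone and question vectors
# from a question-guided Doc2Vec encoder; all sizes are placeholders.
import torch
import torch.nn as nn

class Img2DocVQA(nn.Module):
    def __init__(self, img_dim=2048, num_attributes=256,
                 doc_dim=300, num_answers=1000):
        super().__init__()
        # Attribute network: maps CNN image features to an
        # inner-knowledge (attribute) representation.
        self.attribute_net = nn.Sequential(
            nn.Linear(img_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_attributes), nn.Sigmoid(),
        )
        # Classifier over the concatenated attribute and question vectors.
        self.classifier = nn.Sequential(
            nn.Linear(num_attributes + doc_dim, 512), nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, image_feat, question_vec):
        # image_feat: (B, img_dim) CNN features
        # question_vec: (B, doc_dim) question-guided Doc2Vec vector
        attributes = self.attribute_net(image_feat)
        fused = torch.cat([attributes, question_vec], dim=-1)
        return self.classifier(fused)  # answer logits


# Example usage with random tensors standing in for real features.
model = Img2DocVQA()
logits = model(torch.randn(4, 2048), torch.randn(4, 300))
answer_ids = logits.argmax(dim=-1)
```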


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
August 2022, 478 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505208


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 March 2022
          • Accepted: 1 September 2021
          • Received: 1 May 2021

