DOI: 10.1145/3503161.3548387

AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Published: 10 October 2022

ABSTRACT

Visual Question Answering (VQA) serves as a proxy for evaluating the scene understanding of an intelligent agent by answering questions about images. Most VQA benchmarks to date focus on questions that can be answered by understanding the visual content of the scene, such as simple counting and visual attributes, or on slightly more challenging questions that require extra encyclopedic knowledge. However, humans also have a remarkable capacity to reason about dynamic interactions with a scene, which goes beyond the literal content of an image and has not been investigated so far. In this paper, we propose Agent Interaction Visual Question Answering (AI-VQA), a task that investigates deep scene understanding of what happens if an agent takes a certain action. For this task, a model must not only answer action-related questions but also locate the objects with which the interaction occurs, guaranteeing that it truly comprehends the action. Accordingly, we build a new dataset based on Visual Genome and the ATOMIC knowledge graph, containing more than 19,000 manually annotated questions, which we will make publicly available. In addition, for each question we annotate the reasoning path followed to derive the answer. Based on this dataset, we further propose a novel method, called ARE, that comprehends the interaction and explains the reasoning with respect to a given event knowledge base. Experimental results show that our proposed method outperforms the baseline by a clear margin.
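To make the task setup concrete, the sketch below shows one plausible way an AI-VQA example could be represented: an action-related question paired with its answer, the grounded object box, and an annotated reasoning path. The `AIVQASample` class, its field names, and the example values are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AIVQASample:
    """Hypothetical record for one AI-VQA example.

    Field names are assumptions for illustration, not the dataset's real format.
    """
    image_id: str                                            # Visual Genome image the question refers to
    question: str                                            # action-related question about the scene
    answer: str                                              # annotated answer to the question
    grounded_object_box: Tuple[float, float, float, float]   # box of the object involved in the interaction
    reasoning_path: List[str] = field(default_factory=list)  # annotated steps from event knowledge to answer


# Illustrative example only; the question, answer, and box are made up.
sample = AIVQASample(
    image_id="vg_2370150",
    question="If the man throws the frisbee, what will the dog most likely do?",
    answer="chase and catch the frisbee",
    grounded_object_box=(120.0, 85.0, 260.0, 210.0),
    reasoning_path=[
        "event: the man throws the frisbee",
        "event knowledge: as a result, the dog wants to catch it",
        "grounding: the frisbee held by the man",
    ],
)
print(sample.answer)
```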


Published in
        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
        ISBN: 9781450392037
        DOI: 10.1145/3503161

        Copyright © 2022 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Acceptance Rates

        Overall Acceptance Rate: 995 of 4,171 submissions, 24%
