ABSTRACT
Visual Question Answering (VQA) serves as a proxy for evaluating an intelligent agent's scene understanding by having it answer questions about images. Most VQA benchmarks to date focus on questions that can be answered from the visual content of the scene, such as simple counting and visual attributes, along with somewhat more challenging questions that require extra encyclopedic knowledge. However, humans also have a remarkable capacity to reason about dynamic interactions with a scene, which goes beyond the literal content of an image and has not been investigated so far. In this paper, we propose Agent Interaction Visual Question Answering (AI-VQA), a task that probes deep scene understanding by asking what happens if an agent takes a certain action. For this task, a model must not only answer action-related questions but also locate the objects with which the interaction occurs, guaranteeing that it truly comprehends the action. Accordingly, we build a new dataset on top of Visual Genome and the ATOMIC knowledge graph, containing more than 19,000 manually annotated questions, and will make it publicly available. In addition, for each question we annotate the reasoning path that leads to the answer. Based on this dataset, we further propose a novel method, called ARE, that comprehends the interaction and explains the reasoning using a given event knowledge base. Experimental results show that our proposed method outperforms the baseline by a clear margin.
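To make the two-part task concrete, below is a minimal sketch of what a single AI-VQA example and its localization check might look like in Python. The field names, sample values, and the `box_iou` helper are illustrative assumptions rather than the dataset's actual schema; the reasoning-path relations (`xWant`, `xEffect`) merely follow the ATOMIC style.

```python
# A minimal sketch of a single AI-VQA example. Field names and values are
# illustrative assumptions, not the released dataset's actual schema.
example = {
    "image_id": "visual_genome_0001",  # source image from Visual Genome
    "question": "What would the man do if the dog ran toward him?",
    "answer": "He would reach out to pet the dog.",
    # Grounding annotation: the object involved in the hypothetical
    # interaction, as an [x1, y1, x2, y2] bounding box.
    "interaction_box": [112, 60, 340, 298],
    # Annotated reasoning path over an event knowledge base such as ATOMIC.
    "reasoning_path": [
        "dog runs toward person",
        "xWant: person wants to greet the dog",
        "xEffect: person reaches out and pets the dog",
    ],
}

def box_iou(pred, gold):
    """Intersection over union of two [x1, y1, x2, y2] boxes; a common way
    to score whether a predicted interaction region matches the annotation."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred) + area(gold) - inter
    return inter / union if union else 0.0
```

Evaluating both the textual answer and the grounded box (e.g., accepting a localization when `box_iou` exceeds a threshold such as 0.5) reflects the task's requirement that a model explain where the interaction happens, not just what happens.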
REFERENCES
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086.
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425--2433.
- Zhuo Chen, Jiaoyan Chen, Yuxia Geng, Jeff Z. Pan, Zonggang Yuan, and Huajun Chen. 2021. Zero-shot visual question answering using knowledge graph. In International Semantic Web Conference. Springer, 146--162.
- Baoyu Fan, Li Wang, Runze Zhang, Zhenhua Guo, Yaqian Zhao, Rengang Li, and Weifeng Gong. 2020. Contextual multi-scale feature learning for person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia. ACM, 655--663. https://doi.org/10.1145/3394171.3414038
- Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP.
- Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. 2015. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences 112, 12 (2015), 3618--3623.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
- Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6700--6709.
- Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2020. COMET-ATOMIC 2020: On symbolic and neural commonsense knowledge graphs. arXiv preprint arXiv:2010.05953 (2020).
- Aman Jain, Mayank Kothyari, Vishwajeet Kumar, Preethi Jyothi, Ganesh Ramakrishnan, and Soumen Chakrabarti. 2021. Select, substitute, search: A new benchmark for knowledge-augmented visual question answering. arXiv preprint arXiv:2103.05568 (2021).
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 248--255.
- Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2901--2910.
- Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde, Virginie Do, Zeynep Akata, and Thomas Lukasiewicz. 2021. e-ViL: A dataset and benchmark for natural language explanations in vision-language tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1244--1254.
- Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. Advances in Neural Information Processing Systems 31 (2018), 1571--1581.
- Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 1 (2017), 32--73.
- Changsheng Li, Rongqing Li, Ye Yuan, Guoren Wang, and Dong Xu. 2021. Deep unsupervised active learning via matrix sketching. IEEE Transactions on Image Processing 30 (2021), 9280--9293.
- Changsheng Li, Handong Ma, Ye Yuan, Guoren Wang, and Dong Xu. 2022. Structure guided deep neural network for unsupervised active learning. IEEE Transactions on Image Processing (2022).
- Changsheng Li, Xiangfeng Wang, Weishan Dong, Junchi Yan, Qingshan Liu, and Hongyuan Zha. 2018. Joint active learning with feature selection via CUR matrix decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 6 (2018), 1382--1396.
- Changsheng Li, Fan Wei, Weishan Dong, Xiangfeng Wang, Qingshan Liu, and Xin Zhang. 2018. Dynamic structure embedded online multiple-output regression for streaming data. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2018), 323--336.
- Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo. 2018. Tell-and-answer: Towards explainable visual question answering using attributes and captions. arXiv preprint arXiv:1801.09041 (2018).
- Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision. Springer, 121--137.
- Hugo Liu and Push Singh. 2004. ConceptNet: A practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211--226.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in Neural Information Processing Systems 27 (2014), 1682--1690.
- Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3195--3204.
- Medhini Narasimhan, Svetlana Lazebnik, and Alexander G. Schwing. 2018. Out of the box: Reasoning with graph convolution nets for factual visual question answering. arXiv preprint arXiv:1811.00538 (2018).
- Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020).
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 3982--3992. https://doi.org/10.18653/v1/D19-1410
- Li Wang, Baoyu Fan, Zhenhua Guo, Yaqian Zhao, Runze Zhang, Rengang Li, Weifeng Gong, and Endong Wang. 2021. Knowledge-supervised learning: Knowledge consensus constraints for person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia. ACM, 1866--1874. https://doi.org/10.1145/3474085.3475340
- Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2017. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 10 (2017), 2413--2427.
- Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony Dick. 2015. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570 (2015).
- Qingqing Wang, Liqiang Xiao, Yue Lu, Yaohui Jin, and Hao He. 2021. Towards reasoning ability in scene text visual question answering. In Proceedings of the 29th ACM International Conference on Multimedia. ACM, 2281--2289. https://doi.org/10.1145/3474085.3475390
- Junda Wu, Tong Yu, and Shuai Li. 2021. Deconfounded and explainable interactive vision-language retrieval of complex scenes. In Proceedings of the 29th ACM International Conference on Multimedia. ACM, 2103--2111. https://doi.org/10.1145/3474085.3475366
- Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. 2017. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 1821--1830.
- Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6720--6731.
- Kaihao Zhang, Rongqing Li, Yanjiang Yu, Wenhan Luo, and Changsheng Li. 2021. Deep dense multi-scale network for snow removal using semantic and depth priors. IEEE Transactions on Image Processing 30 (2021), 7419--7431.
- Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5579--5588.