ABSTRACT
Visual Question Answering (VQA) serves as a proxy for evaluating an intelligent agent's scene understanding by having it answer questions about images. Most VQA benchmarks to date focus on questions that can be answered from the visual content of the scene, such as simple counting and visual attributes, along with somewhat more challenging questions that require extra encyclopedic knowledge. However, humans also have a remarkable capacity to reason about dynamic interactions with a scene, which goes beyond the literal content of an image and has not been investigated so far. In this paper, we propose Agent Interaction Visual Question Answering (AI-VQA), a task that probes deep scene understanding by asking what happens if an agent takes a certain action. For this task, a model must not only answer action-related questions but also locate the objects with which the interaction occurs, guaranteeing that it truly comprehends the action. Accordingly, we build a new dataset on top of Visual Genome and the ATOMIC knowledge graph, containing more than 19,000 manually annotated questions, and will make it publicly available. In addition, for each question we annotate the reasoning path that leads to the answer. Based on this dataset, we further propose a novel method, called ARE, that comprehends the interaction and explains the reasoning using a given event knowledge base. Experimental results show that our proposed method outperforms the baseline by a clear margin.
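To make the two-part task concrete, below is a minimal sketch of what a single AI-VQA example and its localization check might look like in Python. The field names, sample values, and the `box_iou` helper are illustrative assumptions rather than the dataset's actual schema; the reasoning-path relations (`xWant`, `xEffect`) merely follow the ATOMIC style.

```python
# A minimal sketch of a single AI-VQA example. Field names and values are
# illustrative assumptions, not the released dataset's actual schema.
example = {
    "image_id": "visual_genome_0001",  # source image from Visual Genome
    "question": "What would the man do if the dog ran toward him?",
    "answer": "He would reach out to pet the dog.",
    # Grounding annotation: the object involved in the hypothetical
    # interaction, as an [x1, y1, x2, y2] bounding box.
    "interaction_box": [112, 60, 340, 298],
    # Annotated reasoning path over an event knowledge base such as ATOMIC.
    "reasoning_path": [
        "dog runs toward person",
        "xWant: person wants to greet the dog",
        "xEffect: person reaches out and pets the dog",
    ],
}

def box_iou(pred, gold):
    """Intersection over union of two [x1, y1, x2, y2] boxes; a common way
    to score whether a predicted interaction region matches the annotation."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred) + area(gold) - inter
    return inter / union if union else 0.0
```

Evaluating both the textual answer and the grounded box (e.g., accepting a localization when `box_iou` exceeds a threshold such as 0.5) reflects the task's requirement that a model explain where the interaction happens, not just what happens.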
REFERENCES
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086.
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425--2433.
- Zhuo Chen, Jiaoyan Chen, Yuxia Geng, Jeff Z. Pan, Zonggang Yuan, and Huajun Chen. 2021. Zero-shot visual question answering using knowledge graph. In International Semantic Web Conference. Springer, 146--162.
- Baoyu Fan, Li Wang, Runze Zhang, Zhenhua Guo, Yaqian Zhao, Rengang Li, and Weifeng Gong. 2020. Contextual multi-scale feature learning for person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia. ACM, 655--663. https://doi.org/10.1145/3394171.3414038
- Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP.
- Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. 2015. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences 112, 12 (2015), 3618--3623.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
- Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6700--6709.
- Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2020. COMET-ATOMIC 2020: On symbolic and neural commonsense knowledge graphs. arXiv preprint arXiv:2010.05953 (2020).
- Aman Jain, Mayank Kothyari, Vishwajeet Kumar, Preethi Jyothi, Ganesh Ramakrishnan, and Soumen Chakrabarti. 2021. Select, substitute, search: A new benchmark for knowledge-augmented visual question answering. arXiv preprint arXiv:2103.05568 (2021).
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 248--255.
- Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2901--2910.
- Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde, Virginie Do, Zeynep Akata, and Thomas Lukasiewicz. 2021. e-ViL: A dataset and benchmark for natural language explanations in vision-language tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1244--1254.
- Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. Advances in Neural Information Processing Systems 31 (2018), 1571--1581.
- Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 1 (2017), 32--73.
- Changsheng Li, Rongqing Li, Ye Yuan, Guoren Wang, and Dong Xu. 2021. Deep unsupervised active learning via matrix sketching. IEEE Transactions on Image Processing 30 (2021), 9280--9293.
- Changsheng Li, Handong Ma, Ye Yuan, Guoren Wang, and Dong Xu. 2022. Structure guided deep neural network for unsupervised active learning. IEEE Transactions on Image Processing (2022).
- Changsheng Li, Xiangfeng Wang, Weishan Dong, Junchi Yan, Qingshan Liu, and Hongyuan Zha. 2018. Joint active learning with feature selection via CUR matrix decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 6 (2018), 1382--1396.
- Changsheng Li, Fan Wei, Weishan Dong, Xiangfeng Wang, Qingshan Liu, and Xin Zhang. 2018. Dynamic structure embedded online multiple-output regression for streaming data. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2018), 323--336.
- Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo. 2018. Tell-and-answer: Towards explainable visual question answering using attributes and captions. arXiv preprint arXiv:1801.09041 (2018).
- Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision. Springer, 121--137.
- Hugo Liu and Push Singh. 2004. ConceptNet: A practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211--226.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in Neural Information Processing Systems 27 (2014), 1682--1690.
- Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3195--3204.
- Medhini Narasimhan, Svetlana Lazebnik, and Alexander G. Schwing. 2018. Out of the box: Reasoning with graph convolution nets for factual visual question answering. arXiv preprint arXiv:1811.00538 (2018).
- Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020).
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 3982--3992. https://doi.org/10.18653/v1/D19-1410
- Li Wang, Baoyu Fan, Zhenhua Guo, Yaqian Zhao, Runze Zhang, Rengang Li, Weifeng Gong, and Endong Wang. 2021. Knowledge-supervised learning: Knowledge consensus constraints for person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia. ACM, 1866--1874. https://doi.org/10.1145/3474085.3475340
- Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2017. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 10 (2017), 2413--2427.
- Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony Dick. 2015. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570 (2015).
- Qingqing Wang, Liqiang Xiao, Yue Lu, Yaohui Jin, and Hao He. 2021. Towards reasoning ability in scene text visual question answering. In Proceedings of the 29th ACM International Conference on Multimedia. ACM, 2281--2289. https://doi.org/10.1145/3474085.3475390
- Junda Wu, Tong Yu, and Shuai Li. 2021. Deconfounded and explainable interactive vision-language retrieval of complex scenes. In Proceedings of the 29th ACM International Conference on Multimedia. ACM, 2103--2111. https://doi.org/10.1145/3474085.3475366
- Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. 2017. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 1821--1830.
- Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6720--6731.
- Kaihao Zhang, Rongqing Li, Yanjiang Yu, Wenhan Luo, and Changsheng Li. 2021. Deep dense multi-scale network for snow removal using semantic and depth priors. IEEE Transactions on Image Processing 30 (2021), 7419--7431.
- Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5579--5588.