
Multi-stage reasoning on introspecting and revising bias for visual question answering

Published: 08 October 2024

Abstract

Visual Question Answering (VQA) is the task of predicting an answer to a question based on the content of an image. However, recent VQA methods tend to rely on language priors between the question and the answer rather than on the image content. To address this issue, many debiasing methods have been proposed to reduce language bias in model reasoning. Yet language bias can be divided into two categories: good bias and bad bias. Good bias can benefit answer prediction, whereas bad bias may lead the model to associate the question with unrelated information. Therefore, instead of indiscriminately removing both kinds of bias as existing debiasing methods do, we propose a bias discrimination module to distinguish between them. In addition, bad bias can reduce the model's reliance on image content during answer reasoning, so that little attention is paid to updating image features. To tackle this, we leverage Markov theory to construct a Markov field whose nodes are image regions and question words. This field supports feature updating for both image regions and question words, thereby facilitating more accurate and comprehensive reasoning about the image content and the question. To verify the effectiveness of our network, we evaluate it on the VQA v2 and VQA-CP v2 datasets and conduct extensive quantitative and qualitative studies. Experimental results show that our network achieves significant improvements over previous state-of-the-art methods.
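To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (1) a bias-discrimination gate that softly separates "good" from "bad" language bias, and (2) one round of message passing over a bipartite field whose nodes are image regions and question words. This is not the authors' released code; the module names, dimensions, and the sigmoid-gating formulation are illustrative assumptions only.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasDiscriminator(nn.Module):
    """Scores how much of the question-only (language-prior) signal to keep."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, q_feat: torch.Tensor, v_feat: torch.Tensor) -> torch.Tensor:
        # g near 1: bias judged "good" (consistent with image evidence);
        # g near 0: bias judged "bad" and down-weighted.
        return torch.sigmoid(self.gate(torch.cat([q_feat, v_feat], dim=-1)))


class CrossModalMessagePassing(nn.Module):
    """One iteration of attention-based message passing between
    image-region nodes and question-word nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.v_update = nn.Linear(2 * dim, dim)
        self.q_update = nn.Linear(2 * dim, dim)

    def forward(self, v_nodes: torch.Tensor, q_nodes: torch.Tensor):
        # v_nodes: (B, R, D) region features; q_nodes: (B, T, D) word features.
        attn_vq = F.softmax(v_nodes @ q_nodes.transpose(1, 2), dim=-1)  # (B, R, T)
        msg_to_v = attn_vq @ q_nodes                                    # words -> regions
        msg_to_q = attn_vq.transpose(1, 2) @ v_nodes                    # regions -> words
        v_nodes = torch.relu(self.v_update(torch.cat([v_nodes, msg_to_v], dim=-1)))
        q_nodes = torch.relu(self.q_update(torch.cat([q_nodes, msg_to_q], dim=-1)))
        return v_nodes, q_nodes


if __name__ == "__main__":
    B, R, T, D = 2, 36, 14, 512
    v, q = torch.randn(B, R, D), torch.randn(B, T, D)
    v2, q2 = CrossModalMessagePassing(D)(v, q)
    g = BiasDiscriminator(D)(q2.mean(1), v2.mean(1))
    print(v2.shape, q2.shape, g.shape)  # (2, 36, 512), (2, 14, 512), (2, 1)
```

In such a sketch, the gate acts as a soft ensemble weight: rather than discarding the question-only branch entirely, its contribution is scaled per example according to how well it agrees with the visual evidence.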


Cited By

  • (2025) Robust data augmentation and contrast learning for debiased visual question answering. Neurocomputing 626, 129527. DOI: 10.1016/j.neucom.2025.129527. Online publication date: Apr-2025.
  • (2024) Adversarial Sample Synthesis for Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 12, 1–24. DOI: 10.1145/3688848. Online publication date: 16-Sep-2024.
  • (2024) Special Issue on Conversational Information Seeking. ACM Transactions on the Web 18, 4, 1–3. DOI: 10.1145/3688392. Online publication date: 8-Oct-2024.

    Published In

    ACM Transactions on the Web, Volume 18, Issue 4
    November 2024, 257 pages
    EISSN: 1559-114X
    DOI: 10.1145/3613734
    Editor: Ryen White

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 October 2024
    Online AM: 28 August 2023
    Accepted: 29 July 2023
    Revised: 22 May 2023
    Received: 04 June 2022
    Published in TWEB Volume 18, Issue 4

    Author Tags

    1. Visual question answering
    2. language bias
    3. attention
    4. artificial intelligence

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • National Natural Science Foundation of China
    • Tianjin Research Innovation Project for Postgraduate Students

    Article Metrics

    • Downloads (last 12 months): 216
    • Downloads (last 6 weeks): 23
    Reflects downloads up to 03 Mar 2025

