
Multi-stage reasoning on introspecting and revising bias for visual question answering

Published: 08 October 2024

Abstract

Visual Question Answering (VQA) is the task of predicting an answer to a question based on the content of an image. However, recent VQA methods tend to rely on language priors between the question and the answer rather than on the image content. To address this issue, many debiasing methods have been proposed to reduce language bias in model reasoning. Yet language bias can be divided into two categories: good bias and bad bias. Good bias can benefit answer prediction, whereas bad bias may lead the model to associate the question with unrelated information. Therefore, instead of indiscriminately removing both kinds of bias as existing debiasing methods do, we propose a bias discrimination module to distinguish between them. In addition, bad bias can reduce the model's reliance on image content during answer reasoning, so that little attention is paid to updating image features. To tackle this, we leverage Markov theory to construct a Markov field whose nodes are image regions and question words. This field supports feature updating for both image regions and question words, thereby facilitating more accurate and comprehensive reasoning about the image content and the question. To verify the effectiveness of our network, we evaluate it on the VQA v2 and VQA-CP v2 datasets and conduct extensive quantitative and qualitative studies. Experimental results show that our network achieves significant improvements over previous state-of-the-art methods.
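To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (1) a bias-discrimination gate that softly separates "good" from "bad" language bias, and (2) one round of message passing over a bipartite field whose nodes are image regions and question words. This is not the authors' released code; the module names, dimensions, and the sigmoid-gating formulation are illustrative assumptions only.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasDiscriminator(nn.Module):
    """Scores how much of the question-only (language-prior) signal to keep."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, q_feat: torch.Tensor, v_feat: torch.Tensor) -> torch.Tensor:
        # g near 1: bias judged "good" (consistent with image evidence);
        # g near 0: bias judged "bad" and down-weighted.
        return torch.sigmoid(self.gate(torch.cat([q_feat, v_feat], dim=-1)))


class CrossModalMessagePassing(nn.Module):
    """One iteration of attention-based message passing between
    image-region nodes and question-word nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.v_update = nn.Linear(2 * dim, dim)
        self.q_update = nn.Linear(2 * dim, dim)

    def forward(self, v_nodes: torch.Tensor, q_nodes: torch.Tensor):
        # v_nodes: (B, R, D) region features; q_nodes: (B, T, D) word features.
        attn_vq = F.softmax(v_nodes @ q_nodes.transpose(1, 2), dim=-1)  # (B, R, T)
        msg_to_v = attn_vq @ q_nodes                                    # words -> regions
        msg_to_q = attn_vq.transpose(1, 2) @ v_nodes                    # regions -> words
        v_nodes = torch.relu(self.v_update(torch.cat([v_nodes, msg_to_v], dim=-1)))
        q_nodes = torch.relu(self.q_update(torch.cat([q_nodes, msg_to_q], dim=-1)))
        return v_nodes, q_nodes


if __name__ == "__main__":
    B, R, T, D = 2, 36, 14, 512
    v, q = torch.randn(B, R, D), torch.randn(B, T, D)
    v2, q2 = CrossModalMessagePassing(D)(v, q)
    g = BiasDiscriminator(D)(q2.mean(1), v2.mean(1))
    print(v2.shape, q2.shape, g.shape)  # (2, 36, 512), (2, 14, 512), (2, 1)
```

In such a sketch, the gate acts as a soft ensemble weight: rather than discarding the question-only branch entirely, its contribution is scaled per example according to how well it agrees with the visual evidence.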


Cited By

  • (2025) Robust data augmentation and contrast learning for debiased visual question answering. Neurocomputing 626, 129527. DOI: 10.1016/j.neucom.2025.129527. Online publication date: Apr-2025.
  • (2024) Adversarial Sample Synthesis for Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 12, 1–24. DOI: 10.1145/3688848. Online publication date: 16-Sep-2024.
  • (2024) Special Issue on Conversational Information Seeking. ACM Transactions on the Web 18, 4, 1–3. DOI: 10.1145/3688392. Online publication date: 8-Oct-2024.

    Published In

    ACM Transactions on the Web, Volume 18, Issue 4
    November 2024, 257 pages
    EISSN: 1559-114X
    DOI: 10.1145/3613734
    Editor: Ryen White

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 October 2024
    Online AM: 28 August 2023
    Accepted: 29 July 2023
    Revised: 22 May 2023
    Received: 04 June 2022
    Published in TWEB Volume 18, Issue 4

    Author Tags

    1. Visual question answering
    2. language bias
    3. attention
    4. artificial intelligence

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • National Natural Science Foundation of China
    • Tianjin Research Innovation Project for Postgraduate Students

    Article Metrics

    • Downloads (last 12 months): 216
    • Downloads (last 6 weeks): 23
    Reflects downloads up to 03 Mar 2025

