
Overcoming language priors with self-contrastive learning for visual question answering


Abstract

Although remarkable success has been achieved on the Visual Question Answering (VQA) task in recent years, most existing models are heavily driven by superficial linguistic correlations in the training set and largely ignore image content. Several recent methods introduce auxiliary tasks (visual annotation, counterfactual samples, etc.) to overcome language priors and strengthen image dependence. However, the inherent priors of the original models, i.e., the tendency to answer by memorizing priors in the training data, have still not been resolved. We therefore propose a novel self-contrastive learning method that addresses this problem without introducing auxiliary tasks: it contrasts the answers predicted when the question attends to question-relevant image regions with those predicted from question-irrelevant regions. Concretely, attending to question-relevant regions and to question-irrelevant regions yields different answer spaces, and this contrast prevents the model from being driven by surface language priors, forcing the question to rely on the relevant image regions to predict the correct answer. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method. In particular, built on top of the LMH model, our method achieves state-of-the-art performance of 59.00% on the most commonly used benchmark VQA-CP v2 without auxiliary tasks, an improvement of 6.51%.
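To make the idea above concrete, the following is a minimal sketch, not the authors' released code, of how such a self-contrastive objective could be implemented in PyTorch. The attention-based region split (top-k vs. the rest), the multiplicative fusion, the classifier interface, and the loss on the irrelevant-region branch are illustrative assumptions; the paper's actual formulation may differ.

```python
# Hypothetical illustration of a self-contrastive VQA objective (not the authors' code).
# Assumed interface: `att` are question-over-region attention weights, `classifier`
# maps a fused vector to answer logits, `labels` are soft VQA answer scores.
import torch
import torch.nn.functional as F


def self_contrastive_loss(att, region_feats, q_feat, classifier, labels, k=9):
    """att: (B, R), region_feats: (B, R, D), q_feat: (B, D), labels: (B, A)."""
    # Split regions into the k most-attended (question-relevant) ones and the rest.
    topk = att.topk(k, dim=1).indices                        # (B, k)
    mask_rel = torch.zeros_like(att).scatter_(1, topk, 1.0)  # 1 for relevant regions
    mask_irr = 1.0 - mask_rel                                 # 1 for irrelevant regions

    def predict(mask):
        # Re-normalise attention inside the chosen subset and pool region features.
        w = att * mask
        w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)
        v = (w.unsqueeze(-1) * region_feats).sum(dim=1)       # (B, D)
        return classifier(v * q_feat)                         # simple multiplicative fusion

    logits_rel = predict(mask_rel)  # answers from question-relevant regions
    logits_irr = predict(mask_irr)  # answers from question-irrelevant regions

    # The relevant-region branch should recover the ground-truth answers ...
    loss_rel = F.binary_cross_entropy_with_logits(logits_rel, labels)
    # ... while the irrelevant-region branch should assign them low probability,
    # so that language priors alone cannot produce the correct answer.
    loss_irr = (torch.sigmoid(logits_irr) * labels).sum(dim=1).mean()

    return loss_rel + loss_irr
```

In training, a term of this kind would be added to the base VQA objective (e.g., the LMH debiased loss), so that correct answers remain predictable only from the question-relevant branch.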



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 81860318 and 81560296.

Author information


Corresponding author

Correspondence to Qingsong Huang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yan, H., Liu, L., Feng, X. et al. Overcoming language priors with self-contrastive learning for visual question answering. Multimed Tools Appl 82, 16343–16358 (2023). https://doi.org/10.1007/s11042-022-14167-2

