Abstract
This paper explores Visual Question Answering (VQA), one of the newest topics in Deep Learning (DL). VQA combines three major fields of Artificial Intelligence (AI) to automatically provide natural language answers to questions that a user asks about an image: 1) Computer Vision (CV), 2) Natural Language Processing (NLP) and 3) Knowledge Representation & Reasoning (KR&R). We first review the state of the art in VQA and discuss our contributions to it. We then build upon the ideas of Pythia, one of the most outstanding approaches. We study Pythia’s architecture and present several enhancements over the original proposal in order to fine-tune models using a bag of tricks. Several training strategies are compared to increase the global accuracy and to understand the limitations associated with VQA models. Extended results assess the impact of the different tricks on our enhanced architecture, together with additional qualitative results.
Notes
The VQA dataset can be downloaded from: https://visualqa.org/download.html
PyTorch can be downloaded from: https://pytorch.org/
Pythia’s FAIR implementation can be downloaded from: https://github.com/facebookresearch/mmf
Detectron can be downloaded from: https://github.com/facebookresearch/Detectron
The technical reference for Pythia’s results by FAIR is currently available at the following link: https://learnpythia.readthedocs.io/en/latest/notes/pretrained_models.html
Acknowledgements
The authors thank NielsenIQ for funding the development of this project. This work has also been funded in part by the Spanish MICINN/FEDER through the Techs4AgeCar project (RTI2018-099263-B-C21) and by the RoboCity2030-DIH-CM project (P2018/NMT-4331), funded by Programas de actividades I+D (CAM) and cofunded by EU Structural Funds.
About this article
Cite this article
Arroyo, R., Álvarez, S., Aller, A. et al. Fine-tuning your answers: a bag of tricks for improving VQA models. Multimed Tools Appl 81, 26889–26913 (2022). https://doi.org/10.1007/s11042-021-11546-z
DOI: https://doi.org/10.1007/s11042-021-11546-z