Fine-tuning your answers: a bag of tricks for improving VQA models

  • 1202: Multimedia Tools for Digital Twin
  • Published in Multimedia Tools and Applications

Abstract

In this paper, we explore one of the most novel topics in Deep Learning (DL): Visual Question Answering (VQA). This research area combines three of the most important fields in Artificial Intelligence (AI) to automatically provide natural language answers to questions that a user can ask about an image: 1) Computer Vision (CV), 2) Natural Language Processing (NLP) and 3) Knowledge Representation & Reasoning (KR&R). We first review the state of the art in VQA and describe our contributions to it. We then build upon the ideas provided by Pythia, one of the most outstanding approaches, and study its architecture with the aim of presenting several enhancements over the original proposal to fine-tune models using a bag of tricks. Different training strategies are compared to increase the global accuracy and to understand the limitations associated with VQA models. Extended results assess the impact of the different tricks on our enhanced architecture, together with additional qualitative results.
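For orientation, the global accuracy discussed above refers to the consensus-based metric of the VQA dataset (see note 1 below), in which a predicted answer counts as fully correct when at least three of the ten human annotators provided it. The snippet below is a minimal illustrative sketch of that metric only; the function name and data layout are ours, and the official evaluation additionally normalises answer strings and averages over annotator subsets, details omitted here.

```python
# Illustrative sketch of the consensus accuracy metric of the VQA
# dataset (Antol et al., 2015): an answer is fully correct if at
# least 3 of the 10 human annotators gave it.
# Function name and data layout are illustrative, not from the paper.

def vqa_accuracy(predicted_answer: str, human_answers: list) -> float:
    """Consensus accuracy of a single predicted answer."""
    matches = sum(1 for ans in human_answers if ans == predicted_answer)
    return min(matches / 3.0, 1.0)

# Example: 6 of 10 annotators answered "red", 2 "dark red", 2 "maroon".
humans = ["red"] * 6 + ["dark red"] * 2 + ["maroon"] * 2
print(vqa_accuracy("red", humans))       # 1.0   (>= 3 matches)
print(vqa_accuracy("dark red", humans))  # ~0.67 (partial credit)
```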




Notes

  1. The VQA dataset can be downloaded from: https://visualqa.org/download.html

  2. PyTorch can be downloaded from: https://pytorch.org/

  3. Pythia’s FAIR implementation can be downloaded from: https://github.com/facebookresearch/mmf

  4. Detectron can be downloaded from: https://github.com/facebookresearch/Detectron

  5. The technical reference for Pythia’s results by FAIR is currently available at the following link: https://learnpythia.readthedocs.io/en/latest/notes/pretrained_models.html
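As a rough orientation only: the fine-tuning experiments described in the abstract are built on the PyTorch and Pythia/MMF code linked in notes 2 and 3 above. The sketch below shows a generic PyTorch fine-tuning loop for a Pythia-style VQA model that maps image features and question tokens to scores over an answer vocabulary. The model, data loader and hyperparameters are placeholders introduced here for illustration; they are not the paper’s actual implementation or its specific bag of tricks.

```python
# Minimal, generic PyTorch fine-tuning loop, sketched under the assumption
# of a Pythia-style VQA model mapping (image features, question tokens)
# to logits over a fixed answer vocabulary. `model` and `train_loader`
# are illustrative placeholders, not the implementation used in the paper.
import torch
import torch.nn as nn


def fine_tune(model: nn.Module, train_loader, epochs: int = 5, lr: float = 1e-4):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.train()

    # Binary cross-entropy over soft answer targets is a common choice in VQA.
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        for image_feats, question_tokens, answer_targets in train_loader:
            image_feats = image_feats.to(device)
            question_tokens = question_tokens.to(device)
            answer_targets = answer_targets.to(device)

            logits = model(image_feats, question_tokens)  # [batch, n_answers]
            loss = criterion(logits, answer_targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f"epoch {epoch + 1}: last batch loss = {loss.item():.4f}")
```

The concrete training strategies compared in the paper are described in the full text; this loop only indicates where such choices would plug in.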


Acknowledgements

The authors want to thank NielsenIQ for funding the development of this project. This work has also been funded in part by the Spanish MICINN/FEDER through the Techs4AgeCar project (RTI2018-099263-B-C21) and by the RoboCity2030-DIH-CM project (P2018/NMT-4331), funded by Programas de actividades I+D (CAM) and co-funded by EU Structural Funds.

Author information

Corresponding author

Correspondence to Roberto Arroyo.

About this article


Cite this article

Arroyo, R., Álvarez, S., Aller, A. et al. Fine-tuning your answers: a bag of tricks for improving VQA models. Multimed Tools Appl 81, 26889–26913 (2022). https://doi.org/10.1007/s11042-021-11546-z
