Fine-tuning your answers: a bag of tricks for improving VQA models

  • 1202: Multimedia Tools for Digital Twin
  • Published in Multimedia Tools and Applications

Abstract

In this paper, we explore one of the most novel topics in Deep Learning (DL): Visual Question Answering (VQA). This research area combines three of the most important fields in Artificial Intelligence (AI) to automatically provide natural language answers to questions that a user can ask about an image: 1) Computer Vision (CV), 2) Natural Language Processing (NLP) and 3) Knowledge Representation & Reasoning (KR&R). We first review the state of the art in VQA and describe our contributions to it. We then build upon the ideas provided by Pythia, one of the most outstanding approaches, and study its architecture with the aim of presenting several enhancements over the original proposal to fine-tune models using a bag of tricks. Different training strategies are compared to increase the global accuracy and to understand the limitations associated with VQA models. Extended results assess the impact of the different tricks on our enhanced architecture, together with additional qualitative results.
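For orientation, the global accuracy discussed above refers to the consensus-based metric of the VQA dataset (see note 1 below), in which a predicted answer counts as fully correct when at least three of the ten human annotators provided it. The snippet below is a minimal illustrative sketch of that metric only; the function name and data layout are ours, and the official evaluation additionally normalises answer strings and averages over annotator subsets, details omitted here.

```python
# Illustrative sketch of the consensus accuracy metric of the VQA
# dataset (Antol et al., 2015): an answer is fully correct if at
# least 3 of the 10 human annotators gave it.
# Function name and data layout are illustrative, not from the paper.

def vqa_accuracy(predicted_answer: str, human_answers: list) -> float:
    """Consensus accuracy of a single predicted answer."""
    matches = sum(1 for ans in human_answers if ans == predicted_answer)
    return min(matches / 3.0, 1.0)

# Example: 6 of 10 annotators answered "red", 2 "dark red", 2 "maroon".
humans = ["red"] * 6 + ["dark red"] * 2 + ["maroon"] * 2
print(vqa_accuracy("red", humans))       # 1.0   (>= 3 matches)
print(vqa_accuracy("dark red", humans))  # ~0.67 (partial credit)
```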




Notes

  1. The VQA dataset can be downloaded from: https://visualqa.org/download.html

  2. PyTorch can be downloaded from: https://pytorch.org/

  3. Pythia’s FAIR implementation can be downloaded from: https://github.com/facebookresearch/mmf

  4. Detectron can be downloaded from: https://github.com/facebookresearch/Detectron

  5. The technical reference for Pythia’s results by FAIR is currently available at the following link: https://learnpythia.readthedocs.io/en/latest/notes/pretrained_models.html
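As a rough orientation only: the fine-tuning experiments described in the abstract are built on the PyTorch and Pythia/MMF code linked in notes 2 and 3 above. The sketch below shows a generic PyTorch fine-tuning loop for a Pythia-style VQA model that maps image features and question tokens to scores over an answer vocabulary. The model, data loader and hyperparameters are placeholders introduced here for illustration; they are not the paper’s actual implementation or its specific bag of tricks.

```python
# Minimal, generic PyTorch fine-tuning loop, sketched under the assumption
# of a Pythia-style VQA model mapping (image features, question tokens)
# to logits over a fixed answer vocabulary. `model` and `train_loader`
# are illustrative placeholders, not the implementation used in the paper.
import torch
import torch.nn as nn


def fine_tune(model: nn.Module, train_loader, epochs: int = 5, lr: float = 1e-4):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.train()

    # Binary cross-entropy over soft answer targets is a common choice in VQA.
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        for image_feats, question_tokens, answer_targets in train_loader:
            image_feats = image_feats.to(device)
            question_tokens = question_tokens.to(device)
            answer_targets = answer_targets.to(device)

            logits = model(image_feats, question_tokens)  # [batch, n_answers]
            loss = criterion(logits, answer_targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f"epoch {epoch + 1}: last batch loss = {loss.item():.4f}")
```

The concrete training strategies compared in the paper are described in the full text; this loop only indicates where such choices would plug in.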


Acknowledgements

The authors want to thank NielsenIQ for funding the development of this project. This work has also been funded in part by the Spanish MICINN/FEDER through the Techs4AgeCar project (RTI2018-099263-B-C21) and by the RoboCity2030-DIH-CM project (P2018/NMT-4331), funded by Programas de actividades I+D (CAM) and co-funded by EU Structural Funds.

Author information

Corresponding author

Correspondence to Roberto Arroyo.

About this article


Cite this article

Arroyo, R., Álvarez, S., Aller, A. et al. Fine-tuning your answers: a bag of tricks for improving VQA models. Multimed Tools Appl 81, 26889–26913 (2022). https://doi.org/10.1007/s11042-021-11546-z
