
Image captioning improved visual question answering

Multimedia Tools and Applications
Collection 1174: Futuristic Trends and Innovations in Multimedia Systems Using Big Data, IoT and Cloud Technologies (FTIMS)

Abstract

Both Visual Question Answering (VQA) and image captioning are problems that involve the Computer Vision (CV) and Natural Language Processing (NLP) domains. In general, computer vision models are used to represent visual content, while NLP algorithms are used to represent sentences. In recent years, VQA and image captioning have been tackled independently, although they require similar types of algorithms. In this paper, a joint relationship between these two tasks is established and exploited. We present an image captioning based VQA model that transfers the knowledge learnt from the image captioning task to the VQA task. We integrate the image captioning module into the VQA model by fusing the features obtained from the captioning model with attention-based visual features. The experimental results demonstrate an improvement in answer generation accuracy of 3.45% on VQA 1.0, 3.33% on VQA 2.0 and 1.73% on VQA-CP v2 over state-of-the-art VQA models.
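To make the fusion step concrete, below is a minimal PyTorch sketch, not the authors' implementation: the feature dimensions, the 36-region input, and the element-wise (Hadamard) fusion are all assumptions, since the abstract only states that caption-model features are fused with attention-based visual features before answer prediction.

```python
import torch
import torch.nn as nn

class CaptionFusedVQA(nn.Module):
    """Hypothetical sketch of caption-feature fusion for VQA.

    Caption features from a pre-trained captioning model are fused with
    question-guided (attention-weighted) region features before answer
    classification. All dimensions and the Hadamard fusion are assumptions,
    not the authors' published architecture.
    """

    def __init__(self, vis_dim=2048, cap_dim=1024, q_dim=1024,
                 hid_dim=1024, num_answers=3000):
        super().__init__()
        # Question-guided attention over image regions
        self.att = nn.Sequential(
            nn.Linear(vis_dim + q_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, 1),
        )
        self.v_proj = nn.Linear(vis_dim, hid_dim)   # attended visual feature
        self.c_proj = nn.Linear(cap_dim, hid_dim)   # caption-model feature
        self.q_proj = nn.Linear(q_dim, hid_dim)     # question embedding
        self.classifier = nn.Linear(hid_dim, num_answers)

    def forward(self, v, c, q):
        # v: (B, R, vis_dim) region features; c: (B, cap_dim) caption
        # features; q: (B, q_dim) question embedding
        q_exp = q.unsqueeze(1).expand(-1, v.size(1), -1)
        alpha = torch.softmax(self.att(torch.cat([v, q_exp], dim=-1)), dim=1)
        v_att = (alpha * v).sum(dim=1)              # (B, vis_dim)
        # Element-wise (Hadamard) fusion of the three modalities
        fused = self.v_proj(v_att) * self.c_proj(c) * self.q_proj(q)
        return self.classifier(fused)               # answer logits

# Usage with dummy tensors (batch of 8, 36 regions per image):
# v, c, q = torch.randn(8, 36, 2048), torch.randn(8, 1024), torch.randn(8, 1024)
# logits = CaptionFusedVQA()(v, c, q)              # shape (8, 3000)
```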





Author information

Corresponding author

Correspondence to Himanshu Sharma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Sharma, H., Jalal, A.S. Image captioning improved visual question answering. Multimed Tools Appl 81, 34775–34796 (2022). https://doi.org/10.1007/s11042-021-11276-2

