Abstract
Both Visual Question Answering (VQA) and image captioning are problems that span the Computer Vision (CV) and Natural Language Processing (NLP) domains. In general, computer vision models are used to represent visual content, while NLP algorithms are used to represent sentences. In recent years, VQA and image captioning have been tackled independently, even though they require similar types of algorithms. In this paper, a joint relationship between these two tasks is established and exploited. We present an image-captioning-based VQA model that transfers the knowledge learnt from the image captioning task to the VQA task. We integrate the image captioning module into the VQA model by fusing the features obtained from the captioning model with the attention-based visual features. The experimental results demonstrate improvements in answer accuracy of 3.45% on VQA 1.0, 3.33% on VQA 2.0, and 1.73% on VQA-CP v2 over state-of-the-art VQA models.
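The fusion step described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the dot-product attention scoring and the element-wise product fusion are assumptions chosen as common choices in the VQA literature, and all function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_caption_and_visual(caption_feat, visual_feats, question_feat):
    """Fuse a caption embedding with question-attended visual features.

    caption_feat:  (d,)  embedding of the generated caption
    visual_feats:  (num_regions, d)  per-region image features
    question_feat: (d,)  embedding of the question
    """
    # Question-guided attention over image regions
    # (dot-product scoring is an illustrative assumption).
    scores = visual_feats @ question_feat        # (num_regions,)
    weights = softmax(scores)                    # attention weights, sum to 1
    attended = weights @ visual_feats            # (d,) attended visual feature

    # Element-wise product fusion of caption and visual features
    # (one common multimodal fusion choice).
    return caption_feat * attended

rng = np.random.default_rng(0)
d, num_regions = 8, 5
fused = fuse_caption_and_visual(
    rng.normal(size=d),
    rng.normal(size=(num_regions, d)),
    rng.normal(size=d),
)
print(fused.shape)  # (8,)
```

In a full model, the fused vector would then feed a classifier over the answer vocabulary; here the sketch stops at the fused representation.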
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Sharma, H., Jalal, A.S. Image captioning improved visual question answering. Multimed Tools Appl 81, 34775–34796 (2022). https://doi.org/10.1007/s11042-021-11276-2