Improving visual question answering by combining scene-text information

  • Collection: 1177: Advances in Deep Learning for Multimodal Fusion and Alignment
  • Published in: Multimedia Tools and Applications

Abstract

Text present in natural scenes carries semantic information about the surrounding environment. For example, the majority of questions that blind people ask about images around them require understanding the text in those images. However, most existing Visual Question Answering (VQA) models do not consider the text present in an image. In this paper, the proposed model fuses multiple inputs: visual features, question features, and OCR tokens. It also captures the relationship between OCR tokens and the objects in an image, which previous models fail to exploit. Unlike previous models on the TextVQA dataset, which treat answer prediction as a single-step classification task, the proposed model uses a decoder based on dynamic pointer networks to predict multi-word answers drawn from both OCR tokens and a fixed vocabulary. OCR tokens are represented using location, appearance, PHOC (Pyramidal Histogram of Characters), and Fisher Vector features, in addition to the FastText features used by previous models on TextVQA. A powerful descriptor is constructed by applying Fisher Vectors (FV) computed from the PHOCs of the text present in images; this FV-based representation is richer than the representations based only on word embeddings used by previous state-of-the-art models. Quantitative and qualitative experiments on popular benchmarks, including TextVQA, ST-VQA, and VQA 2.0, demonstrate the efficacy of the proposed model, which attains 41.23% accuracy on TextVQA, 40.98% on ST-VQA, and 74.98% overall accuracy on VQA 2.0. The results also show that the gap between human and model accuracy is significantly larger on TextVQA and ST-VQA than on VQA 2.0, suggesting that these two datasets can complement VQA 2.0 in future research.
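
To make the descriptor construction concrete, the sketch below shows one way the FV-of-PHOC representation can be built: each OCR token's PHOC records which alphabet characters occur in which region of the word at several pyramid levels, and a Gaussian mixture model fitted over PHOCs yields a first-order Fisher Vector for the image's text. This is a minimal illustration, not the authors' code; the alphabet, pyramid levels, GMM size, and the use of scikit-learn's GaussianMixture are all assumptions.

```python
# Illustrative sketch only: PHOC + Fisher Vector descriptor for OCR tokens.
# The alphabet, pyramid levels, and GMM size below are assumptions, not the
# paper's exact settings.
import numpy as np
from sklearn.mixture import GaussianMixture

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word, levels=(2, 3, 4, 5)):
    """Pyramidal Histogram of Characters: at each pyramid level the word is
    split into equal regions, and each region gets a binary histogram of the
    alphabet symbols whose midpoints fall inside it."""
    word = word.lower()
    n = max(len(word), 1)
    feats = []
    for level in levels:
        for region in range(level):
            lo, hi = region / level, (region + 1) / level
            hist = np.zeros(len(ALPHABET))
            for i, ch in enumerate(word):
                if ch in ALPHABET and lo <= (i + 0.5) / n < hi:
                    hist[ALPHABET.index(ch)] = 1.0
            feats.append(hist)
    return np.concatenate(feats)

def fisher_vector(descriptors, gmm):
    """First-order Fisher Vector (gradients w.r.t. the Gaussian means only,
    for brevity) of a descriptor set under a diagonal-covariance GMM."""
    q = gmm.predict_proba(descriptors)                       # (N, K) soft assignments
    diff = descriptors[:, None, :] - gmm.means_[None, :, :]  # (N, K, D)
    diff /= np.sqrt(gmm.covariances_)[None, :, :]            # whiten per dimension
    fv = (q[:, :, None] * diff).sum(axis=0)                  # (K, D)
    fv /= descriptors.shape[0] * np.sqrt(gmm.weights_)[:, None]
    fv = fv.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalisation

# Usage: encode all OCR tokens of one image as a fixed-length text descriptor.
tokens = ["stop", "main", "street"]
phocs = np.stack([phoc(t) for t in tokens])
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(phocs)
image_text_descriptor = fisher_vector(phocs, gmm)
```

In practice the GMM would be fitted offline on a large corpus of PHOCs; fitting it on three tokens above only keeps the example self-contained.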
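
The multi-word decoding step can be pictured in the same spirit: at each step the decoder scores the fixed answer vocabulary and the image's OCR tokens in a single output space, so the predicted answer may interleave vocabulary words with copied scene text. The PyTorch sketch below is a hedged illustration with assumed module names and dimensions, not the paper's implementation.

```python
# Hedged sketch of one step of a dynamic pointer-network decoder: scores for
# the fixed vocabulary and for the image's OCR tokens share one output space,
# so each step may either emit a vocabulary word or copy an OCR token.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class PointerDecoderStep(nn.Module):
    def __init__(self, hidden_dim, vocab_size, ocr_feat_dim):
        super().__init__()
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)    # fixed-vocabulary scores
        self.query_proj = nn.Linear(hidden_dim, ocr_feat_dim)  # decoder state -> OCR query

    def forward(self, dec_state, ocr_feats):
        # dec_state: (B, hidden_dim); ocr_feats: (B, N_ocr, ocr_feat_dim)
        vocab_scores = self.vocab_head(dec_state)                         # (B, V)
        query = self.query_proj(dec_state)                                # (B, ocr_feat_dim)
        ocr_scores = torch.bmm(ocr_feats, query.unsqueeze(2)).squeeze(2)  # (B, N_ocr)
        # one softmax over the concatenation lets every step choose between
        # a vocabulary word (first V slots) and an OCR token (last N_ocr slots)
        return torch.cat([vocab_scores, ocr_scores], dim=1)

# Usage with toy dimensions: answers can mix vocabulary words and scene text.
step = PointerDecoderStep(hidden_dim=768, vocab_size=5000, ocr_feat_dim=300)
logits = step(torch.randn(2, 768), torch.randn(2, 12, 300))  # (2, 5000 + 12)
next_word = logits.argmax(dim=1)  # index < 5000 -> vocab word, else OCR token
```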

Author information

Corresponding author

Correspondence to Himanshu Sharma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sharma, H., Jalal, A.S. Improving visual question answering by combining scene-text information. Multimed Tools Appl 81, 12177–12208 (2022). https://doi.org/10.1007/s11042-022-12317-0
