MUST-VQA: MUltilingual Scene-Text VQA

Vivoli, Emanuele; Biten, Ali Furkan; Mafla, Andres; Karatzas, Dimosthenis; Gomez, Lluis

doi:10.1007/978-3-031-25069-9_23

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13804))

Included in the following conference series:

European Conference on Computer Vision

1515 Accesses

Abstract

In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and it is not necessarily aligned to the scene text language. Thus, we first introduce a natural step towards a more generalized version of STVQA: MUST-VQA. Accounting for this, we discuss two evaluation scenarios in the constrained setting, namely IID and zero-shot and we demonstrate that the models can perform on a par on a zero-shot setting. We further provide extensive experimentation and show the effectiveness of adapting multilingual language models into STVQA tasks.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

A Multilingual Approach to Scene Text Visual Question Answering

Zero-Shot Translation of Attention Patterns in VQA Models to Natural Language

BIVL-Net: Bidirectional Vision-Language Guidance for Visual Question Answering

Notes

References

Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36(12), 2552–2566 (2014)
Article Google Scholar
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086 (2018)
Google Scholar
Biten, A.F., Litman, R., Xie, Y., Appalaraju, S., Manmatha, R.: Latr: layout-aware transformer for scene-text vqa. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16548–16558 (2022)
Google Scholar
Biten, A.F., Tito, R., Gomez, L., Valveny, E., Karatzas, D.: Ocr-idl: Ocr annotations for industry document library dataset. arXiv preprint arXiv:2202.12985 (2022)
Biten, A.F., et al.: Icdar 2019 competition on scene text visual question answering. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1563–1570. IEEE (2019)
Google Scholar
Biten, A.F., et al.: Scene text visual question answering. In: ICCV, pp. 4291–4301 (2019)
Google Scholar
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics 5, 135–146 (2017)
Article Google Scholar
Borisyuk, F., Gordo, A., Sivakumar, V.: Rosetta: Large scale system for text detection and recognition in images. In: SIGKDD, pp. 71–79 (2018)
Google Scholar
Crystal, D.: Two thousand million? English today 24(1), 3–6 (2008)
Article Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Geirhos, R., Jacobsen, J.H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., Wichmann, F.A.: Shortcut learning in deep neural networks. Nature Mach. Intell. 2(11), 665–673 (2020)
Article Google Scholar
Gómez, L., Biten, A.F., Tito, R., Mafla, A., Rusiñol, M., Valveny, E., Karatzas, D.: Multimodal grid features and cell pointers for scene text visual question answering. Pattern Recogn. Lett. 150, 242–249 (2021)
Article Google Scholar
Han, W., Huang, H., Han, T.: Finding the evidence: Localization-aware answer prediction for text visual question answering. arXiv preprint arXiv:2010.02582 (2020)
Heinzerling, B., Strube, M.: Bpemb: tokenization-free pre-trained subword embeddings in 275 languages. arXiv preprint arXiv:1710.02187 (2017)
Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992–10002 (2020)
Google Scholar
Kant, Y., Batra, D., Anderson, P., Schwing, A., Parikh, D., Lu, J., Agrawal, H.: Spatially aware multimodal transformers for TextVQA. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 715–732. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_41
Chapter Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017)
Article MathSciNet Google Scholar
Liu, Y., et al.: Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Mafla, A., Dey, S., Biten, A.F., Gomez, L., Karatzas, D.: Fine-grained image classification and retrieval by combining visual and locally pooled textual features. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2950–2959 (2020)
Google Scholar
Mathew, M., Bagal, V., Tito, R.P., Karatzas, D., Valveny, E., Jawahar, C.: Infographicvqa. arXiv preprint arXiv:2104.12756 (2021)
Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: a dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2200–2209 (2021)
Google Scholar
Mikulyte, G., Gilbert, D.: An efficient automated data analytics approach to large scale computational comparative linguistics. CoRR (2020)
Google Scholar
Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: Ocr-vqa: visual question answering by reading text in images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 947–952. IEEE (2019)
Google Scholar
Brugués i Pujolràs, J., Gómez i Bigordà, L., Karatzas, D.: A multilingual approach to scene text visual question answering. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. LNCS, vol. 13237, pp. 65–79. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_5
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)
MathSciNet MATH Google Scholar
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp. 91–99 (2015)
Google Scholar
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
Article MathSciNet Google Scholar
Shazeer, N.: Glu variants improve transformer. arXiv preprint arXiv:2002.05202 (2020)
Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: a dataset for image captioning with reading comprehension. arXiv preprint arXiv:2003.12462 (2020)
Singh, A., et al.: Towards vqa models that can read. In: CVPR, pp. 8317–8326 (2019)
Google Scholar
Vaswani, A., Shazeer, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Google Scholar
Xue, L., et al.: mt5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934 (2020)
Yang, Z., et al.: Tap: text-aware pre-training for text-VQA and text-caption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8751–8761 (2021)
Google Scholar
Zhu, Q., Gao, C., Wang, P., Wu, Q.: Simple is not easy: a simple strong baseline for textvqa and textcaps. arXiv preprint arXiv:2012.05153 (2020)

Download references

Acknowledgments

This work has been supported by projects PDC2021-121512-I00, PLEC2021-00785, PID2020-116298GB-I00, ACE034/21/000084, the CERCA Programme/Generalitat de Catalunya, AGAUR project 2019PROD00090 (BeARS), the Ramon y Cajal RYC2020-030777-I/AEI/10.13039/501100011033 and PhD scholarship from UAB (B18P0073).

Author information

Authors and Affiliations

University of Florence, Florence, Italy
Emanuele Vivoli
Computer Vision Center, UAB, Barcelona, Spain
Emanuele Vivoli, Ali Furkan Biten, Andres Mafla, Dimosthenis Karatzas & Lluis Gomez

Authors

Emanuele Vivoli
View author publications
You can also search for this author in PubMed Google Scholar
Ali Furkan Biten
View author publications
You can also search for this author in PubMed Google Scholar
Andres Mafla
View author publications
You can also search for this author in PubMed Google Scholar
Dimosthenis Karatzas
View author publications
You can also search for this author in PubMed Google Scholar
Lluis Gomez
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Emanuele Vivoli .

Editor information

Editors and Affiliations

IBM Research - MIT-IBM Watson AI Lab, Massachusetts, USA
Leonid Karlinsky
Technion – Israel Institute of Technology, Haifa, Israel
Tomer Michaeli
Kyoto University, Kyoto, Japan
Ko Nishino

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Vivoli, E., Biten, A.F., Mafla, A., Karatzas, D., Gomez, L. (2023). MUST-VQA: MUltilingual Scene-Text VQA. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13804. Springer, Cham. https://doi.org/10.1007/978-3-031-25069-9_23

Download citation

DOI: https://doi.org/10.1007/978-3-031-25069-9_23
Published: 14 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25068-2
Online ISBN: 978-3-031-25069-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics