MAPM: multiscale attention pre-training model for TextVQA


Abstract

The Text Visual Question Answering (TextVQA) task aims to enable models to read and answer questions about images that contain text. Existing attention-based methods for TextVQA often struggle to align local features across modalities during multimodal interaction, and this misalignment hinders their ability to answer such questions accurately. To address this issue, the Multiscale Attention Pre-training Model (MAPM) is proposed to enhance multimodal feature fusion. MAPM introduces multiscale attention modules that perform fine-grained local feature enhancement and global feature fusion across modalities, allowing it to align and integrate visual and textual information more effectively. In addition, MAPM is pre-trained on scene text with three pre-training tasks: masked language modeling, visual region matching, and OCR visual-text matching. This pre-training establishes effective semantic alignment among the modalities. Experimental evaluations demonstrate the superiority of MAPM, which achieves 1.2% higher accuracy than state-of-the-art models on the TextVQA dataset, especially when handling numerical data within images.
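
The abstract describes two stages of attention: fine-grained local (joint) attention between text tokens and visual regions, and a global attention that fuses modality-level summaries. The following PyTorch sketch is only an illustration of how such a two-stage cross-modal fusion could be wired; the module names, gating scheme, and dimensions are assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn


class JointAttention(nn.Module):
    """Fine-grained cross-modal attention: text tokens attend to visual regions."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) token features, visual: (B, R, D) region features
        attended, _ = self.attn(query=text, key=visual, value=visual)
        return self.norm(text + attended)


class GlobalAttention(nn.Module):
    """Global fusion: gate pooled modality summaries to suppress redundant features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        t, v = text.mean(dim=1), visual.mean(dim=1)   # (B, D) pooled summaries
        g = self.gate(torch.cat([t, v], dim=-1))      # learned per-feature gate
        return g * t + (1 - g) * v                    # fused global representation


if __name__ == "__main__":
    B, T, R, D = 2, 12, 36, 256
    text, visual = torch.randn(B, T, D), torch.randn(B, R, D)
    local = JointAttention(D)(text, visual)    # (B, T, D) locally enhanced text features
    fused = GlobalAttention(D)(local, visual)  # (B, D) fused global multimodal feature
    print(local.shape, fused.shape)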

Graphical abstract

The Multiscale Attention Pre-training Model (MAPM) is proposed for the TextVQA task: a Joint Attention Module enhances local fine-grained features, and a Global Attention Module addresses redundancy in global features. Three pre-training tasks are designed to strengthen the model's expressive power and address cross-modal semantic alignment.
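
The three pre-training tasks named above (masked language modeling, visual region matching, OCR visual-text matching) are typically combined into a single training objective. The sketch below is an assumption-laden illustration rather than the paper's code: the head names and the equal loss weighting are hypothetical.

import torch
import torch.nn as nn


class PretrainingHeads(nn.Module):
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.mlm_head = nn.Linear(dim, vocab_size)  # predict masked tokens
        self.vrm_head = nn.Linear(dim, 2)           # visual region matching: match / no match
        self.ovtm_head = nn.Linear(dim, 2)          # OCR visual-text matching: match / no match
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks unmasked positions

    def forward(self, token_feats, fused_feats, mlm_labels, vrm_labels, ovtm_labels):
        # token_feats: (B, T, D) contextual token features; fused_feats: (B, D) fused features
        mlm_loss = self.ce(self.mlm_head(token_feats).flatten(0, 1), mlm_labels.flatten())
        vrm_loss = self.ce(self.vrm_head(fused_feats), vrm_labels)
        ovtm_loss = self.ce(self.ovtm_head(fused_feats), ovtm_labels)
        return mlm_loss + vrm_loss + ovtm_loss  # equal weighting is an assumption

In practice, the matching labels for the two matching heads would come from constructing matched and mismatched image-text and OCR-region pairs during pre-training data preparation.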


Availability of data and materials

The datasets generated and/or analysed during the current study are available in the TextVQA and ST-VQA repositories. For data preprocessing, please refer to the linked instructions.


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 61807002).

Funding

This work is supported by the National Natural Science Foundation of China (Grant No. 61807002).

Author information


Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Yingying Li and Yue Yang. The first draft of the manuscript was written by Yue Yang and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yue Yu.

Ethics declarations

Conflict of interest/Competing interests

The authors declare that they have no conflict of interest or competing interests.

Ethics approval

All authors confirm that this work meets the requirements of ethical responsibility.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, Y., Yu, Y. & Li, Y. MAPM: multiscale attention pre-training model for TextVQA. Appl Intell 54, 10401–10413 (2024). https://doi.org/10.1007/s10489-024-05727-0

