STR Transformer: A Cross-domain Transformer for Scene Text Recognition

Wu, Xing; Tang, Bin; Zhao, Ming; Wang, Jianjia; Guo, Yike

doi:10.1007/s10489-022-03728-5

Published: 31 May 2022

Volume 53, pages 3444–3458 (2023)

Applied Intelligence

Xing Wu (ORCID: orcid.org/0000-0001-5331-022X)¹,²,
Bin Tang¹,
Ming Zhao³,
Jianjia Wang¹,² &
Yike Guo⁴

1121 Accesses
7 Citations
1 Altmetric

Abstract

Scene text recognition, which aims to extract text information from an image, is an indispensable part of computer vision. However, extracting text that conforms to spelling rules remains a challenge. We propose a cross-domain Transformer, called STR Transformer (STRT), which not only extracts text from an image but also corrects characters according to spelling rules. Specifically, we propose a Swin Transformer to extract hierarchical image features without convolution layers; it offers the flexibility to build models at various scales and has linear computational complexity with respect to image size. Furthermore, an iterative Text Transformer is designed to predict the probability distribution of the current character in the character sequence, which effectively reduces the impact of noise. Extensive experiments demonstrate that the proposed STRT outperforms state-of-the-art methods on various scene text recognition benchmarks. Qualitative and quantitative analyses demonstrate the effectiveness and efficiency of the proposed STRT method.
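
The claim of linear complexity with respect to image size can be illustrated with a small calculation. The sketch below uses the cost formulas for global and window-based self-attention given in the Swin Transformer paper (Liu et al. 2021). Whether the STRT feature extractor follows exactly this windowing scheme is an assumption here, and the function names, the channel width of 96 and the window size of 7 are illustrative choices, not values taken from the article.

```python
def global_attention_cost(h, w, c):
    """Approximate cost of global self-attention over an h x w grid of
    tokens with c channels: 4*h*w*c^2 + 2*(h*w)^2*c.
    The second term is quadratic in the number of tokens."""
    tokens = h * w
    return 4 * tokens * c ** 2 + 2 * tokens ** 2 * c


def window_attention_cost(h, w, c, m):
    """Approximate cost when attention is restricted to non-overlapping
    m x m windows: 4*h*w*c^2 + 2*m^2*h*w*c.
    With m fixed, both terms are linear in the number of tokens."""
    tokens = h * w
    return 4 * tokens * c ** 2 + 2 * m ** 2 * tokens * c


if __name__ == "__main__":
    channels, window = 96, 7  # illustrative values (assumptions)
    for side in (56, 112, 224):
        g = global_attention_cost(side, side, channels)
        l = window_attention_cost(side, side, channels, window)
        # The ratio grows with the grid size because only the global
        # variant has a term quadratic in the token count.
        print(f"{side}x{side} tokens: global/window cost ratio = {g / l:.1f}")
```

Doubling the side length of the token grid quadruples the number of tokens; the windowed cost then grows by exactly a factor of four, while the global cost grows faster.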

References

Olszewska J I (2015) Active contour based optical character recognition for automated scene understanding. Neurocomputing 161:65–71
Karaoglu S, Tao R, Gevers T, Smeulders Arnold WM (2016) Words matter: Scene text for image classification and retrieval. IEEE Trans Multimed 19(5):1063–1076
Singh A, Natarajan V, Shah M, Jiang Y, Chen X, Batra D, Parikh D, Rohrbach M (2019) Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8317–8326
Wei L, Chen C, Wong K Y, Su Z, Han J (2016) Star-net: A spatial attention residue network for scene text recognition. In: British Machine Vision Conference 2016
Shi B, Yang M, Wang X, Lyu P, Yao C, Bai X (2018) Aster: An attentional scene text recognizer with flexible rectification. IEEE Trans Pattern Anal Mach Intell 41(9):2035–2048
Yu D, Li X, Zhang C, Liu T, Ding E (2020) Towards accurate scene text recognition with semantic reasoning networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Fang S, Xie H, Wang Y, Mao Z, Zhang Y (2021) Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 7098–7107
Chen Y, Zhuang T, Guo K (2021) Memory network with hierarchical multi-head attention for aspect-based sentiment analysis
Sun S- (2021) Self-attention enhanced cnns with average margin loss for chinese zero pronoun resolution
Cho K, van Merrienboer B, Gulcehre C, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)
Neumann L, Matas J (2012) Real-time scene text localization and recognition. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition
Risnumawan A, Shivakumara P, Chan C S, Tan C L (2014) A robust arbitrary text detection system for natural scene images. Expert Syst Appl 41(18):8027–8048
Wu X, Zhong M, Guo Y, Fujita H (2020) The assessment of small bowel motility with attentive deformable neural network. Inf Sci 508:22–32
Wu X, Chen C, Zhong M, Wang J (2021) Hal: Hybrid active learning for efficient labeling in medical domain
Wu X, Chen C, Zhong M, Wang J, Shi J (2021) Covid-al: The diagnosis of covid-19 with deep active learning. Med Image Anal 68:101913
Baek J, Kim G, Lee J, Park S, Han D, Yun S, Oh S J, Lee H (2019) What is wrong with scene text recognition model comparisons? dataset and model analysis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4715–4723
Su B, Lu S (2014) Accurate scene text recognition based on recurrent neural network
Shi B, Xiang B, Cong Y (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell 39(11):2298–2304
Su B, Lu S (2017) Accurate recognition of words in scenes without character segmentation using recurrent neural network. Pattern Recogn 63:397–405
Li W, Wang Q, Wu J, Yu Z (2021) Piecewise convolutional neural networks with position attention and similar bag attention for distant supervision relation extraction
Pei M, Wu X, Guo Y, Fujita H (2017) Small bowel motility assessment based on fully convolutional networks and long short-term memory. Knowl-Based Syst 121:163–172
Li H, Wang P, Shen C, Zhang G (2019) Show, attend and read: A simple and strong baseline for irregular text recognition. Proceedings of the AAAI Conference on Artificial Intelligence 33:8610–8617
Cheng Z, Xu Y, Fan B, Yi N, Zhou S (2018) Aon: Towards arbitrarily-oriented text recognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Li H, Wang P, Shen C (2017) Towards end-to-end text spotting with convolutional recurrent neural networks. In: Proceedings of the IEEE international conference on computer vision, pp 5238–5246
Wang P, Li H, Shen C (2021) Towards end-to-end text spotting in natural scenes
Lyu P, Liao M, Yao C, Wu W, Bai X (2018) Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 67–83
Xing L, Tian Z, Huang W, Scott M R (2019) Convolutional character networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 9126–9136
He P, Huang W, Qiao Y, Loy C C, Tang X (2016) Reading scene text in deep convolutional sequences. In: Thirtieth AAAI conference on artificial intelligence
Ma X, He K, Zhang D, Li D (2021) Pieed: Position information enhanced encoder-decoder framework for scene text recognition
Yin G, Chen F, Dong Y, Li G (2021) Knowledge-aware recommendation model with dynamic co-attention and attribute regularize
Lee C Y, Osindero S (2016) Recursive recurrent nets with attention modeling for ocr in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2021) Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, PMLR, pp 10347–10357
Atienza R (2021) Vision transformer for fast and efficient scene text recognition. In: International Conference on Document Analysis and Recognition, Springer, pp 319–334
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 10012–10022
Al-Rfou R, Choe D, Constant N, Guo M, Jones L (2019) Character-level language modeling with deeper self-attention. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 3159–3166
Wang T, Zhu Y, Jin L, Luo C, Chen X, Wu Y, Wang Q, Cai M (2020) Decoupled attention network for text recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, pp 12216–12224
Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2014) Synthetic data and artificial neural networks for natural scene text recognition. Neural Information Processing Systems
Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localisation in natural images. In: IEEE Conference on Computer Vision and Pattern Recognition
Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda L G I, Mestre S R, Romeu J M, Mota D F, Almazán J, Heras L D L (2013) Icdar 2013 robust reading competition
Kai W, Babenko B, Belongie S (2012) End-to-end scene text recognition. In: IEEE International Conference on Computer Vision
Mishra A, Alahari K, Jawahar CV (2012) Scene text recognition using higher order language priors. In: BMVC-British Machine Vision Conference, BMVA
Karatzas D, Bigorda L G I, Nicolaou A, Ghosh S, Bagdanov A D, Iwamura M, Matas J, Neumann L, Chandrasekhar V, Lu S, Shafait F, Uchida S, Valveny E (2015) Icdar 2015 competition on robust reading
Phan T Q, Shivakumara P, Tian S, Tan C L (2014) Recognizing text with perspective distortion in natural scenes. In: IEEE International Conference on Computer Vision
Cubuk E D, Zoph B, Shlens J, Le Q V (2020) Randaugment: Practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 702–703
Shi B, Wang X, Lyu P, Cong Y, Xiang B (2016) Robust scene text recognition with automatic rectification. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Wang J, Hu X (2017) Gated recurrent convolution neural network for ocr. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp 334–343
Borisyuk F, Gordo A, Sivakumar V (2018) Rosetta: Large scale system for text detection and recognition in images. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 71–79
Atienza R (2021) Vision transformer for fast and efficient scene text recognition. In: International Conference on Document Analysis and Recognition, Springer, pp 319–334
Zhang Y, Gueguen L, Zharkov I, Zhang P, Seifert K, Kadlec B (2017) Uber-text: A large-scale dataset for optical character recognition from street-level imagery. In: SUNw: Scene Understanding Workshop-CVPR, vol 2017, p 5

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 62172267), the Natural Science Foundation of Shanghai, China (Grant No. 20ZR1420400), the State Key Program of National Natural Science Foundation of China (Grant No. 61936001) and the Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02).

Author information

Authors and Affiliations

School of Computer Engineering and Science, Shanghai University, Shanghai, 200444, China
Xing Wu, Bin Tang & Jianjia Wang
Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, 200444, China
Xing Wu & Jianjia Wang
CSSC Ocean Exploration Technology Research Institute Co., Ltd., WuXi, 214000, China
Ming Zhao
Hong Kong Baptist University, Hong Kong, China
Yike Guo

Corresponding author

Correspondence to Xing Wu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wu, X., Tang, B., Zhao, M. et al. STR Transformer: A Cross-domain Transformer for Scene Text Recognition. Appl Intell 53, 3444–3458 (2023). https://doi.org/10.1007/s10489-022-03728-5

Accepted: 06 May 2022
Published: 31 May 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s10489-022-03728-5
