Abstract
Scene text recognition is an indispensable part of computer vision, which aims to extract text information from an image. However, effective extraction of texts following spelling rules remains a challenge for scene text recognition. We propose a cross-domain Transformer, called STR Transformer (STRT), which can not only extract texts from an image but also correct characters effectively according to their spelling rules. Specifically, we propose a Spline Transformer to extract hierarchical features of images without the convolution layers, which has the flexibility to build models with various scales and has linear computational complexity with respect to image size. Furthermore, an iterative Text Transformer is designed to predict the probability distribution of current character in the character sequence, which can effectively reduce the impact of noise. Extensive experiments demonstrate that the proposed STRT outperforms state-of-the-art methods on various benchmark datasets of scene text recognition. The qualitative and quantitative analysis proves the effectiveness and efficiency of the proposed STRT method.
Similar content being viewed by others
References
Olszewska J I (2015) Active contour based optical character recognition for automated scene understanding. Neurocomputing 161:65–71
Karaoglu S, Tao R, Gevers T, Smeulders Arnold WM (2016) Words matter: Scene text for image classification and retrieval. IEEE Trans Multimed 19(5):1063–1076
Singh A, Natarajan V, Shah M, Jiang Y, Chen X, Batra D, Parikh D, Rohrbach M (2019) Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8317–8326
Wei L, Chen C, Wong K Y, Su Z, Han J (2016) Star-net: A spatial attention residue network for scene text recognition. In: British Machine Vision Conference 2016
Shi B, Yang M, Wang X, Lyu P, Yao C, Bai X (2018) Aster: An attentional scene text recognizer with flexible rectification. IEEE Trans Pattern Anal Mach Intell 41(9):2035–2048
Yu D, Li X, Zhang C, Liu T, Ding E (2020) Towards accurate scene text recognition with semantic reasoning networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Fang S, Xie H, Wang Y, Mao Z, Zhang Y (2021) Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 7098–7107
Chen Y, Zhuang T, Guo K (2021) Memory network with hierarchical multi-head attention for aspect-based sentiment analysis
Sun S- (2021) Self-attention enhanced cnns with average margin loss for chinese zero pronoun resolution
Cho K, van Merrienboer B, Gulcehre C, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)
Neumann L, Matas J (2012) Real-time scene text localization and recognition. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition
Risnumawan A, Shivakumara P, Chan C S, Tan C L (2014) A robust arbitrary text detection system for natural scene images. Expert Syst Appl 41(18):8027–8048
Wu X, Zhong M, Guo Y, Fujita H (2020) The assessment of small bowel motility with attentive deformable neural network. Inf Sci 508:22–32
Wu X, Chen C, Zhong M, Wang J (2021) Hal: Hybrid active learning for efficient labeling in medical domain
Wu X, Chen C, Zhong M, Wang J, Shi J (2021) Covid-al: The diagnosis of covid-19 with deep active learning. Med Image Anal 68:101913
Baek J, Kim G, Lee J, Park S, Han D, Yun S, Oh S J, Lee H (2019) What is wrong with scene text recognition model comparisons? dataset and model analysis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4715– 4723
Su B, Lu S (2014) Accurate scene text recognition based on recurrent neural network
Shi B, Xiang B, Cong Y (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell 39(11):2298–2304
Su B, Lu S (2017) Accurate recognition of words in scenes without character segmentation using recurrent neural network. Pattern Recogn 63:397–405
Li W, Wang Q, Wu J, Yu Z (2021) Piecewise convolutional neural networks with position attention and similar bag attention for distant supervision relation extraction
Pei M, Wu X, Guo Y, Fujita H (2017) Small bowel motility assessment based on fully convolutional networks and long short-term memory. Knowl-Based Syst 121:163–172
Li H, Wang P, Shen C, Zhang G (2019) Show, attend and read: A simple and strong baseline for irregular text recognition. Proceedings of the AAAI Conference on Artificial Intelligence 33:8610–8617
Cheng Z, Xu Y, Fan B, Yi N, Zhou S (2018) Aon: Towards arbitrarily-oriented text recognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Li H, Wang P, Shen C (2017) Towards end-to-end text spotting with convolutional recurrent neural networks. In: Proceedings of the IEEE international conference on computer vision, pp 5238–5246
Wang P, Li H, Shen C (2021) Towards end-to-end text spotting in natural scenes
Lyu P, Liao M, Yao C, Wu W, Bai X (2018) Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 67–83
Xing L, Tian Z, Huang W, Scott M R (2019) Convolutional character networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 9126–9136
He P, Huang W, Qiao Y, Loy C C, Tang X (2016) Reading scene text in deep convolutional sequences. In: Thirtieth AAAI conference on artificial intelligence
Ma X, He K, Zhang D, Li D (2021) Pieed: Position information enhanced encoder-decoder framework for scene text recognition
Yin G, Chen F, Dong Y, Li G (2021) Knowledge-aware recommendation model with dynamic co-attention and attribute regularize
Lee C Y, Osindero S (2016) Recursive recurrent nets with attention modeling for ocr in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2021) Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, PMLR, pp 10347–10357
Atienza R (2021) Vision transformer for fast and efficient scene text recognition. In: International Conference on Document Analysis and Recognition, Springer, pp 319–334
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 10012–10022
Al-Rfou R, Choe D, Constant N, Guo M, Jones L (2019) Character-level language modeling with deeper self-attention. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 3159–3166
Wang T, Zhu Y, Jin L, Luo C, Chen X, Wu Y, Wang Q, Cai M (2020) Decoupled attention network for text recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, pp 12216–12224
Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2014) Synthetic data and artificial neural networks for natural scene text recognition. Neural Information Processing Systems
Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localisation in natural images. In: IEEE Conference on Computer Vision and Pattern Recognition
Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda L G I, Mestre S R, Romeu J M, Mota D F, Almazán J, Heras L D L (2013) Icdar 2013 robust reading competition
Kai W, Babenko B, Belongie S (2012) End-to-end scene text recognition. In: IEEE International Conference on Computer Vision
Mishra A, Alahari K, Jawahar CV (2012) Scene text recognition using higher order language priors. In: BMVC-British Machine Vision Conference, BMVA
Karatzas D, Bigorda L G I, Nicolaou A, Ghosh S, Bagdanov A D, Iwamura M, Matas J, Neumann L, Chandrasekhar V, Lu S, Shafait F, Uchida S, Valveny E (2015) Icdar 2015 competition on robust reading
Phan T Q, Shivakumara P, Tian S, Tan C L (2014) Recognizing text with perspective distortion in natural scenes. In: IEEE International Conference on Computer Vision
Cubuk E D, Zoph B, Shlens J, Le Q V (2020) Randaugment: Practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 702–703
Shi B, Wang X, Lyu P, Cong Y, Xiang B (2016) Robust scene text recognition with automatic rectification. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Wang J, Hu X (2017) Gated recurrent convolution neural network for ocr. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp 334–343
Borisyuk F, Gordo A, Sivakumar V (2018) Rosetta: Large scale system for text detection and recognition in images. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 71–79
Atienza R (2021) Vision transformer for fast and efficient scene text recognition. In: International Conference on Document Analysis and Recognition, Springer, pp 319–334
Zhang Y, Gueguen L, Zharkov I, Zhang P, Seifert K, Kadlec B (2017) Uber-text: A large-scale dataset for optical character recognition from street-level imagery. In: SUNw: Scene Understanding Workshop-CVPR, vol 2017, p 5
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 62172267), the Natural Science Foundation of Shanghai, China (Grant No. 20ZR1420400), the State Key Program of National Natural Science Foundation of China (Grant No. 61936001) and the Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Wu, X., Tang, B., Zhao, M. et al. STR Transformer: A Cross-domain Transformer for Scene Text Recognition. Appl Intell 53, 3444–3458 (2023). https://doi.org/10.1007/s10489-022-03728-5
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-022-03728-5