STR Transformer: A Cross-domain Transformer for Scene Text Recognition

Published in Applied Intelligence

Abstract

Scene text recognition, which aims to extract textual information from images, is an indispensable part of computer vision. However, effectively extracting text that follows spelling rules remains a challenge for scene text recognition. We propose a cross-domain Transformer, called the STR Transformer (STRT), which can not only extract text from an image but also correct characters effectively according to spelling rules. Specifically, we propose a Spline Transformer that extracts hierarchical image features without convolution layers, offers the flexibility to build models at various scales, and has linear computational complexity with respect to image size. Furthermore, an iterative Text Transformer is designed to predict the probability distribution of the current character in the character sequence, which effectively reduces the impact of noise. Extensive experiments demonstrate that the proposed STRT outperforms state-of-the-art methods on various scene text recognition benchmarks. Qualitative and quantitative analyses confirm the effectiveness and efficiency of the proposed STRT method.
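The iterative Text Transformer described above predicts a distribution over the character set at each position and refines that prediction over several passes. The sketch below is a minimal, hypothetical illustration of this general idea in PyTorch, not the authors' implementation: the class name IterativeTextDecoder, the character-set size, sequence length, number of passes, and the mask-token feedback scheme are all assumptions made for illustration.

```python
# Minimal sketch (assumptions only, not the paper's code) of iterative
# character decoding: a Transformer decoder attends over image features,
# predicts a distribution per character position, and feeds its own
# predictions back in for a fixed number of refinement passes.
import torch
import torch.nn as nn

CHARSET_SIZE = 97      # assumed: 95 printable characters + [PAD] + [MASK]
MAX_LEN = 25           # assumed maximum text length
D_MODEL = 256          # assumed feature width

class IterativeTextDecoder(nn.Module):
    def __init__(self, num_layers: int = 3, num_passes: int = 3):
        super().__init__()
        self.char_embed = nn.Embedding(CHARSET_SIZE, D_MODEL)
        self.pos_embed = nn.Parameter(torch.zeros(1, MAX_LEN, D_MODEL))
        layer = nn.TransformerDecoderLayer(
            d_model=D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(D_MODEL, CHARSET_SIZE)
        self.num_passes = num_passes

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        """visual_feats: (B, N, D_MODEL) features from an image encoder."""
        batch = visual_feats.size(0)
        # Start from an all-[MASK] sequence (token id CHARSET_SIZE - 1).
        tokens = torch.full((batch, MAX_LEN), CHARSET_SIZE - 1,
                            dtype=torch.long, device=visual_feats.device)
        logits = None
        for _ in range(self.num_passes):
            x = self.char_embed(tokens) + self.pos_embed
            x = self.decoder(tgt=x, memory=visual_feats)
            logits = self.classifier(x)        # (B, MAX_LEN, CHARSET_SIZE)
            tokens = logits.argmax(dim=-1)     # feed predictions back in
        return logits

# usage with random features standing in for the image encoder output
if __name__ == "__main__":
    feats = torch.randn(2, 64, D_MODEL)
    probs = IterativeTextDecoder()(feats).softmax(dim=-1)
    print(probs.shape)  # torch.Size([2, 25, 97])
```

Feeding the previous pass's predictions back in lets later passes condition on a full, if noisy, character context, which is one way iterative correction of characters toward valid spellings can be realized.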


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 62172267), the Natural Science Foundation of Shanghai, China (Grant No. 20ZR1420400), the State Key Program of National Natural Science Foundation of China (Grant No. 61936001) and the Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02).

Author information


Corresponding author

Correspondence to Xing Wu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Wu, X., Tang, B., Zhao, M. et al. STR Transformer: A Cross-domain Transformer for Scene Text Recognition. Appl Intell 53, 3444–3458 (2023). https://doi.org/10.1007/s10489-022-03728-5
