Abstract
Scene text recognition (STR) has developed rapidly with the rise of deep learning. Recently, the attention-based encoder-decoder framework has been widely adopted in STR to improve recognition. However, the Long Short-Term Memory (LSTM) network commonly used in this framework tends to ignore certain position or visual information. To address this problem, we propose a Position Information Enhanced Encoder-Decoder (PIEED) framework for scene text recognition, in which an additional position information enhancement (PIE) module compensates for the shortcomings of the LSTM network. The module retains more position information in the feature sequence, alongside the context information extracted by the LSTM network, which helps improve recognition accuracy on text without context. In addition, our fusion decoder makes full use of the outputs of both the proposed module and the LSTM network, independently learning and preserving useful features, which improves recognition accuracy without increasing the number of parameters. The overall framework can be trained end-to-end using only images and ground truth. Extensive experiments on several benchmark datasets demonstrate that the proposed framework surpasses state-of-the-art methods on both regular and irregular text recognition.
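The abstract does not specify how the PIE module or the fusion decoder is built, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes a sinusoidal position code (borrowed from Transformer-style models) and step-wise concatenation as the fusion; the function names `sinusoidal_position`, `enhance_with_position`, and `fuse` are hypothetical.

```python
import math


def sinusoidal_position(t, dim):
    """Sinusoidal code for step t (an assumed stand-in for a position signal)."""
    return [
        math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(t / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def enhance_with_position(features):
    """Add a position code to every step of a feature sequence."""
    dim = len(features[0])
    return [
        [f + p for f, p in zip(step, sinusoidal_position(t, dim))]
        for t, step in enumerate(features)
    ]


def fuse(context_seq, position_seq):
    """Assumed fusion: concatenate the two streams step by step."""
    return [c + p for c, p in zip(context_seq, position_seq)]


# Three identical visual feature vectors, e.g. a repeated character.
visual = [[1.0, 0.0, 0.0, 0.0] for _ in range(3)]

# Hypothetical placeholder for the context stream an LSTM would produce.
context = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]

enhanced = enhance_with_position(visual)
fused = fuse(context, enhanced)
```

In this toy setting the three input steps are indistinguishable before enhancement and distinct afterwards, while the fused sequence carries both streams so that a decoder could draw on either one.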
Cite this article
Ma, X., He, K., Zhang, D. et al. PIEED: Position information enhanced encoder-decoder framework for scene text recognition. Appl Intell 51, 6698–6707 (2021). https://doi.org/10.1007/s10489-021-02219-3
DOI: https://doi.org/10.1007/s10489-021-02219-3