PIEED: Position information enhanced encoder-decoder framework for scene text recognition

Ma, Xitao; He, Kai; Zhang, Dazhuang; Li, Dashuang

doi:10.1007/s10489-021-02219-3

Published: 10 February 2021

Applied Intelligence, volume 51, pages 6698–6707 (2021)

Xitao Ma¹, Kai He¹ (ORCID: orcid.org/0000-0002-8529-7960), Dazhuang Zhang¹ & Dashuang Li¹

Abstract

Scene text recognition (STR) technology has developed rapidly with the rise of deep learning. Recently, the attention-based encoder-decoder framework has been widely used in STR to improve recognition. However, the Long Short-Term Memory (LSTM) network commonly used in this framework tends to ignore certain position or visual information. To address this problem, we propose a Position Information Enhanced Encoder-Decoder (PIEED) framework for scene text recognition, in which an additional position information enhancement (PIE) module compensates for the shortcomings of the LSTM network. The module retains more position information in the feature sequence, alongside the context information extracted by the LSTM network, which helps improve recognition accuracy on text without context. In addition, our fusion decoder makes full use of the outputs of both the proposed module and the LSTM network, so that it can independently learn and preserve useful features; this improves recognition accuracy without increasing the number of parameters. The overall framework can be trained end-to-end using only images and ground-truth labels. Extensive experiments on several benchmark datasets demonstrate that the proposed framework surpasses state-of-the-art methods on both regular and irregular text recognition.
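The abstract does not say how the fusion decoder combines the two feature streams, so the following is only a toy sketch of one plausible reading, not the authors' method. It assumes dot-product attention applied separately to an LSTM context sequence and a position-enhanced sequence, with the two resulting glimpses added element-wise so that the fused vector keeps the same size and no extra weights are needed. The function names `attend` and `fused_glimpse`, and the additive fusion rule itself, are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Dot-product attention: return (weights, glimpse) for one decoding step."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    glimpse = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, glimpse


def fused_glimpse(query, context_feats, position_feats):
    """Assumed fusion rule: attend to each stream separately, then add glimpses."""
    _, g_ctx = attend(query, context_feats)    # stream from the LSTM branch
    _, g_pos = attend(query, position_feats)   # stream from the position branch
    return [c + p for c, p in zip(g_ctx, g_pos)]
```

Calling `fused_glimpse(query, context_feats, position_feats)` with two sequences of equal feature dimension returns a single vector of that same dimension, which a decoder step could then consume in place of a single-stream glimpse.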

Author information

Authors and Affiliations

Tianjin University, Tianjin, 300072, China
Xitao Ma, Kai He, Dazhuang Zhang & Dashuang Li

Corresponding author

Correspondence to Kai He.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ma, X., He, K., Zhang, D. et al. PIEED: Position information enhanced encoder-decoder framework for scene text recognition. Appl Intell 51, 6698–6707 (2021). https://doi.org/10.1007/s10489-021-02219-3

Accepted: 14 January 2021
Published: 10 February 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s10489-021-02219-3
