Abstract
To address the problems of character segmentation and dictionary dependence in natural-scene text recognition, a text recognition algorithm based on the attention mechanism and connectionist temporal classification (CTC) loss is proposed. A convolutional neural network and a bidirectional long short-term memory (BiLSTM) network encode the image features, which avoids the vanishing-gradient problem that recurrent neural networks (RNNs) encounter as the number of time steps grows. An Attention-CTC structure then decodes the feature sequence, which effectively addresses the problem of unconstrained attention decoding. The algorithm needs neither extra alignment processing nor subsequent syntax processing, speeds up training convergence, and significantly improves the text recognition rate, which gives it research value with respect to recognition accuracy. Experimental results show that the algorithm is robust to text images with blurred fonts and complex backgrounds.
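The role of CTC in removing the separate alignment step can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the blank index `BLANK = 0`, the function names, and the interpolation weight `lam` are assumptions made for illustration, and the weighted sum in `joint_loss` follows the common joint CTC-attention formulation from the speech recognition literature rather than anything the abstract specifies. The code works on plain probabilities, so it suits only short toy sequences.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_probs):
    """Pick the most likely label per frame, merge repeats, then drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], None
    for label in best:
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return decoded


def ctc_neg_log_likelihood(frame_probs, target):
    """CTC forward algorithm: -log of the total probability of all alignments.

    frame_probs[t][k] is the probability of label k at frame t.
    Raises ValueError from math.log if no valid alignment has nonzero probability.
    """
    # Interleave blanks around the target labels: [a, b] -> [_, a, _, b, _].
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]

    # Initialise: a path may start with the leading blank or the first label.
    alpha = [0.0] * len(ext)
    alpha[0] = frame_probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for probs in frame_probs[1:]:
        new = [0.0] * len(ext)
        for s, label in enumerate(ext):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one symbol
            if s >= 2 and label != BLANK and label != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank between distinct labels
            new[s] = total * probs[label]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(likelihood)


def joint_loss(ctc_loss, attention_loss, lam=0.5):
    """Hypothetical weighted combination of the CTC and attention losses."""
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```

With two frames of uniform probabilities over three classes and the target `[1]`, three alignments are valid (`1 1`, `_ 1`, `1 _`), each with probability 1/9, so `ctc_neg_log_likelihood` returns `log(3)`. The sum over alignments is what lets training proceed without per-character segmentation.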
Cite this article
Jiang, Y., Jiang, Z., He, L. et al. Text recognition in natural scenes based on deep learning. Multimed Tools Appl 81, 10545–10559 (2022). https://doi.org/10.1007/s11042-022-12024-w
DOI: https://doi.org/10.1007/s11042-022-12024-w