Text recognition in natural scenes based on deep learning

Multimedia Tools and Applications

Abstract

To address the problems of character segmentation and dictionary dependence in natural-scene text recognition, a text recognition algorithm based on an attention mechanism and connectionist temporal classification (CTC) loss is proposed. A convolutional neural network and a bidirectional long short-term memory (LSTM) network encode the image features, which avoids the vanishing-gradient problem that a plain recurrent neural network (RNN) encounters as the number of time steps increases. An Attention-CTC structure then decodes the feature sequence, which effectively addresses the problem of unconstrained attention decoding. The algorithm requires neither an extra alignment step nor subsequent syntax processing; it speeds up training convergence and significantly improves the text recognition rate. The approach is therefore of research value for improving recognition accuracy. Experimental results show that the algorithm is robust to text images with blurred fonts and complex backgrounds.
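As an illustration of why a CTC-based decoder needs no character segmentation or explicit alignment, the following minimal sketch applies the standard CTC collapsing rule (take the best class per frame, merge consecutive repeats, drop blanks) to a toy score matrix. It is not the authors' implementation: the function name `ctc_greedy_decode`, the blank index 0, the toy alphabet, and the scores are all assumptions made for this example, and the attention branch of the Attention-CTC decoder is not modelled.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Greedy CTC decoding: best class per frame, merge repeats, drop blanks."""
    # Index of the highest-scoring class in every frame (time step).
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge runs of identical consecutive labels into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # Remove blanks; what remains is the recognised character sequence.
    return "".join(alphabet[label] for label in collapsed if label != blank)


# Toy alphabet: index 0 is the blank, the rest are characters.
alphabet = ["-", "c", "a", "t"]

# Six frames of per-class scores, standing in for encoder outputs.
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.20, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.20, 0.60],  # t
    [0.10, 0.10, 0.10, 0.70],  # t (repeat, merged)
]

print(ctc_greedy_decode(frame_scores, alphabet))  # prints "cat"
```

Six frames map to three characters without any per-character bounding box or frame-level label. A doubled letter is only recovered when a blank separates its two occurrences, which is why the blank symbol is part of the output alphabet.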

Author information

Corresponding author

Correspondence to Zhongyu Jiang.

About this article

Cite this article

Jiang, Y., Jiang, Z., He, L. et al. Text recognition in natural scenes based on deep learning. Multimed Tools Appl 81, 10545–10559 (2022). https://doi.org/10.1007/s11042-022-12024-w

  • DOI: https://doi.org/10.1007/s11042-022-12024-w
