Scene text recognition using residual convolutional recurrent neural network

  • Original Paper
  • Machine Vision and Applications

Abstract

Text is an important medium of human communication, and recognizing text in scene images is becoming increasingly important. In this paper, we propose a residual convolutional recurrent neural network for scene text recognition. A general convolutional recurrent neural network (CRNN) combines a convolutional neural network (CNN) with a recurrent neural network (RNN): the CNN part extracts features, and the RNN part encodes and decodes the feature sequences. To improve the accuracy of CRNN-based scene text recognition, we explore different deeper CNN architectures for obtaining feature descriptors and analyze the corresponding text recognition results. Specifically, VGG and ResNet are introduced to train these different deep models and obtain the encoding information of images. Experimental results on public datasets demonstrate the effectiveness of our method.
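
As a rough illustration of the two ideas named in the abstract, the sketch below shows (a) a residual unit with an identity shortcut in the style of ResNet and (b) one common way a CRNN can hand CNN features to an RNN, by slicing the feature map into one vector per image column. It is a minimal pure-Python sketch and not the authors' implementation: the function names `residual_unit` and `map_to_sequence`, the `[channel][row][column]` layout, and the column-wise slicing are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical layout for illustration: indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def residual_unit(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return relu(transform(x) + x), the identity-shortcut form of a residual unit."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("the identity shortcut needs matching input and output sizes")
    return relu([a + b for a, b in zip(fx, x)])


def map_to_sequence(fmap: FeatureMap) -> List[Vector]:
    """Slice a feature map into one vector per column, ordered left to right.

    Assumption: each column of the feature map becomes one time step
    of the sequence that the recurrent part would consume.
    """
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    return [
        [fmap[c][h][w] for c in range(channels) for h in range(height)]
        for w in range(width)
    ]


# Usage: a stand-in "layer" that halves its input, then a 2-channel, 1-row, 3-column map.
halve = lambda v: [0.5 * x for x in v]
activated = residual_unit([1.0, -2.0], halve)   # [1.5, 0.0]
sequence = map_to_sequence([[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]])
# sequence == [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
```

In a real model the `transform` argument would be a stack of convolution and normalization layers, and each column vector would be fed to the recurrent layers in order; both are left out here to keep the sketch self-contained.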

Acknowledgements

This work was supported in part by the Beijing Natural Science Foundation under Grant 4182056 and by the Specialized Fund for Joint Building Program of Beijing Municipal Education Commission.

Author information

Corresponding author

Correspondence to Sanyuan Zhao.

About this article

Cite this article

Lei, Z., Zhao, S., Song, H. et al. Scene text recognition using residual convolutional recurrent neural network. Machine Vision and Applications 29, 861–871 (2018). https://doi.org/10.1007/s00138-018-0942-y

  • DOI: https://doi.org/10.1007/s00138-018-0942-y