Scene text detection and recognition: recent advances and future trends

  • Review Article
  • Frontiers of Computer Science

Abstract

Text, as one of the most influential inventions of humanity, has played an important role in human life since ancient times. The rich and precise information embodied in text is very useful in a wide range of vision-based applications; therefore, text detection and recognition in natural scenes have become important and active research topics in computer vision and document analysis. In recent years especially, the community has seen a surge of research effort and substantial progress in these fields, though a variety of challenges (e.g., noise, blur, distortion, occlusion, and variation) still remain. The purposes of this survey are threefold: 1) to introduce up-to-date works, 2) to identify state-of-the-art algorithms, and 3) to predict potential research directions. Moreover, this paper provides comprehensive links to publicly available resources, including benchmark datasets, source code, and online demos. In summary, this literature review can serve as a good reference for researchers in the areas of scene text detection and recognition.

Author information

Corresponding author

Correspondence to Xiang Bai.

Additional information

Yingying Zhu received her BS in electronics and information engineering from Huazhong University of Science and Technology (HUST), China in 2011. She is currently a PhD student in the School of Electronic Information and Communications, HUST. Her research areas mainly include text/traffic sign detection and recognition in natural images.

Cong Yao received his BS and PhD in electronics and information engineering from Huazhong University of Science and Technology (HUST), China in 2008 and 2014, respectively. He was a visiting research scholar with Temple University, USA in 2013. His research has focused on computer vision and machine learning, in particular, the area of text detection and recognition in natural images.

Xiang Bai received his BS, MS, and PhD degrees from Huazhong University of Science and Technology (HUST), China in 2003, 2005, and 2009, respectively, all in electronics and information engineering. He is currently a professor in the School of Electronic Information and Communications, HUST, where he is also the Vice Director of the National Center of Anti-Counterfeiting Technology, China. His research interests include object recognition, shape analysis, scene text recognition, and intelligent systems.

About this article

Cite this article

Zhu, Y., Yao, C. & Bai, X. Scene text detection and recognition: recent advances and future trends. Front. Comput. Sci. 10, 19–36 (2016). https://doi.org/10.1007/s11704-015-4488-0

  • DOI: https://doi.org/10.1007/s11704-015-4488-0
