Abstract
Text displayed in a video is an essential part of the high-level semantic information of the video content. Video text can therefore serve as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with a high recall rate. The detected candidate text lines are then refined using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate false alarms. For text recognition, we have developed a novel skeleton-based binarization method that separates text from complex backgrounds so that it can be processed by standard OCR (Optical Character Recognition) software. The operability and accuracy of the proposed text detection and binarization methods have been evaluated using publicly available test data sets.
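To make the entropy-based refinement step more concrete, the following is a minimal sketch of how an image entropy filter could be applied to a candidate region. It is not the implementation used in the paper: the function names, the threshold value `min_entropy`, and the decision to discard low-entropy regions are all assumptions chosen for illustration. The sketch computes the Shannon entropy of the gray-level distribution of a region and rejects nearly uniform regions, which are unlikely to contain text.

```python
import math
from collections import Counter


def image_entropy(gray_pixels):
    """Shannon entropy (in bits) of a flat sequence of gray levels."""
    total = len(gray_pixels)
    if total == 0:
        return 0.0
    counts = Counter(gray_pixels)
    # H = -sum(p * log2(p)) over the observed gray levels
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def keep_candidate(gray_pixels, min_entropy=0.5):
    """Hypothetical filter rule: keep a candidate region only if its
    entropy reaches min_entropy (an assumed, untuned threshold)."""
    return image_entropy(gray_pixels) >= min_entropy


# A uniform region carries no information; a two-level region carries 1 bit.
flat_region = [128] * 16
two_level_region = [0, 255] * 8
print(keep_candidate(flat_region), keep_candidate(two_level_region))
```

In practice the threshold would have to be tuned on training data, and the pixels would come from the bounding box of each candidate text line.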
Notes
Mediaglobe is an SME project of the THESEUS research program, supported by the German Federal Ministry of Economics and Technology on the basis of a decision by the German Bundestag, cf. http://www.projekt-mediaglobe.de/ (last access: 14/09/2012).
Text localization is the first task of the “reading text in born-digital images (web and email)” challenge.
http://www.yovisto.com/labs/VideoOCR/ (last access: 14/09/2012)
http://trecvid.nist.gov/ (last access: 14/09/2012)
http://code.google.com/p/tesseract-ocr/ (last access: 14/09/2012)
http://liris.cnrs.fr/christian.wolf/software/binarize/index.html (last access: 14/09/2012)
http://hunspell.sourceforge.net/ (last access: 14/09/2012)
References
Anthimopoulos M, Gatos B, Pratikakis I (2010) A two-stage scheme for text detection in video images. J Image Vis Comput 28:1413–1426
Bhaskar H, Mihaylova L (2010) Combined feature-level video indexing using block-based motion estimation. In: Proc. of 13th conference on information fusion (FUSION). Edinburgh, pp 1–8
Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 8(6):679–698
Chen D, Odobez JM, Bourlard H (2004) Text detection and recognition in images and video frames. J Pattern Recogn Soc 37(3):595–608
Deza MM, Deza E (2009) Encyclopedia of distances. Springer
Epshtein B, Ofek E, Wexler Y (2010) Detecting text in natural scenes with stroke width transform. In: Proc. of international conference on computer vision and pattern recognition, pp 2963–2970
Gllavata J, Ewerth R, Freisleben B (2004) Text detection in images based on unsupervised classification of high-frequency wavelet coefficients. In: Proceedings of the 17th international conference on pattern recognition (ICPR’04), vol 1, pp 425–428
Gllavata J, Qeli E, Freisleben B (2006) Detecting text in videos using fuzzy clustering ensembles. In: Proceedings of the 8th IEEE international symposium on multimedia, ISM ’06. IEEE Computer Society. Washington, DC, pp 283–290
Hanif SM, Prevost L (2009) Text detection and localization in complex scene images using constrained adaboost algorithm. In: Proceedings of the 2009 10th international conference on document analysis and recognition, ICDAR ’09. IEEE Computer Society. Washington, DC, pp 1–5
Hua XS, Chen XY, Zhang HJ (2001) Automatic location of text in video frames. In: Proc. of ACM multimedia 2001 workshops: multimedia information retrieval, pp 24–27
Hua XS, Liu WY, Zhang HJ (2004) An automatic performance evaluation protocol for video text detection algorithms. IEEE Trans Circuits Syst Video Technol 14(4):498–507
ICDAR RWR (2011) http://www.cvc.uab.es/icdar2011competition/?com=results (last access: 10/07/2012)
Jung K, Kim KI, Jain AK (2004) Text information extraction in images and video: a survey. Pattern Recogn 37(5):977–997
Karatzas D, Mestre SR, Mas J, Nourbakhsh F, Roy PP (2011) ICDAR 2011 robust reading competition: challenge 1: reading text in born-digital images (web and email). In: Proc. international conference on document analysis and recognition (ICDAR). Beijing, pp 1485–1490
Keysers D (2006) Comparison and combination of state-of-the-art techniques for handwritten character recognition: topping the MNIST benchmark
Kim HH (2011) Toward video semantic search based on a structured folksonomy. J Am Soc Inf Sci Technol 62(3):478–492
Kim KI, Jung K, Park SH, Kim HJ (2001) Support vector machine-based text detection in digital video. Pattern Recogn 34(2):527–529
Li H, Kia O, Doermann D (1999) Text enhancement in digital video. In: Proc. of SPIE, document recognition IV, pp 1–8
Li H, Doermann DS, Kia OE (2000) Automatic text detection and tracking in digital video. IEEE Trans Image Process 9(1):147–156
Lienhart R, Wernicke A (2002) Localizing and segmenting text in images and videos. IEEE Trans Circuits Syst Video Technol 12(4):256–268
Niblack W (1986) An introduction to digital image processing. Prentice-Hall, Englewood Cliffs
Ojala T, Pietikäinen M, Harwood D (1996) A comparative study of texture measures with classification based on featured distributions. Pattern Recogn 29(1):51–59
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9(1):62–66
Pan YF, Hou X, Liu CL (2008) A robust system to detect and localize texts in natural scene images. In: Proceedings of the 2008 the eighth IAPR international workshop on document analysis systems, DAS ’08. IEEE Computer Society. Washington, DC, pp 35–42
Qian X, Liu G, Wang H, Su R (2007) Text detection, localization and tracking in compressed video. In: Proc. of international conference on signal processing: image communication, pp 752–768
Sato T, Kanade T, Hughes EK, Smith MA, Satoh S (1999) Video OCR: indexing digital news libraries by recognition of superimposed captions. Multimedia Syst 7(5):385–395
Sauvola J, Pietikainen M (2000) Adaptive document image binarization. Pattern Recogn 33(2):225–236
Serra J (1983) Image analysis and mathematical morphology. Academic Press, Orlando
Shivakumara P, Phan TQ, Tan CL (2009) Video text detection based on filters and edge features. In: Proc. of the 2009 international conference on multimedia and expo. IEEE, pp 1–4
Sobel I (1990) An isotropic 3×3 image gradient operator. In: Machine vision for three-dimensional scenes, pp 376–379
Sobottka K, Bunke H, Kronenberg H (1999) Identification of text on colored book and journal covers. In: Proc. of international conference on document analysis and recognition, pp 57–63
Sonnenburg S, Rätsch G, Henschel S, Widmer C, Behr J, Zien A, Bona FD, Binder A, Gehl C, Franc V (2010) The shogun machine learning toolbox. J Mach Learn Res 11:1799–1802
Thillou CM, Gosselin B (2007) Color text extraction with selective metric-based clustering. Comput Vis Image Underst 107:1–2
Wolf C, Jolion JM, Chassaing F (2002) Text localization, enhancement and binarization in multimedia documents. In: Proc. of the international conference on pattern recognition, vol 2, pp 1037–1040
Yang H, Siebert M, Lühner P, Sack H, Meinel C (2011) Automatic lecture video indexing using video OCR technology. In: Proc. of international symposium on multimedia (ISM), pp 111–116
Zeng C, Ma H (2010) Robust head-shoulder detection by PCA-based multilevel HOG-LBP detector for people counting. In: Proceedings of the 2010 20th international conference on pattern recognition, ICPR ’10. IEEE Computer Society. Washington, DC, pp 2069–2072
Zhao M, Li S, Kwok J (2010) Text detection in images using sparse representation with discriminative dictionaries. J Image Vis Comput 28:1590–1599
Zhong Y, Zhang HJ, Jain A (2000) Automatic caption localization in compressed video. IEEE Trans Pattern Anal Mach Intell 22(4):385–392
Zhou Z, Li L, Tan CL (2010) Edge based binarization for video text images. In: Proc. of 20th international conference on pattern recognition. Singapore, pp 133–136
Acknowledgement
This work has been supported by the Mediaglobe project. Mediaglobe is an SME project of the THESEUS research program, supported by the German Federal Ministry of Economics and Technology on the basis of a decision by the German Bundestag (FKZ: 01MQ09031).
About this article
Cite this article
Yang, H., Quehl, B. & Sack, H. A framework for improved video text detection and recognition. Multimed Tools Appl 69, 217–245 (2014). https://doi.org/10.1007/s11042-012-1250-6
DOI: https://doi.org/10.1007/s11042-012-1250-6