Text detection, localization, and tracking in compressed video
Introduction
Video sequences typically integrate audio, images, graphics, text, and so on, and have become an indispensable part of people's daily life. With the rapid improvements in video compression technology, the expansion of low-cost storage media, and the information explosion over the Internet, digital video libraries have become a reality and will become pervasive in the near future. Accordingly, fast video access and browsing are required in many areas, such as video conferencing, remote video-based education, and video-on-demand systems [1], [2]. Video texts provide intuitive information that helps audiences grasp the meaning of video content. The clues provided by video texts have been widely used in semantics-based video analysis, indexing, and retrieval [3], [4], [5], [6], [7], [8], [9]. Video texts in sports programs and news often indicate the occurrence of highlight events, to which audiences pay particular attention.
In general, there are two types of texts in video sequences: scene texts and artificial texts. Scene texts are captured by cameras and are naturally embedded in scenes, such as the text on trademarks, signposts, and so on [9]. Artificial texts are purposely added to video frames during video editing [10]; thus, they are closely related to the content of the video. As addressed in Ref. [11], text-based video retrieval, indexing, and abstracting are more reliable than audio- and image-based methods, because many existing commercial optical character recognition (OCR) systems are far more robust than current speech analysis and visual object analysis systems [3], [4], [5], [7], [8]. The fundamental problem of semantics-based video indexing and retrieval thus reduces to that of video text detection, localization, and tracking.
The rest of this paper is organized as follows. Section 2 briefly reviews related work on text detection, localization, and tracking. Section 3 proposes our DCT-texture-based methods for video text detection, localization, and tracking. Section 4 presents and discusses the experimental results. Section 5 draws the conclusions.
Related work on text detection, localization and tracking
To date, text-detection methods can be classified into three categories. The first consists of connected-component-based methods, e.g. [12], [13], [14], [15], which assume that text regions have uniform colors and satisfy certain size, shape, and spatial-alignment constraints. However, these methods are not effective when the text has colors similar to the background. The second consists of texture-based methods, e.g. [7], [10], [16], [17], which assume that the text regions …
DCT coefficients based text detection, localization and tracking
The block diagram of the proposed text detection, localization, and tracking framework for compressed videos is shown in Fig. 1. It consists of four parts: candidate text block detection, text region verification, text line localization, and text tracking, each of which is described in detail in the remainder of this section.
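The four-stage flow can be sketched as a minimal compressed-domain pipeline. The function names, the neighbor-count verification rule, and the overlap-based tracker below are illustrative assumptions, not the paper's actual algorithms:

```python
import numpy as np

def detect_candidates(texture_map, thresh=100.0):
    """Stage 1: candidate text block detection by thresholding the per-block
    DCT texture-intensity map (the threshold value is an assumption)."""
    return texture_map > thresh

def verify_regions(mask, min_neighbors=3):
    """Stage 2: text region verification. Keep a candidate block only if
    enough of its 8-neighbors are also candidates -- a simplified stand-in
    for the paper's texture constraints."""
    p = np.pad(mask, 1).astype(int)
    neighbors = (p[:-2, :-2] + p[:-2, 1:-1] + p[:-2, 2:]
                 + p[1:-1, :-2] + p[1:-1, 2:]
                 + p[2:, :-2] + p[2:, 1:-1] + p[2:, 2:])
    return mask & (neighbors >= min_neighbors)

def localize_lines(mask):
    """Stage 3: text line localization -- merge verified blocks in each
    block row into (row, col_start, col_end) runs."""
    lines = []
    for r, row in enumerate(mask):
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                lines.append((r, start, c - 1))
            else:
                c += 1
    return lines

def track_lines(prev_lines, cur_lines):
    """Stage 4: text tracking -- match lines across consecutive I-frames
    by row identity and column overlap (a crude motion-free matcher)."""
    pairs = []
    for i, (r1, a1, b1) in enumerate(prev_lines):
        for j, (r2, a2, b2) in enumerate(cur_lines):
            if r1 == r2 and min(b1, b2) >= max(a1, a2):
                pairs.append((i, j))
    return pairs
```

In the actual framework, the input texture map would come from the DCT coefficients of I-frame blocks, and the verification and tracking rules would follow the constraints described in the rest of this section.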
Experimental results
To evaluate the performance of the proposed text detection, localization, and tracking methods, about 20 video sequences with different resolutions, containing Chinese and English texts in different font sizes, are used. These videos are converted into MPEG-2 coded videos with 12 frames per group of pictures. In total, about 100 min of video segments extracted from documentaries (Wild Australasia and Foxes of the Kalahari), commercial movies, news, and live sports videos are used to …
Conclusion
This paper proposes a unified framework for video text detection, localization, and tracking in compressed videos based on block DCT coefficients. Seven DCT coefficients of an 8×8 block of an I-frame are selected to approximately represent the texture intensity of the block. They capture horizontal, vertical, and diagonal texture information, which is used for video text detection. Candidate text block regions are verified by texture constraints. The accurate bounding box of each text line can be …
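The texture-intensity computation described above can be sketched as follows. The exact positions of the seven selected AC coefficients are an assumption for illustration; the source only states that they capture horizontal, vertical, and diagonal texture:

```python
import numpy as np

# Assumed positions of the seven AC coefficients in the 8x8 DCT grid;
# the actual selection in the paper may differ.
AC_POSITIONS = [(0, 1), (0, 2), (0, 3),   # horizontal texture
                (1, 0), (2, 0), (3, 0),   # vertical texture
                (1, 1)]                   # diagonal texture

def dct_matrix(N=8):
    """Orthonormal type-II DCT basis matrix, as used for MPEG 8x8 blocks."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0, :] /= np.sqrt(2.0)
    return C

_C = dct_matrix()

def block_dct2(block):
    """2-D DCT of one 8x8 block: C @ B @ C^T."""
    return _C @ block @ _C.T

def texture_intensity(frame):
    """Per-block texture intensity: sum of magnitudes of the selected AC
    coefficients. frame is a grayscale array with sides divisible by 8;
    high values mark candidate text blocks."""
    h, w = frame.shape
    out = np.zeros((h // 8, w // 8))
    for by in range(h // 8):
        for bx in range(w // 8):
            block = frame[8*by:8*by+8, 8*bx:8*bx+8].astype(float)
            coeffs = block_dct2(block)
            out[by, bx] = sum(abs(coeffs[u, v]) for u, v in AC_POSITIONS)
    return out
```

In a genuinely compressed-domain implementation the coefficients would be read directly from the entropy-decoded MPEG-2 bitstream rather than recomputed from pixels, which is what makes this class of approach fast.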
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China (NSFC; Project No. 60572045), the Ministry of Education of China Doctorate Program (Project No. 20050698033), and Microsoft Research Asia.
References (42)
- et al., An integrated system for content-based video retrieval and browsing, Pattern Recognit. (1997)
- et al., Automatic text detection and removal in video sequences, Pattern Recognit. Lett. (2003)
- et al., Fast and robust text detection in images and video frames, Image Vision Comput. (2005)
- et al., Locating text in complex color images, Pattern Recognit. (1995)
- et al., A localization/verification scheme for finding text in images and video frames based on contrast independent features and machine learning methods, Signal Process.: Image Commun. (2004)
- W. Qi, L. Gu, H. Jiang, X.-R. Chen, H.-J. Zhang, Integrating visual, audio and text analysis for news video, in: ...
- F. Wang, Y. Ma, H. Zhang, J. Li, A generic framework for semantic sports video analysis using dynamic Bayesian ...
- et al., A hybrid approach to news video classification with multimodal features, Proc. Int. Conf. Inf. Commun. Signal Process. (2003)
- et al., Multimedia event-based video indexing using time intervals, IEEE Trans. Multimedia (2005)
- Video OCR: indexing digital news libraries by recognition of superimposed captions, Multimedia Syst.
- Automatic text detection and tracking in digital video, IEEE Trans. Image Process.
- A comprehensive method for multilingual video text detection, localization, and extraction, IEEE Trans. Circuits Syst. Video Technol.
- Automatic caption localization in compressed video, IEEE Trans. Pattern Anal. Mach. Intell.
- Video text detection and segmentation for optical character recognition, Multimedia Syst.
- Localizing and segmenting text in images and videos, IEEE Trans. Circuits Syst. Video Technol.