Text detection, localization, and tracking in compressed video

https://doi.org/10.1016/j.image.2007.06.005

Abstract

Video text information plays an important role in semantic-based video analysis, indexing, and retrieval, since video texts are closely related to the content of a video. The fundamental steps of text-based video analysis, browsing, and retrieval are text detection, localization, tracking, segmentation, and recognition. Video sequences are commonly stored in compressed formats, where MPEG coding techniques are often adopted. In this paper, a unified framework for text detection, localization, and tracking in compressed videos using discrete cosine transform (DCT) coefficients is proposed. A coarse-to-fine detection method finds text blocks based on block DCT texture intensity, where the texture intensity of an 8×8 block of an intra-frame is approximately represented by seven AC coefficients. Candidate text block regions are then verified and refined. Text line localization and tracking are carried out using the horizontal and vertical block texture intensity projection profiles, and the appearing and disappearing frames of each text line are determined by the tracking. Experimental results show the effectiveness of the proposed methods.
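The following Python snippet is a minimal sketch of the block-texture idea summarized in the abstract: the texture intensity of each 8×8 block is approximated by summing the magnitudes of a few low-order AC coefficients, and the resulting block-level map is thresholded and projected row-wise and column-wise. The particular choice of seven coefficients (three horizontal, three vertical, one diagonal), the absolute-sum weighting, and the threshold are illustrative assumptions, not the paper's exact parameters; in a true compressed-domain implementation the coefficients would be read directly from the decoded MPEG I-frame blocks rather than recomputed with a DCT as done here.

import numpy as np
from scipy.fftpack import dct


def block_dct(block):
    """2-D type-II DCT (orthonormal) of an 8x8 pixel block."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")


def texture_intensity(coeffs):
    """Approximate block texture intensity from seven low-order AC coefficients."""
    # Assumed selection: three horizontal, three vertical, and one diagonal coefficient.
    taps = ((0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (3, 0), (1, 1))
    return sum(abs(coeffs[u, v]) for u, v in taps)


def texture_map(gray_frame):
    """Per-8x8-block texture intensity map of an intra-coded (I) frame."""
    h, w = gray_frame.shape[0] // 8, gray_frame.shape[1] // 8
    tmap = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            block = gray_frame[8 * i:8 * i + 8, 8 * j:8 * j + 8].astype(float)
            tmap[i, j] = texture_intensity(block_dct(block))
    return tmap


def projection_profiles(tmap, threshold):
    """Row and column projections of the thresholded candidate-block mask."""
    candidate = tmap > threshold  # coarse candidate text blocks
    return candidate.sum(axis=1), candidate.sum(axis=0)

Peaks in the two projection profiles then indicate the rows and columns occupied by candidate text blocks, which is the information the localization and tracking stages operate on.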

Introduction

Video sequences usually integrate audio, images, graphics, text, and so on, and have become an indispensable part of daily life. With rapid improvements in video compression technology, the expansion of low-cost storage media, and the information explosion over the Internet, digital video libraries are becoming a reality and will be pervasive in the near future. Accordingly, fast video access and browsing are required in many areas such as video conferencing, remote video-based education, and video-on-demand systems [1], [2]. Video texts provide intuitive information that helps audiences grasp the meaning of video content, and the clues they provide have been widely used in semantic-based video analysis, indexing, and retrieval [3], [4], [5], [6], [7], [8], [9]. Video texts in sports programs and news broadcasts often indicate highlight events, to which audiences pay particular attention.

In general, there are two types of texts in video sequences, namely scene texts and artificial texts. Scene texts are captured by the camera and are naturally embedded in the scene, such as text on trademarks, signposts, and so on [9]. Artificial texts are purposely added to video frames during video editing [10]; thus, they are closely related to the content of the video. As addressed in Ref. [11], text-based video retrieval, indexing, and abstracting are more reliable than audio- and image-based methods, because many existing commercial optical character recognition (OCR) systems are far more robust than current speech analysis and visual object analysis systems [3], [4], [5], [7], [8]. The fundamental problem of semantic-based video indexing and retrieval therefore reduces to that of video text detection, localization, and tracking.

The rest of this paper is organized as follows. Section 2 briefly reviews related work on text detection, localization, and tracking. Section 3 presents our DCT texture-based methods for video text detection, localization, and tracking. Experimental results and discussions are given in Section 4, and conclusions are drawn in Section 5.

Section snippets

Related work on text detection, localization and tracking

Heretofore, text-detection methods can be classified into three categories. The first consists of connected component-based methods, e.g. [12], [13], [14], [15], which assume that text regions have uniform colors and satisfy certain size, shape, and spatial alignment constraints. However, these methods are not effective when the text has colors similar to the background. The second consists of texture-based methods, e.g. [7], [10], [16], [17], which assume that the text regions

DCT coefficients based text detection, localization and tracking

The block diagram of the proposed text detection, localization, and tracking framework for compressed videos is shown in Fig. 1. It consists of four parts, namely, candidate text block detection, text region verification, text line localization, and text tracking, which are described in detail in the following parts of this section.
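As a concrete illustration of how these four stages could fit together, the following Python sketch chains coarse detection, a simple run-length verification, projection-profile localization, and overlap-based tracking over the block texture map defined in the earlier snippet. The verification rule, all thresholds, and the overlap criterion used to match boxes across I-frames are placeholder assumptions for illustration only, not the parameters or exact rules of the proposed framework.

from dataclasses import dataclass
import numpy as np


@dataclass
class Track:
    box: tuple        # (row0, row1, col0, col1) in 8x8-block units
    first_frame: int  # I-frame index where the text line appears
    last_frame: int   # I-frame index where it was last seen


def detect_candidates(tmap, thr):
    """Stage 1: coarse detection - mark blocks with high DCT texture intensity."""
    return tmap > thr


def verify(mask, min_run=3):
    """Stage 2: verification - keep only horizontal runs of candidate blocks,
    since superimposed text lines tend to be horizontally aligned."""
    keep = np.zeros_like(mask)
    for i, row in enumerate(mask):
        run = 0
        for j, v in enumerate(row):
            run = run + 1 if v else 0
            if run >= min_run:
                keep[i, j - min_run + 1:j + 1] = True
    return keep


def localize(mask):
    """Stage 3: localization - split text lines with the row projection profile,
    then bound each line with the column projection profile."""
    row_profile = mask.sum(axis=1)
    boxes, start = [], None
    for r, filled in enumerate(list(row_profile > 0) + [False]):
        if filled and start is None:
            start = r
        elif not filled and start is not None:
            cols = np.flatnonzero(mask[start:r].sum(axis=0))
            boxes.append((start, r, int(cols[0]), int(cols[-1]) + 1))
            start = None
    return boxes


def iou(a, b):
    """Overlap ratio (intersection over union) of two block-unit boxes."""
    h = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    w = max(0, min(a[3], b[3]) - max(a[2], b[2]))
    area = lambda x: (x[1] - x[0]) * (x[3] - x[2])
    return h * w / float(area(a) + area(b) - h * w or 1)


def track(boxes_per_iframe, min_overlap=0.7):
    """Stage 4: tracking - chain boxes across consecutive I-frame indices to find
    the appearing and disappearing frame of each text line."""
    tracks = []
    for frame in sorted(boxes_per_iframe):
        for box in boxes_per_iframe[frame]:
            match = next((t for t in tracks
                          if t.last_frame == frame - 1
                          and iou(t.box, box) > min_overlap), None)
            if match:
                match.box, match.last_frame = box, frame
            else:
                tracks.append(Track(box, frame, frame))
    return tracks

Feeding the tracker one list of localized boxes per I-frame yields, for each text line, the first and last I-frame in which it is matched, i.e. its appearing and disappearing frames.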

Experimental results

To evaluate the performance of the proposed text detection, localization, and tracking methods, about 20 video sequences with different resolutions, containing Chinese and English texts in different font sizes, are used. These videos are converted to MPEG-2, with 12 frames per group of pictures. In total, about 100 min of video segments extracted from documentaries (i.e. Wild Australasia and Foxes of the Kalahari), commercial movies, news, and live sports videos are used to

Conclusion

This paper proposes a unified framework for video text detection, localization, and tracking in compressed videos based on block DCT coefficients. Seven DCT coefficients of an 8×8 block of an I-frame are selected to approximately represent the texture intensity of the block. They capture horizontal, vertical, and diagonal texture information, which can be used for video text detection. Candidate text block regions are verified by texture constraints. The accurate bounding box of each text line can be

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (NSFC; Project No. 60572045), the Ministry of Education of China Doctorate Program (Project No. 20050698033), and Microsoft Research Asia.

References (42)

  • T. Sato et al., Video OCR: indexing digital news libraries by recognition of superimposed captions, Multimedia Syst. (1999).
  • Y.-K. Lim, S.-H. Choi, S.-W. Lee, Text extraction in MPEG compressed video for content-based indexing, in: Proceedings...
  • Y. Cui, Q. Huang, Character extraction of license plates from video, in: Proceedings of International Conference on...
  • H. Li et al., Automatic text detection and tracking in digital video, IEEE Trans. Image Process. (2000).
  • M.R. Lyu et al., A comprehensive method for multilingual video text detection, localization, and extraction, IEEE Trans. Circuits Syst. Video Technol. (2005).
  • A.K. Jain, B. Yu, Automatic text location in images and video frames, in: Proceedings of International Conference...
  • V.Y. Mariano, R. Kasturi, Locating uniform-colored text in video frames, in: Proceedings of International Conference on...
  • Y. Zhong et al., Automatic caption localization in compressed video, IEEE Trans. Pattern Anal. Mach. Intell. (2000).
  • X. Qian, G. Liu, Text detection, localization and segmentation in compressed videos, in: Proceedings of International...
  • C.-W. Ngo et al., Video text detection and segmentation for optical character recognition, Multimedia Syst. (2005).
  • R. Lienhart et al., Localizing and segmenting text in images and videos, IEEE Trans. Circuits Syst. Video Technol. (2002).