Text detection, localization, and tracking in compressed video
Introduction
Video sequences typically integrate audio, images, graphics, text, and so on, and have become an indispensable part of people's daily life. With the rapid improvements in video compression technology, the expansion of low-cost storage media, and the information explosion over the Internet, digital video libraries have become a reality and will become pervasive in the near future. Accordingly, fast video access and browsing are required in many areas, such as video conferencing, remote video-based education, and video-on-demand systems [1], [2]. Video texts provide intuitive information that helps audiences grasp the meaning of video content. The clues provided by video texts have been widely used in semantics-based video analysis, indexing, and retrieval [3], [4], [5], [6], [7], [8], [9]. Video texts in sports programs and news often indicate the occurrence of highlight events, to which audiences pay particular attention.
In general, there are two types of texts in video sequences: scene texts and artificial texts. Scene texts are captured by cameras and are naturally embedded in scenes, such as the text on trademarks, signposts, and so on [9]. Artificial texts are purposely added to video frames during video editing [10]; thus, they are closely related to the content of the video. As addressed in Ref. [11], text-based video retrieval, indexing, and abstracting are more reliable than audio- and image-based methods, because many existing commercial optical character recognition (OCR) systems are far more robust than current speech analysis and visual object analysis systems [3], [4], [5], [7], [8]. The fundamental problem of semantics-based video indexing and retrieval thus reduces to that of video text detection, localization, and tracking.
The rest of this paper is organized as follows. Section 2 briefly reviews related work on text detection, localization, and tracking. Section 3 proposes our DCT-texture-based methods for video text detection, localization, and tracking. Section 4 presents and discusses the experimental results. Section 5 draws the conclusions.
Related work on text detection, localization and tracking
To date, text-detection methods can be classified into three categories. The first consists of connected-component-based methods, e.g. [12], [13], [14], [15], which assume that text regions have uniform colors and satisfy certain size, shape, and spatial-alignment constraints. However, these methods are not effective when the text has colors similar to the background. The second consists of texture-based methods, e.g. [7], [10], [16], [17], which assume that the text regions …
DCT coefficients based text detection, localization and tracking
The block diagram of the proposed text detection, localization, and tracking framework for compressed videos is shown in Fig. 1. It consists of four parts: candidate text block detection, text region verification, text line localization, and text tracking, each of which is described in detail in the remainder of this section.
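The four-stage flow can be sketched as a minimal compressed-domain pipeline. The function names, the neighbor-count verification rule, and the overlap-based tracker below are illustrative assumptions, not the paper's actual algorithms:

```python
import numpy as np

def detect_candidates(texture_map, thresh=100.0):
    """Stage 1: candidate text block detection by thresholding the per-block
    DCT texture-intensity map (the threshold value is an assumption)."""
    return texture_map > thresh

def verify_regions(mask, min_neighbors=3):
    """Stage 2: text region verification. Keep a candidate block only if
    enough of its 8-neighbors are also candidates -- a simplified stand-in
    for the paper's texture constraints."""
    p = np.pad(mask, 1).astype(int)
    neighbors = (p[:-2, :-2] + p[:-2, 1:-1] + p[:-2, 2:]
                 + p[1:-1, :-2] + p[1:-1, 2:]
                 + p[2:, :-2] + p[2:, 1:-1] + p[2:, 2:])
    return mask & (neighbors >= min_neighbors)

def localize_lines(mask):
    """Stage 3: text line localization -- merge verified blocks in each
    block row into (row, col_start, col_end) runs."""
    lines = []
    for r, row in enumerate(mask):
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                lines.append((r, start, c - 1))
            else:
                c += 1
    return lines

def track_lines(prev_lines, cur_lines):
    """Stage 4: text tracking -- match lines across consecutive I-frames
    by row identity and column overlap (a crude motion-free matcher)."""
    pairs = []
    for i, (r1, a1, b1) in enumerate(prev_lines):
        for j, (r2, a2, b2) in enumerate(cur_lines):
            if r1 == r2 and min(b1, b2) >= max(a1, a2):
                pairs.append((i, j))
    return pairs
```

In the actual framework, the input texture map would come from the DCT coefficients of I-frame blocks, and the verification and tracking rules would follow the constraints described in the rest of this section.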
Experimental results
To evaluate the performance of the proposed text detection, localization, and tracking methods, about 20 video sequences with different resolutions, containing Chinese and English texts in different font sizes, are used. These videos are converted into MPEG-2 coded videos with 12 frames per group of pictures. In total, about 100 min of video segments extracted from documentaries (Wild Australasia and Foxes of the Kalahari), commercial movies, news, and live sports videos are used to …
Conclusion
This paper proposes a unified framework for video text detection, localization, and tracking in compressed videos based on block DCT coefficients. Seven DCT coefficients of an 8×8 block of an I-frame are selected to approximately represent the texture intensity of the block. They capture horizontal, vertical, and diagonal texture information, which is used for video text detection. Candidate text block regions are verified by texture constraints. The accurate bounding box of each text line can be …
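The texture-intensity computation described above can be sketched as follows. The exact positions of the seven selected AC coefficients are an assumption for illustration; the source only states that they capture horizontal, vertical, and diagonal texture:

```python
import numpy as np

# Assumed positions of the seven AC coefficients in the 8x8 DCT grid;
# the actual selection in the paper may differ.
AC_POSITIONS = [(0, 1), (0, 2), (0, 3),   # horizontal texture
                (1, 0), (2, 0), (3, 0),   # vertical texture
                (1, 1)]                   # diagonal texture

def dct_matrix(N=8):
    """Orthonormal type-II DCT basis matrix, as used for MPEG 8x8 blocks."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0, :] /= np.sqrt(2.0)
    return C

_C = dct_matrix()

def block_dct2(block):
    """2-D DCT of one 8x8 block: C @ B @ C^T."""
    return _C @ block @ _C.T

def texture_intensity(frame):
    """Per-block texture intensity: sum of magnitudes of the selected AC
    coefficients. frame is a grayscale array with sides divisible by 8;
    high values mark candidate text blocks."""
    h, w = frame.shape
    out = np.zeros((h // 8, w // 8))
    for by in range(h // 8):
        for bx in range(w // 8):
            block = frame[8*by:8*by+8, 8*bx:8*bx+8].astype(float)
            coeffs = block_dct2(block)
            out[by, bx] = sum(abs(coeffs[u, v]) for u, v in AC_POSITIONS)
    return out
```

In a genuinely compressed-domain implementation the coefficients would be read directly from the entropy-decoded MPEG-2 bitstream rather than recomputed from pixels, which is what makes this class of approach fast.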
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China (NSFC; Project No. 60572045), the Ministry of Education of China Doctorate Program (Project No. 20050698033), and Microsoft Research Asia.
References (42)
- et al., An integrated system for content-based video retrieval and browsing, Pattern Recognit. (1997)
- et al., Automatic text detection and removal in video sequences, Pattern Recognit. Lett. (2003)
- et al., Fast and robust text detection in images and video frames, Image Vision Comput. (2005)
- et al., Locating text in complex color images, Pattern Recognit. (1995)
- et al., A localization/verification scheme for finding text in images and video frames based on contrast independent features and machine learning methods, Signal Process.: Image Commun. (2004)
- W. Qi, L. Gu, H. Jiang, X.-R. Chen, H.-J. Zhang, Integrating visual, audio and text analysis for news video, in: ...
- F. Wang, Y. Ma, H. Zhang, J. Li, A generic framework for semantic sports video analysis using dynamic Bayesian ...
- et al., A hybrid approach to news video classification with multimodal features, Proc. Int. Conf. Inf. Commun. Signal Process. (2003)
- et al., Multimedia event-based video indexing using time intervals, IEEE Trans. Multimedia (2005)
- Video OCR: indexing digital news libraries by recognition of superimposed captions, Multimedia Syst.
- Automatic text detection and tracking in digital video, IEEE Trans. Image Process.
- A comprehensive method for multilingual video text detection, localization, and extraction, IEEE Trans. Circuits Syst. Video Technol.
- Automatic caption localization in compressed video, IEEE Trans. Pattern Anal. Mach. Intell.
- Video text detection and segmentation for optical character recognition, Multimedia Syst.
- Localizing and segmenting text in images and videos, IEEE Trans. Circuits Syst. Video Technol.