A novel mutual nearest neighbor based symmetry for text frame classification in video
Research Highlights
► A new wavelet–median moment feature to enhance the gap between text and non-text pixels.
► Probable text block selection (PTBS) using k-means clustering among 16 blocks.
► Max–Min clustering to obtain dominant and high contrast pixels.
► A new mutual nearest neighbor symmetry concept (MNNS) to identify a true text block.
► The combination of PTBS and MNNS for achieving better results.
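The block-selection step named in the highlights can be sketched in isolation. A minimal sketch, assuming a 4×4 grid of 16 blocks and a simple intensity-variance score as a hypothetical stand-in for the paper's wavelet–median moment feature (which is not reproduced here):

```python
# Sketch of probable text block selection (PTBS): divide the frame into a
# 4x4 grid of 16 blocks, score each block, and keep the high-scoring
# cluster from a 1-D 2-means split. The intensity-variance score below is
# a hypothetical stand-in for the paper's wavelet-median moment feature.

def split_into_blocks(frame, rows=4, cols=4):
    """Divide a 2-D grayscale frame (list of rows) into rows*cols blocks."""
    h, w = len(frame), len(frame[0])
    bh, bw = h // rows, w // cols
    return [[row[c * bw:(c + 1) * bw] for row in frame[r * bh:(r + 1) * bh]]
            for r in range(rows) for c in range(cols)]

def block_feature(block):
    """Stand-in score: intensity variance (high for textured/text regions)."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def two_means_high(values, iters=20):
    """1-D k-means with k=2; returns indices assigned to the higher mean."""
    lo, hi = min(values), max(values)
    high = []
    for _ in range(iters):
        high = [i for i, v in enumerate(values) if abs(v - hi) < abs(v - lo)]
        low = [i for i in range(len(values)) if i not in high]
        if high:
            hi = sum(values[i] for i in high) / len(high)
        if low:
            lo = sum(values[i] for i in low) / len(low)
    return high

def probable_text_blocks(frame):
    """Indices of blocks selected as probable text blocks."""
    feats = [block_feature(b) for b in split_into_blocks(frame)]
    return two_means_high(feats)
```

The 2-means split plays the role of an unsupervised threshold: blocks in the higher-scoring cluster are retained as probable text blocks for further verification.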
Introduction
Text frame classification aims to classify frames among a large collection of video frames into text and non-text frames. It is useful in applications such as video browsing, event detection, event boundary detection, text tracking, and text detection and extraction. Due to the semantic gap between low-level features and high-level events, it is difficult to come up with a generic Content-based Image Retrieval (CBIR) method or automatic annotation method to achieve a high accuracy of event detection [1]. In addition, the dynamic nature of events such as sports further complicates the analysis and impedes the implementation of such live event detection. In view of this difficulty, event detection is realized by detecting and recognizing the starting texts of the games or events involved. Therefore, to build a computationally efficient and accurate event detection system, accurate text frame classification is required before text detection and recognition [2]. However, no method exists in the literature that solely works on text frame classification.
While text frame classification invariably makes use of text detection techniques, it differs from the usual text detection methods in the following respects: (1) text frame classification is basically a screening process prior to text detection and recognition, (2) text frame classification should be simple and fast in order to quickly identify a frame as text or non-text, (3) text frame classification helps to reduce computational burden by avoiding expensive text detection methods on given unknown video frames, many of which may turn out to be just non-text, and (4) many existing text detection methods assume that the given input is a text frame and hence false positives may occur when a non-text frame is fed as input. In this paper, we propose a text frame classification method by dividing a video frame into small windows, which we call “blocks”, to look for probable text pixels among these blocks using a mutual nearest neighborhood symmetry concept. Any block in which the presence of text is detected serves as an indication that the frame under testing is a text frame. The rest of the paper is outlined as follows: In the next section, we survey related works. We present our proposed method in detail in Section 3, followed by a series of experiments in Section 4. Section 5 concludes this paper with discussions on future works.
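The mutual nearest neighborhood idea mentioned above can be illustrated on its own: two elements are mutual nearest neighbors when each is the other's closest neighbor. The following sketch uses 2-D points and squared Euclidean distance purely as illustrative assumptions; the paper applies the concept to pixel and block representatives:

```python
# Illustration of the mutual nearest neighbor idea behind MNNS: two
# elements are mutual nearest neighbors when each is the other's closest
# neighbor. Points and squared Euclidean distance are illustrative
# assumptions, not the paper's exact feature space.

def nearest(points, i):
    """Index of the point closest to points[i] (squared Euclidean, j != i)."""
    best, best_d = None, float("inf")
    for j, (x, y) in enumerate(points):
        if j == i:
            continue
        d = (points[i][0] - x) ** 2 + (points[i][1] - y) ** 2
        if d < best_d:
            best, best_d = j, d
    return best

def mutual_nn_pairs(points):
    """All index pairs (i, j), i < j, that are mutual nearest neighbors."""
    return [(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if nearest(points, i) == j and nearest(points, j) == i]
```

The symmetry of the relation is what makes it useful for verification: a one-sided nearest-neighbor relation can be accidental, whereas a mutual one indicates that two representatives genuinely belong together.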
Section snippets
Related work
The closest related work is that of Li et al. [3] for video text tracking. The system includes a component for text frame classification to find the first text frame in a video stream in order to start text tracking. The method of text frame classification is based on a supervised learning method using a neural network classifier. The method is thus dependent on the training set and requires considerable training time for the use of the neural network classifier. It also serves a different
The proposed method
The text detection methods surveyed in the preceding section cannot be used for text frame classification directly, as video contains a large number of text and non-text frames. Besides, text detection methods generally work at the pixel level to locate text in video images, and hence it is a time-consuming process when non-text frames are fed as input for text detection. Thus there is a necessity for frame screening over a large number of video frames to identify text frames before applying an
Experimental results
As there is no standard dataset for text frame classification available in the literature and this is the first attempt at frame classification, we created our own dataset, which includes 1220 text frames and 800 non-text frames. We ran all experiments on a PC with a P4 3 GHz processor and 1 GB RAM running the Windows XP operating system. This dataset includes a variety of frames, such as scene text, graphic text, various font sizes and scripts, and various resolutions and backgrounds. We
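Performance on such a dataset is typically reported with recall, precision and F-measure; the exact measures used are given in the full text, so the following is only a generic sketch assuming counts of true positives, false positives and false negatives are available for the text-frame class:

```python
# Generic sketch of standard classification measures (recall, precision,
# F-measure) from counts of true positives (tp), false positives (fp) and
# false negatives (fn). This is the textbook formulation, not necessarily
# the exact measures reported in the paper's full text.

def evaluate(tp, fp, fn):
    """Return (recall, precision, F-measure) for one class."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure
```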
Conclusion and future work
In this work, we have proposed a novel method for the complex problem of classifying text frames in a large database containing key frames of both text and non-text. To the best of our knowledge, this is the first work that attempts to solve this text frame classification problem. The proposed Max–Min clustering approach helps in obtaining dominant and high contrast pixels to form text representatives for identifying text blocks. The main contribution of the work is in introducing the mutual
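The Max–Min clustering step named above can be illustrated with a minimal one-dimensional sketch: the maximum and minimum intensities seed two clusters, each value joins the seed it is closer to, and the max-side cluster is kept as the dominant, high contrast pixel set. The tie-breaking and feature space here are assumptions, not the paper's exact formulation:

```python
# One-dimensional sketch of a Max-Min style clustering: the maximum and
# minimum intensities seed two clusters, and each value joins the seed it
# is closer to. The max-side cluster is kept as the dominant, high
# contrast pixel set. Illustrative assumption, not the paper's exact
# formulation.

def max_min_cluster(values):
    """Return the values assigned to the max-seeded (high contrast) cluster."""
    hi, lo = max(values), min(values)
    return [v for v in values if abs(v - hi) <= abs(v - lo)]
```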
Acknowledgment
This work is done jointly by NUS and ISI, Kolkata, India. This research is also supported in part by the MDA grant R252-000-325-279 and A*STAR grant R252-000-402-305. Our special thanks to the anonymous reviewers for their constructive suggestions to improve the quality of the paper.
References (35)
- et al., Text information extraction in images and video: a survey, Pattern Recognition (2004)
- Neural network-based text location in color images, Pattern Recognition Lett. (2001)
- et al., Automatic text location in images and video frames, Pattern Recognition (1998)
- et al., Fast and robust text detection in images and video frames, Image Vision Comput. (2005)
- et al., Accurate video text detection through classification of low and high contrast images, Pattern Recognition (2010)
- et al., Structuring low quality videotaped lectures for cross-reference browsing by video text analysis, Pattern Recognition (2008)
- et al., A localization/verification scheme for finding text in images and video frames based on contrast independent features and machine learning, Signal Process.: Image Commun. (2004)
- et al., A new robust algorithm for video text extraction, Pattern Recognition (2003)
- et al., Text detection, localization and tracking in compressed video, Signal Process.: Image Commun. (2007)
- C. Xu, J. Wang, K. Wan, Y. Li, L. Duan, Live sports event detection based on broadcast video and web-casting text, in:...
- Automatic text detection and tracking in digital video, IEEE Trans. Image Process.
- Automatic detection and recognition of signs from natural scenes, IEEE Trans. Image Process.
P. Shivakumara is a Research Fellow in the Department of Computer Science, School of Computing, National University of Singapore. He received B.Sc., M.Sc., M.Sc. Technology by research and Ph.D. degrees in computer science, respectively, in 1995, 1999, 2001 and 2005 from University of Mysore, Mysore, Karnataka, India. In addition to this, he has obtained educational degree (B.Ed.) in 1996 from Bangalore University, Bangalore, India.
From 1999 to 2005, he was a Project Associate in the Department of Studies in Computer Science, University of Mysore, where he conducted research on document image analysis, including document image mosaicing, character recognition, skew detection, face detection and face recognition. He worked as a Research Fellow in the field of image processing and multimedia in the Department of Computer Science, School of Computing, National University of Singapore, from 2005 to 2007. He also worked as a Research Consultant at Nanyang Technological University, Singapore, for a period of 6 months on image classification in 2007. He has published around 90 research papers in national and international conferences and journals. He has been a reviewer for several conferences and journals.
His research interests are in the area of image processing, pattern recognition, including text extraction from video, document image processing, biometric applications and automatic writer identification.
Anjan Dutta received the B.Sc. degree in Mathematics from the University of Calcutta, Kolkata, India, in 2006 and the MCA degree in Computer Applications from the West Bengal University of Technology, Kolkata, India, in 2009. He is currently pursuing a Master's degree in Computer Vision and Artificial Intelligence at the Universitat Autònoma de Barcelona, Barcelona, Spain, while working as a Ph.D. student at the Computer Vision Centre, Barcelona, Spain, under the supervision of Dr. Josep Lladós and Dr. Umapada Pal. His main research interests include graphics recognition and structural pattern recognition using graph matching techniques.
Trung Quy Phan is pursuing a graduate degree in the Department of Computer Science, School of Computing, National University of Singapore, Singapore.
He is currently a Research Assistant with the School of Computing, National University of Singapore, Singapore. His current research interests include image and video analysis.
Chew Lim Tan is a Professor in the Department of Computer Science, School of Computing, National University of Singapore. He received his B.Sc. (Hons.) degree in physics in 1971 from the University of Singapore, his M.Sc. degree in radiation studies in 1973 from the University of Surrey, UK, and his Ph.D. degree in computer science in 1986 from the University of Virginia, U.S.A. His research interests include document image analysis, text and natural language processing, neural networks and genetic programming. He has published more than 300 research publications in these areas. He is an associate editor of Pattern Recognition, an associate editor of Pattern Recognition Letters, and an editorial board member of the International Journal on Document Analysis and Recognition. He is a member of the Governing Board of the International Association for Pattern Recognition (IAPR) and a senior member of the IEEE.
Umapada Pal received his Ph.D. from the Indian Statistical Institute; his Ph.D. work was on the development of a printed Bangla OCR system. He did his post-doctoral research on the segmentation of touching English numerals at the Institut National de Recherche en Informatique et en Automatique (INRIA), France. During July 1997–January 1998, he visited GSF-Forschungszentrum für Umwelt und Gesundheit GmbH, Germany, to work as a guest scientist on a project on image analysis. Since January 1997, he has been a faculty member of the Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata. His primary research area is digital document processing. He has published 160 research papers in various international journals, conference proceedings and edited volumes. In 1995, he received the student best paper award from the Chennai Chapter of the Computer Society of India. He received a merit certificate from the Indian Science Congress Association in 1996. Because of his significant impact on document analysis research for Indian languages, the TC-10 and TC-11 committees of the International Association for Pattern Recognition (IAPR) presented the ‘ICDAR Outstanding Young Researcher Award’ to Dr. Pal in 2003. In 2005–2006, Dr. Pal received a JSPS fellowship from the Japanese government. Dr. Pal has served as a program committee member of many conferences, including the International Conference on Document Analysis and Recognition (ICDAR), the International Workshop on Document Image Analysis for Libraries (DIAL), the International Workshop on Frontiers in Handwriting Recognition (IWFHR), and the International Conference on Pattern Recognition (ICPR). He is also the Asian PC-Chair for the 10th ICDAR to be held in Barcelona, Spain, in 2009. He has served as the guest editor of a special issue of the VIVEK journal on document image analysis of Indian scripts, and is currently co-editing a special issue of the journal Electronic Letters on Computer Vision and Image Analysis.
He is a life member of the Indian unit of IAPR (IUPRAI) and a senior life member of the Computer Society of India.