A robust hybrid method for text detection in natural scenes by learning-based partial differential equations
Introduction
Text detection and recognition in natural scene images have received increasing attention in recent years [1], [2], [3], [4], because text often provides critical information for understanding the high-level semantics of multimedia content, such as street view data [5], [6]. Moreover, the demands of a growing number of applications on mobile devices have generated great interest in this problem. Text detection in natural scene images is very challenging due to complex backgrounds, uneven lighting, blurring, degradation, distortion, and the diversity of text patterns.
Many methods have been proposed for scene text detection; they can be roughly divided into three categories: sliding window based methods [7], [8], [9], connected component (CC) based methods [5], [10], [11], and hybrid methods [12]. Sliding window based methods search for possible text in multi-scale windows and classify each window as text or non-text using texture features. However, they are often computationally expensive, since a large number of windows of various sizes must be checked and complex classifiers are applied. CC-based methods first extract character candidates as connected components using low-level features, e.g., color similarity and spatial layout. The character candidates are then grouped into words after eliminating false candidates by connected component analysis (CCA). The hybrid method [12] builds a text region detector to estimate the probability of text at different positions and scales, and extracts character candidates (connected components) by local binarization. The CC-based and hybrid methods are more popular than the sliding window based ones because they can achieve high precision once the candidate characters are correctly detected and grouped. However, this condition is often not met: the low-level operations are usually unreliable and sensitive to noise, which makes it difficult to extract the right character candidates. A large number of wrong character candidates causes many difficulties in post-processing steps such as grouping and classification.
Recently, Liu et al. [13] have proposed a framework that learns partial differential equations (PDEs) from training image pairs, which has been successfully applied to several computer vision and image processing problems. It can handle some mid- and high-level tasks that traditional PDE-based methods cannot. In [13] they apply learning-based PDEs to object detection, color2gray, and demosaicking. In [14], they use an adaptive (learning-based) PDE system for saliency detection. However, these methods [13], [14] may not handle text detection well, because text is a man-made object whose interpretation strongly depends on human perception. The complexity of backgrounds, flexible text styles, and the variation of text content make text detection more challenging than the previous tasks.
In most cases, text covers only a small area of the scene image (Fig. 1), so post-processing steps such as character candidate extraction benefit if we can narrow down the candidate text regions and suppress distracting backgrounds. Although we cannot expect the learned PDEs to produce an exactly binary text mask, because their solution is a more or less smooth function, they can give us a relatively high quality confidence map as a good reference. We therefore use learning-based PDEs to design a text region detector. It is much faster than sliding window based methods because its complexity is only O(N), where N is the number of pixels in the image. Some examples of the detected region candidates are shown in Fig. 1. To make our method complete and comparable to others, we further propose a simple method for detecting text within the region candidates.
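The O(N) cost comes from the explicit evolution scheme: each pixel is updated from a fixed-size neighborhood per time step. As a hedged illustration only (a plain diffusion equation, not the authors' learned PDE system), one explicit step over an image could look like:

```python
import numpy as np

def explicit_pde_step(u, dt=0.1):
    """One explicit time step of a simple diffusion PDE, du/dt = laplacian(u).

    Illustrative stand-in for one evolution step of a learned PDE: every
    pixel is updated from its 4-neighborhood, so each step costs O(N)
    in the number of pixels N.
    """
    # Shifted copies with replicated (Neumann) boundaries.
    up = np.roll(u, -1, axis=0);    up[-1, :] = u[-1, :]
    down = np.roll(u, 1, axis=0);   down[0, :] = u[0, :]
    left = np.roll(u, -1, axis=1);  left[:, -1] = u[:, -1]
    right = np.roll(u, 1, axis=1);  right[:, 0] = u[:, 0]
    lap = up + down + left + right - 4.0 * u  # discrete Laplacian
    return u + dt * lap
```

A constant image is a steady state of this scheme, and any non-constant image is smoothed, which mirrors why the solution of the learned PDEs is "a more or less smooth function" rather than a binary mask.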
In summary, we propose a new robust hybrid method using learning-based PDEs. It combines a top-down scheme and a bottom-up scheme to extract text in natural scene images. Fig. 2 shows the flow chart of our system and Fig. 3 gives an example. A PDE system is first learned off-line with L1-norm regularization on the training images. In the top-down scheme, given a test image as the initial condition, we solve the learned PDEs to produce a high quality text confidence map (see Fig. 3(b)). We then apply a local binarization algorithm (Niblack [15]) to the confidence map to extract text region candidates (see Fig. 3(c)). In the bottom-up scheme, we present a simple connected component based method and apply it to each region candidate to determine accurate text locations. We first apply the mean shift algorithm and binarization (Otsu [16]) to extract character candidates (see Fig. 3(d)). Then we group these components into text lines based simply on their color and size (see Fig. 3(e)). Next, we adopt a two-level classification scheme (a character candidate classifier and a text candidate classifier) to eliminate non-text candidates, which yields the final result (Fig. 3(f)). Our system is evaluated on several benchmark databases and achieves higher F-measures than other methods. Note that the parameters and classifiers are trained only on the ICDAR 2011 database [17], as only this database provides the required training information, yet the proposed approach still yields higher precision and recall on the SVT databases [5], [6] than other state-of-the-art methods. We summarize the contributions of this paper as follows:
- We propose a new hybrid method for text detection in natural scene images. Unlike previous methods, our method consists of loosely coupled top-down and bottom-up schemes, where the latter can be replaced by any connected component based method.
- We apply learning-based PDEs to compute a high quality text confidence map, from which good text region candidates can easily be chosen. Unlike sliding window based methods, the complexity of learning-based PDEs for text candidate proposal is only O(N), where N is the number of pixels, so our learning-based PDEs are much faster. To the best of our knowledge, this is the first work that applies PDEs to text detection.
- We conduct extensive experiments on benchmark databases to demonstrate the superiority of our method over the state-of-the-art ones in detection accuracy. Note that, unlike previous approaches, all procedures after computing the text confidence map are very simple; the performance could be further improved if more sophisticated and ad hoc treatments were involved.
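The two-stage flow described above can be sketched as a pipeline skeleton. Every stage here is an injected placeholder callable, not an API from the paper; it only shows how the loosely coupled top-down and bottom-up schemes fit together:

```python
def detect_text(image, solve_pdes, binarize_map, extract_regions,
                extract_chars, char_clf, group_lines, text_clf):
    """Skeleton of the hybrid pipeline; each stage is a caller-supplied callable.

    Stage names are hypothetical placeholders: solve_pdes stands for the
    learned-PDE confidence map, binarize_map for Niblack local binarization,
    extract_chars for mean shift + Otsu, and the two classifiers for the
    two-level filtering scheme.
    """
    confidence_map = solve_pdes(image)                       # top-down scheme
    regions = extract_regions(binarize_map(confidence_map))  # region candidates
    detected = []
    for region in regions:                                   # bottom-up, per region
        chars = [c for c in extract_chars(region) if char_clf(c)]
        detected += [line for line in group_lines(chars) if text_clf(line)]
    return detected
```

Because the bottom-up stage enters only through `extract_chars`, `group_lines`, and the classifiers, it can be swapped for any CC-based method, as the first contribution states.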
The rest of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 describes the top-down scheme and Section 4 describes the bottom-up scheme. We discuss the relationship between our method and some related work in Section 5. Section 6 presents experiments that compare the proposed method and the state-of-the-art ones on several public databases. We conclude our paper in Section 7.
Related work
Most sliding window based methods first search for possible texts in multi-scale windows and then estimate the text existence probability by using classifiers. Zhong et al. [1] adopt image transformations, such as discrete cosine transform and wavelet decomposition, to extract features. They remove the non-text regions by thresholding the filter responses. Kim et al. [7] extract texture features from all local windows in every layer of image pyramid, which enables the method to detect texts at
Top-down: text confidence map computation and text region candidates extraction
As mentioned before, the proposed method consists of a top-down scheme and a bottom-up scheme. We introduce the top-down scheme in this section and leave the bottom-up scheme to the next section. We first introduce learning-based PDEs and then apply them to compute the text confidence map. We then adopt a local thresholding method to extract text region candidates.
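Niblack's method [15] thresholds each pixel against its local statistics, T(x, y) = m(x, y) + k·s(x, y), where m and s are the mean and standard deviation in a window around the pixel. A hedged sketch of this local binarization step (window size and k are illustrative choices, not the paper's exact settings):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(img, window=25, k=-0.2):
    """Niblack local thresholding: keep pixels above local_mean + k * local_std.

    Illustrative implementation of the binarization applied to the text
    confidence map; uniform_filter computes local means in O(N).
    """
    img = img.astype(np.float64)
    mean = uniform_filter(img, size=window)
    sq_mean = uniform_filter(img * img, size=window)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))  # Var = E[x^2] - E[x]^2
    return img > mean + k * std
```

With a negative k, pixels only slightly above the local mean still pass, which suits a smooth confidence map where text regions are brighter than their surroundings but not strictly binary.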
Bottom-up: text detection in the region candidates
Although using learning-based PDEs for text confidence map computation is our major contribution, to make our method complete and comparable to others, in this section we propose a simple framework to extract text strings in the detected text region candidates. It can be replaced by any CC-based method, and the performance can be further improved if more advanced CC-based methods are employed. This framework constitutes the bottom-up scheme and is mainly inspired by [21], [26]. It consists of three main
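The binarization step inside each region candidate uses Otsu's method [16], which picks the global threshold maximizing the between-class variance of the gray-level histogram. A minimal sketch of the standard algorithm (our own illustration, not the paper's code):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: the threshold that maximizes between-class variance.

    Sketch of the binarization used to extract character candidates inside
    each text region candidate; expects gray levels in [0, 255].
    """
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = hist.sum()
    sum_all = float(np.dot(np.arange(256), hist))
    best_t, best_var = 0, 0.0
    w0, sum0 = 0, 0.0
    for t in range(256):
        w0 += hist[t]                      # pixels at or below t
        if w0 == 0:
            continue
        w1 = total - w0                    # pixels above t
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

For a region whose mean shift segmentation yields roughly bimodal intensities (characters vs. background), the returned threshold separates the two modes, giving the character candidate mask.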
Discussions
In this section, we discuss the connection between our methods and some related work.
Experimental results
In this section, we compare our method with several state-of-the-art methods on a variety of public databases, including ICDAR 2005 [44], [45], ICDAR 2011 [17], and two street view text databases, i.e., the SVT 2010 database [5] and the SVT 2011 database [6]. The performance of these methods is quantitatively measured by precision (P), recall (R), and F-measure (F), computed using the same definitions as in [44], [45] at the image level. The overall performance values are
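The F-measure combines precision and recall as their harmonic mean, F = 2PR/(P + R). A minimal helper reflecting this standard definition (our own sketch; the ICDAR protocols additionally define how P and R themselves are matched against ground truth):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean penalizes imbalance: a detector with P = 0.9 but R = 0.3 scores well below their arithmetic mean, which is why F is the headline metric in these benchmarks.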
Conclusions
In this paper, we propose a novel hybrid approach for text detection in natural scene images. We apply learning-based PDEs to provide text region candidates and devise a simple connected component based method to accurately locate the text in each region candidate. Experimental results show the robustness and superiority of our method compared to many state-of-the-art approaches. In the future, we plan to develop even better operators via learning-based PDEs for robust text region
Acknowledgment
Zhenyu Zhao is supported by National Natural Science Foundation of China (NSFC) (Grant no. 61473302). Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (Grant no. 2015CB352502), NSFC (Grant nos. 61272341 and 61231002), and Microsoft Research Asia Collaborative Research Program.
References (54)
- et al., Text extraction from natural scene image: a survey, Neurocomputing (2013)
- et al., Color text extraction with selective metric-based clustering, Comput. Vis. Image Underst. (2007)
- et al., Toward designing intelligent PDEs for computer vision: an optimal control approach, Image Vis. Comput. (2013)
- et al., Text analysis using local energy, Pattern Recognit. (2001)
- et al., Text extraction from scene images by character appearance and structure modeling, Comput. Vis. Image Underst. (2013)
- et al., Scene text detection using graph model built upon maximally stable extremal regions, Pattern Recognit. Lett. (2013)
- et al., Automatic caption localization in compressed video, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
- et al., Scene text recognition using similarity and a lexicon with sparse belief propagation, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
- et al., Text detection and recognition in imagery: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
- Boris Epshtein, Eyal Ofek, Yonatan Wexler, Detecting text in natural scenes with stroke width transform, in: CVPR, ...
- Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm, IEEE Trans. Pattern Anal. Mach. Intell.
- Text string detection from natural scenes by structure-based partition and grouping, IEEE Trans. Image Process.
- A hybrid approach to detect and localize texts in natural scene images, IEEE Trans. Image Process.
- An Introduction to Digital Image Processing
- A threshold selection method from gray-level histograms, Automatica
- Automatic text detection and tracking in digital video, IEEE Trans. Image Process.
- A Laplacian approach to multi-oriented text detection in video, IEEE Trans. Pattern Anal. Mach. Intell.
- Robust text detection in natural scene images, IEEE Trans. Pattern Anal. Mach. Intell.
Zhenyu Zhao received the B.S. degree in mathematics from University of Science and Technology in 2009, and the M.S. degree in system science from National University of Defense and Technology in 2011. He is currently pursuing the Ph.D. degree in applied mathematics at the National University of Defense and Technology. His research interests include computer vision, pattern recognition, and machine learning.
Cong Fang received the bachelor's degree in electronic science and technology (optoelectronic technology) from Tianjin University in 2014. He is currently pursuing the Ph.D. degree with the School of Electronics Engineering and Computer Science, Peking University. His research interests include computer vision, pattern recognition, machine learning, and optimization.
Zhouchen Lin received the Ph.D. degree in applied mathematics from Peking University in 2000. Currently, he is a professor at the Key Laboratory of Machine Perception (MOE), School of Electronics Engineering and Computer Science, Peking University. He is also a chair professor at Northeast Normal University. He was a guest professor at Shanghai Jiaotong University, Beijing Jiaotong University, and Southeast University, and a guest researcher at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include computer vision, image processing, machine learning, pattern recognition, and numerical optimization. He is an associate editor of IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision, and a senior member of the IEEE.
Yi Wu is a professor in the Department of Mathematics and System Science at the National University of Defense Technology in Changsha, China. He earned bachelor's and master's degrees in applied mathematics at the National University of Defense Technology in 1981 and 1988, respectively. He worked as a visiting researcher at New York State University in 1999. His research interests include applied mathematics, statistics, and data processing.