A robust hybrid method for text detection in natural scenes by learning-based partial differential equations
Introduction
Text detection and recognition in natural scene images have received increasing attention in recent years [1], [2], [3], [4], because text often provides critical information for understanding the high-level semantics of multimedia content, such as street view data [5], [6]. Moreover, the demands of a growing number of applications on mobile devices have generated great interest in this problem. Text detection in natural scene images is very challenging due to complex backgrounds, uneven lighting, blurring, degradation, distortion, and the diversity of text patterns.
Many methods have been proposed for scene text detection; they can be roughly divided into three categories: sliding window based methods [7], [8], [9], connected component (CC) based methods [5], [10], [11], and hybrid methods [12]. Sliding window based methods search for possible text in multi-scale windows and classify each window as text or non-text using texture features. However, they are often computationally expensive, since a large number of windows of various sizes must be checked and complex classifiers are applied. CC-based methods first extract character candidates as connected components using low-level features, e.g., color similarity and spatial layout. The character candidates are then grouped into words after eliminating false candidates by connected component analysis (CCA). The hybrid method [12] builds a text region detector to estimate the probability of text at different positions and scales, and extracts character candidates (connected components) by local binarization. The CC-based and hybrid methods are more popular than the sliding window based ones because they can achieve high precision once the candidate characters are correctly detected and grouped. However, this condition is often not met: the low-level operations are usually unreliable and sensitive to noise, which makes it difficult to extract the right character candidates. A large number of wrong character candidates causes many difficulties in post-processing steps such as grouping and classification.
Recently, Liu et al. [13] have proposed a framework that learns partial differential equations (PDEs) from training image pairs, which has been successfully applied to several computer vision and image processing problems. It can handle some mid- and high-level tasks that traditional PDE-based methods cannot. In [13] they apply learning-based PDEs to object detection, color2gray, and demosaicking. In [14], they use an adaptive (learning-based) PDE system for saliency detection. However, these methods [13], [14] may not handle text detection well, because text is a man-made object whose interpretation strongly depends on human perception. The complexity of backgrounds, flexible text styles, and the variation of text content make text detection more challenging than the previous tasks.
In most cases, text covers only a small area of the scene image (Fig. 1), so post-processing steps such as character candidate extraction benefit if we can narrow down the candidate text regions and suppress distracting backgrounds. Although we cannot expect the learned PDEs to produce an exactly binary text mask, because their solution is a more or less smooth function, they can give us a relatively high quality confidence map as a good reference. We therefore use learning-based PDEs to design a text region detector. It is much faster than sliding window based methods because its complexity is only O(N), where N is the number of pixels in the image. Some examples of the detected region candidates are shown in Fig. 1. To make our method complete and comparable to others, we further propose a simple method for detecting text within the region candidates.
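The O(N) cost comes from the explicit evolution scheme: each pixel is updated from a fixed-size neighborhood per time step. As a hedged illustration only (a plain diffusion equation, not the authors' learned PDE system), one explicit step over an image could look like:

```python
import numpy as np

def explicit_pde_step(u, dt=0.1):
    """One explicit time step of a simple diffusion PDE, du/dt = laplacian(u).

    Illustrative stand-in for one evolution step of a learned PDE: every
    pixel is updated from its 4-neighborhood, so each step costs O(N)
    in the number of pixels N.
    """
    # Shifted copies with replicated (Neumann) boundaries.
    up = np.roll(u, -1, axis=0);    up[-1, :] = u[-1, :]
    down = np.roll(u, 1, axis=0);   down[0, :] = u[0, :]
    left = np.roll(u, -1, axis=1);  left[:, -1] = u[:, -1]
    right = np.roll(u, 1, axis=1);  right[:, 0] = u[:, 0]
    lap = up + down + left + right - 4.0 * u  # discrete Laplacian
    return u + dt * lap
```

A constant image is a steady state of this scheme, and any non-constant image is smoothed, which mirrors why the solution of the learned PDEs is "a more or less smooth function" rather than a binary mask.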
In summary, we propose a new robust hybrid method using learning-based PDEs. It combines a top-down scheme and a bottom-up scheme to extract text in natural scene images. Fig. 2 shows the flow chart of our system and Fig. 3 gives an example. A PDE system is first learned off-line with L1-norm regularization on the training images. In the top-down scheme, given a test image as the initial condition, we solve the learned PDEs to produce a high quality text confidence map (see Fig. 3(b)). We then apply a local binarization algorithm (Niblack [15]) to the confidence map to extract text region candidates (see Fig. 3(c)). In the bottom-up scheme, we present a simple connected component based method and apply it to each region candidate to determine accurate text locations. We first apply the mean shift algorithm and binarization (Otsu [16]) to extract character candidates (see Fig. 3(d)). Then we group these components into text lines based simply on their color and size (see Fig. 3(e)). Next, we adopt a two-level classification scheme (a character candidate classifier and a text candidate classifier) to eliminate non-text candidates, which yields the final result (Fig. 3(f)). Our system is evaluated on several benchmark databases and achieves higher F-measures than other methods. Note that the parameters and classifiers are trained only on the ICDAR 2011 database [17], as only this database provides the required training information, yet the proposed approach still yields higher precision and recall on the SVT databases [5], [6] than other state-of-the-art methods. We summarize the contributions of this paper as follows:
- We propose a new hybrid method for text detection in natural scene images. Unlike previous methods, our method consists of loosely coupled top-down and bottom-up schemes, where the latter can be replaced by any connected component based method.
- We apply learning-based PDEs to compute a high quality text confidence map, from which good text region candidates can easily be chosen. Unlike sliding window based methods, the complexity of learning-based PDEs for text candidate proposal is only O(N), where N is the number of pixels, so our learning-based PDEs are much faster. To the best of our knowledge, this is the first work that applies PDEs to text detection.
- We conduct extensive experiments on benchmark databases to demonstrate the superiority of our method over the state-of-the-art ones in detection accuracy. Note that, unlike previous approaches, all procedures after computing the text confidence map are very simple; the performance could be further improved if more sophisticated and ad hoc treatments were involved.
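The two-stage flow described above can be sketched as a pipeline skeleton. Every stage here is an injected placeholder callable, not an API from the paper; it only shows how the loosely coupled top-down and bottom-up schemes fit together:

```python
def detect_text(image, solve_pdes, binarize_map, extract_regions,
                extract_chars, char_clf, group_lines, text_clf):
    """Skeleton of the hybrid pipeline; each stage is a caller-supplied callable.

    Stage names are hypothetical placeholders: solve_pdes stands for the
    learned-PDE confidence map, binarize_map for Niblack local binarization,
    extract_chars for mean shift + Otsu, and the two classifiers for the
    two-level filtering scheme.
    """
    confidence_map = solve_pdes(image)                       # top-down scheme
    regions = extract_regions(binarize_map(confidence_map))  # region candidates
    detected = []
    for region in regions:                                   # bottom-up, per region
        chars = [c for c in extract_chars(region) if char_clf(c)]
        detected += [line for line in group_lines(chars) if text_clf(line)]
    return detected
```

Because the bottom-up stage enters only through `extract_chars`, `group_lines`, and the classifiers, it can be swapped for any CC-based method, as the first contribution states.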
The rest of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 describes the top-down scheme and Section 4 describes the bottom-up scheme. We discuss the relationship between our method and some related work in Section 5. Section 6 presents experiments that compare the proposed method and the state-of-the-art ones on several public databases. We conclude our paper in Section 7.
Related work
Most sliding window based methods first search for possible texts in multi-scale windows and then estimate the text existence probability by using classifiers. Zhong et al. [1] adopt image transformations, such as discrete cosine transform and wavelet decomposition, to extract features. They remove the non-text regions by thresholding the filter responses. Kim et al. [7] extract texture features from all local windows in every layer of image pyramid, which enables the method to detect texts at
Top-down: text confidence map computation and text region candidates extraction
As mentioned before, the proposed method consists of a top-down scheme and a bottom-up scheme. We introduce the top-down scheme in this section and leave the bottom-up scheme to the next section. We first introduce learning-based PDEs and then apply them to compute the text confidence map. We then adopt a local thresholding method to extract text region candidates.
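Niblack's method [15] thresholds each pixel against its local statistics, T(x, y) = m(x, y) + k·s(x, y), where m and s are the mean and standard deviation in a window around the pixel. A hedged sketch of this local binarization step (window size and k are illustrative choices, not the paper's exact settings):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(img, window=25, k=-0.2):
    """Niblack local thresholding: keep pixels above local_mean + k * local_std.

    Illustrative implementation of the binarization applied to the text
    confidence map; uniform_filter computes local means in O(N).
    """
    img = img.astype(np.float64)
    mean = uniform_filter(img, size=window)
    sq_mean = uniform_filter(img * img, size=window)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))  # Var = E[x^2] - E[x]^2
    return img > mean + k * std
```

With a negative k, pixels only slightly above the local mean still pass, which suits a smooth confidence map where text regions are brighter than their surroundings but not strictly binary.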
Bottom-up: text detection in the region candidates
Although using learning-based PDEs for text confidence map computation is our major contribution, to make our method complete and comparable to others, in this section we propose a simple framework to extract text strings in the detected text region candidates. It can be replaced by any CC-based method, and the performance can be further improved if more advanced CC-based methods are employed. This framework constitutes the bottom-up scheme and is mainly inspired by [21], [26]. It consists of three main
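The binarization step inside each region candidate uses Otsu's method [16], which picks the global threshold maximizing the between-class variance of the gray-level histogram. A minimal sketch of the standard algorithm (our own illustration, not the paper's code):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: the threshold that maximizes between-class variance.

    Sketch of the binarization used to extract character candidates inside
    each text region candidate; expects gray levels in [0, 255].
    """
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = hist.sum()
    sum_all = float(np.dot(np.arange(256), hist))
    best_t, best_var = 0, 0.0
    w0, sum0 = 0, 0.0
    for t in range(256):
        w0 += hist[t]                      # pixels at or below t
        if w0 == 0:
            continue
        w1 = total - w0                    # pixels above t
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

For a region whose mean shift segmentation yields roughly bimodal intensities (characters vs. background), the returned threshold separates the two modes, giving the character candidate mask.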
Discussions
In this section, we discuss the connection between our methods and some related work.
Experimental results
In this section, we compare our method with several state-of-the-art methods on a variety of public databases, including ICDAR 2005 [44], [45], ICDAR 2011 [17], and two street view text databases, i.e., the SVT 2010 database [5] and the SVT 2011 database [6]. The performance of these methods is quantitatively measured by precision (P), recall (R), and F-measure (F), computed using the same definitions as in [44], [45] at the image level. The overall performance values are
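The F-measure combines precision and recall as their harmonic mean, F = 2PR/(P + R). A minimal helper reflecting this standard definition (our own sketch; the ICDAR protocols additionally define how P and R themselves are matched against ground truth):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean penalizes imbalance: a detector with P = 0.9 but R = 0.3 scores well below their arithmetic mean, which is why F is the headline metric in these benchmarks.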
Conclusions
In this paper, we propose a novel hybrid approach for text detection in natural scene images. We apply learning-based PDEs to provide text region candidates and devise a simple connected component based method to accurately locate the text in each region candidate. Experimental results show the robustness and superiority of our method compared to many state-of-the-art approaches. In the future, we plan to develop even better operators via learning-based PDEs for robust text region
Acknowledgment
Zhenyu Zhao is supported by National Natural Science Foundation of China (NSFC) (Grant no. 61473302). Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (Grant no. 2015CB352502), NSFC (Grant nos. 61272341 and 61231002), and Microsoft Research Asia Collaborative Research Program.
References (54)
- et al., Text extraction from natural scene image: a survey, Neurocomputing (2013)
- et al., Color text extraction with selective metric-based clustering, Comput. Vis. Image Underst. (2007)
- et al., Toward designing intelligent PDEs for computer vision: an optimal control approach, Image Vis. Comput. (2013)
- et al., Text analysis using local energy, Pattern Recognit. (2001)
- et al., Text extraction from scene images by character appearance and structure modeling, Comput. Vis. Image Underst. (2013)
- et al., Scene text detection using graph model built upon maximally stable extremal regions, Pattern Recognit. Lett. (2013)
- et al., Automatic caption localization in compressed video, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
- et al., Scene text recognition using similarity and a lexicon with sparse belief propagation, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
- et al., Text detection and recognition in imagery: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
- Boris Epshtein, Eyal Ofek, Yonatan Wexler, Detecting text in natural scenes with stroke width transform, in: CVPR, ...
- Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm, IEEE Trans. Pattern Anal. Mach. Intell.
- Text string detection from natural scenes by structure-based partition and grouping, IEEE Trans. Image Process.
- A hybrid approach to detect and localize texts in natural scene images, IEEE Trans. Image Process.
- An Introduction to Digital Image Processing
- A threshold selection method from gray-level histograms, Automatica
- Automatic text detection and tracking in digital video, IEEE Trans. Image Process.
- A Laplacian approach to multi-oriented text detection in video, IEEE Trans. Pattern Anal. Mach. Intell.
- Robust text detection in natural scene images, IEEE Trans. Pattern Anal. Mach. Intell.
Zhenyu Zhao received the B.S. degree in mathematics from University of Science and Technology in 2009, and the M.S. degree in system science from National University of Defense and Technology in 2011. He is currently pursuing the Ph.D. degree in applied mathematics at the National University of Defense and Technology. His research interests include computer vision, pattern recognition, and machine learning.
Cong Fang received the bachelor's degree in electronic science and technology (optoelectronic technology) from Tianjin University in 2014. He is currently pursuing the Ph.D. degree with the School of Electronics Engineering and Computer Science, Peking University. His research interests include computer vision, pattern recognition, machine learning, and optimization.
Zhouchen Lin received the Ph.D. degree in applied mathematics from Peking University in 2000. Currently, he is a professor at the Key Laboratory of Machine Perception (MOE), School of Electronics Engineering and Computer Science, Peking University. He is also a chair professor at Northeast Normal University. He was a guest professor at Shanghai Jiaotong University, Beijing Jiaotong University, and Southeast University, and a guest researcher at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include computer vision, image processing, machine learning, pattern recognition, and numerical optimization. He is an associate editor of IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision, and a senior member of the IEEE.
Yi Wu is a professor in the Department of Mathematics and System Science at the National University of Defense Technology in Changsha, China. He earned bachelor's and master's degrees in applied mathematics at the National University of Defense Technology in 1981 and 1988, respectively. He worked as a visiting researcher at New York State University in 1999. His research interests include applied mathematics, statistics, and data processing.