Neurocomputing

Volume 168, 30 November 2015, Pages 23-34

A robust hybrid method for text detection in natural scenes by learning-based partial differential equations

https://doi.org/10.1016/j.neucom.2015.06.019

Abstract

Learning-based partial differential equations (PDEs), which combine fundamental differential invariants into a non-linear regressor, have been successfully applied to several computer vision tasks. In this paper, we present a robust hybrid method that uses learning-based PDEs for detecting texts from natural scene images. Our method consists of both top-down and bottom-up processing, which are loosely coupled. We first use learning-based PDEs to produce a text confidence map. Text region candidates are then detected from the map by local binarization and connected component clustering. In each text region candidate, character candidates are detected based on their color similarity and then grouped into text candidates by simple rules. Finally, we adopt a two-level classification scheme to remove the non-text candidates. Our method has a flexible structure, where the latter part can be replaced with any connected component based methods to further improve the detection accuracy. Experimental results on public benchmark databases, ICDAR and SVT, demonstrate the superiority and robustness of our hybrid approach.

Introduction

Text detection and recognition in natural scene images have received increasing attention in recent years [1], [2], [3], [4]. This is because text often provides critical information for understanding the high-level semantics of multimedia content, such as street view data [5], [6]. Moreover, the demands of a growing number of mobile applications have generated great interest in this problem. Text detection in natural scene images is very challenging due to complex backgrounds, uneven lighting, blurring, degradation, distortion, and the diversity of text patterns.

Many methods have been proposed for scene text detection, and they can be roughly divided into three categories: sliding window based methods [7], [8], [9], connected component (CC) based methods [5], [10], [11], and hybrid methods [12]. Sliding window based methods search for possible texts in multi-scale windows of an image and then classify the windows using texture features. However, they are often computationally expensive because a large number of windows of various sizes need to be checked and complex classifiers are used. CC-based methods first extract character candidates as connected components using low-level features, e.g., color similarity and spatial layout. The character candidates are then grouped into words after eliminating the wrong ones by connected component analysis (CCA). The hybrid method [12] creates a text region detector to estimate the probabilities of text positions at different scales and extracts character candidates (connected components) by local binarization. The CC-based and hybrid methods are more popular than the sliding window based ones because they can achieve high precision once the candidate characters are correctly detected and grouped. However, this condition is often not met: the low-level operations are usually unreliable and sensitive to noise, which makes it difficult to extract the right character candidates. A large number of wrong character candidates causes many difficulties in post-processing steps such as grouping and classification.
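
As a rough illustration of why the sliding-window paradigm is computationally expensive, the sketch below scans an image with multi-scale windows and scores each one with a caller-supplied texture classifier. The window sizes, stride, threshold, and the toy variance-based classifier are illustrative assumptions, not taken from any of the cited methods.

```python
import numpy as np

def sliding_window_detect(image, classifier, win_sizes=(32, 64, 128), stride=16):
    """Scan multi-scale windows and keep those the classifier scores as text.

    `classifier` maps a 2-D window to a text score; the sizes, stride and
    the 0.5 threshold are placeholder choices for illustration.
    """
    h, w = image.shape
    detections = []
    for size in win_sizes:
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                window = image[y:y + size, x:x + size]
                score = classifier(window)
                if score > 0.5:
                    detections.append((x, y, size, score))
    return detections

# Toy usage: a dummy "texture" score based on local contrast.
img = np.random.rand(240, 320)
hits = sliding_window_detect(img, classifier=lambda win: float(win.std() * 4))
```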

Recently, Liu et al. [13] have proposed a framework that learns partial differential equations (PDEs) from training image pairs, which has been successfully applied to several computer vision and image processing problems. It can handle some mid- and high-level tasks that traditional PDE-based methods cannot. In [13] they apply learning-based PDEs to object detection, color2gray, and demosaicking. In [14], they use an adaptive (learning-based) PDE system for saliency detection. However, these methods [13], [14] may not handle text detection well. This is because text is a man-made object and its interpretation strongly depends on human perception. The complexity of backgrounds, flexible text styles, and the variation of text contents make text detection more challenging than these previous tasks.
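
To give a flavor of the learning-based PDE framework of Liu et al. [13], the sketch below evolves an image by a weighted combination of a few fundamental differential invariants. The particular invariant list, the random coefficients, and the Euler step size are illustrative assumptions and do not reproduce the trained system used in this paper.

```python
import numpy as np

def invariants(u):
    """A few translation/rotation invariant differential terms of u."""
    uy, ux = np.gradient(u)
    uyy = np.gradient(uy, axis=0)
    uxx = np.gradient(ux, axis=1)
    return np.stack([
        np.ones_like(u),     # constant term
        u,                   # the image itself
        ux ** 2 + uy ** 2,   # squared gradient magnitude
        uxx + uyy,           # Laplacian
    ])

def evolve(u0, coeffs, dt=0.1):
    """Explicit Euler evolution du/dt = sum_i a_i(t) * inv_i(u)."""
    u = u0.astype(float).copy()
    for a_t in coeffs:                     # one coefficient vector per time step
        u = u + dt * np.tensordot(a_t, invariants(u), axes=1)
    return u

# Toy usage with random coefficients (a trained model would supply these).
rng = np.random.default_rng(0)
u0 = rng.random((64, 64))
coeffs = 0.01 * rng.standard_normal((20, 4))   # 20 time steps, 4 invariants
confidence_map = evolve(u0, coeffs)
```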

In most cases, texts cover only a small area of the scene image (Fig. 1). So it would be beneficial for post-processing, such as character candidate extraction, if we could narrow down the text region candidates and eliminate distracting backgrounds. Although we cannot expect the learned PDEs to produce an exactly binary text mask, because their solution should be a more or less smooth function, they can give us a relatively high quality confidence map as a good reference. So we use learning-based PDEs to design a text region detector. It is much faster than sliding window based methods because its complexity is only O(N), where N is the number of pixels in the image. Some examples containing the detected region candidates are shown in Fig. 1. To make our method complete and comparable to others, we further propose a simple method for detecting texts in the region candidates.

In summary, we propose a new robust hybrid method using learning-based PDEs. It incorporates both a top-down scheme and a bottom-up scheme to extract texts from natural scene images. Fig. 2 shows the flow chart of our system and Fig. 3 gives an example. A PDE system is first learned offline with L1-norm regularization on the training images. In the top-down scheme, given a test image as the initial condition, we solve the learned PDEs to produce a high quality text confidence map (see Fig. 3(b)). Then we apply a local binarization algorithm (Niblack [15]) to the confidence map to extract text region candidates (see Fig. 3(c)). In the bottom-up scheme, we present a simple connected component based method and apply it to each region candidate to determine accurate text locations. We first perform the mean shift algorithm and binarization (Otsu [16]) to extract character candidates (see Fig. 3(d)). Then we group these components into text lines simply based on their color and size (see Fig. 3(e)). Next, we adopt a two-level classification scheme (a character candidate classifier and a text candidate classifier) to eliminate the non-text candidates and obtain the final result (Fig. 3(f)). Our system is evaluated on several benchmark databases and achieves higher F-measures than other methods. Note that the parameters and classifiers are trained only on the ICDAR 2011 database [17], as only this database provides the required training information, yet the proposed approach still yields higher precision and recall on the SVT databases [5], [6] than other state-of-the-art methods. We summarize the contributions of this paper as follows:

  • We propose a new hybrid method for text detection in natural scene images. Unlike previous methods, our method consists of loosely coupled top-down and bottom-up schemes, where the latter part can be replaced by any connected component based method.

  • We apply learning-based PDEs to compute a high quality text confidence map, from which good text region candidates can be easily chosen. Unlike sliding window based methods, the complexity of learning-based PDEs for text candidate proposal is only O(N), where N is the number of pixels, so our learning-based PDEs are much faster. To the best of our knowledge, this is the first work that applies PDEs to text detection.

  • We conduct extensive experiments on benchmark databases to demonstrate the superiority of our method over the state-of-the-art ones in detection accuracy. Note that, unlike previous approaches, all the procedures after computing the text confidence map are very simple. The performance could be further improved if more sophisticated and ad hoc treatments were involved.
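
To make the overall flow described in the preceding overview concrete, the sketch below wires the two schemes together. Every stage is passed in as a callable because this skeleton does not reproduce the authors' implementations of the individual steps; the parameter names are hypothetical.

```python
def detect_text(image, solve_pde, extract_regions, extract_chars,
                char_filter, group_chars, text_filter):
    """Skeleton of the hybrid pipeline; all stages are caller-supplied callables.

    Top-down: the learned PDEs give a confidence map, from which text region
    candidates are extracted (Section 3).  Bottom-up: inside each region,
    character candidates are extracted, grouped, and filtered (Section 4).
    """
    conf_map = solve_pde(image)                       # learned PDEs
    results = []
    for region in extract_regions(conf_map):          # Niblack + connected components
        chars = [c for c in extract_chars(image, region) if char_filter(c)]
        for line in group_chars(chars):               # simple color/size rules
            if text_filter(line):                     # text candidate classifier
                results.append(line)
    return results
```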

The rest of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 describes the top-down scheme and Section 4 describes the bottom-up scheme. We discuss the relationship between our method and some related work in Section 5. Section 6 presents experiments that compare the proposed method and the state-of-the-art ones on several public databases. We conclude our paper in Section 7.

Section snippets

Related work

Most sliding window based methods first search for possible texts in multi-scale windows and then estimate the text existence probability using classifiers. Zhong et al. [1] adopt image transformations, such as the discrete cosine transform and wavelet decomposition, to extract features. They remove the non-text regions by thresholding the filter responses. Kim et al. [7] extract texture features from all local windows in every layer of an image pyramid, which enables the method to detect texts at

Top-down: text confidence map computation and text region candidates extraction

As mentioned before, the proposed method has both a top-down scheme and a bottom-up scheme. We introduce the top-down scheme in this section and leave the bottom-up scheme to the next section. We first introduce learning-based PDEs and then apply them to compute the text confidence map. We then adopt a local thresholding method for text region candidate extraction.
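
As a concrete sketch of this top-down step, the code below applies Niblack-style local thresholding (T = m + k·s over a sliding window) to a confidence map and groups the resulting foreground into region candidates via connected components. The window size, k, and the minimum-area filter are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np
from scipy import ndimage

def niblack_binarize(conf_map, window=31, k=-0.2):
    """Niblack local threshold T = m + k*s computed over a sliding window."""
    mean = ndimage.uniform_filter(conf_map, size=window)
    sq_mean = ndimage.uniform_filter(conf_map ** 2, size=window)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    return conf_map > mean + k * std

def region_candidates(binary_map, min_area=100):
    """Bounding boxes (x, y, w, h) of connected components above min_area."""
    labels, _ = ndimage.label(binary_map)
    boxes = []
    for sl in ndimage.find_objects(labels):
        ys, xs = sl
        h, w = ys.stop - ys.start, xs.stop - xs.start
        if h * w >= min_area:
            boxes.append((xs.start, ys.start, w, h))
    return boxes

# Usage: regions = region_candidates(niblack_binarize(confidence_map))
```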

Bottom-up: text detection in the region candidates

Although using learning-based PDEs for text confidence map computation is our major contribution, to make our method complete and comparable to others, in this section we propose a simple framework to extract text strings in the detected text region candidates. It can be replaced by any CC-based method, and the performance can be further improved if more advanced CC-based methods are employed. The framework is the bottom-up scheme and is mainly inspired by [21], [26]. It consists of three main
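
A minimal sketch of the grouping step in this bottom-up scheme is given below: character candidates are merged into text candidates whenever they have comparable heights, similar mean colors, and a small horizontal gap. The candidate representation and all three thresholds are illustrative assumptions, not the rules or values used in the paper.

```python
import numpy as np

def similar(a, b, size_ratio=1.5, color_dist=40.0, gap_factor=2.0):
    """Pairwise rule: comparable height, similar mean color, small horizontal gap.

    Each candidate is (x, y, w, h, mean_rgb); all thresholds are placeholders.
    """
    ha, hb = a[3], b[3]
    if max(ha, hb) > size_ratio * min(ha, hb):
        return False
    if np.linalg.norm(np.asarray(a[4], float) - np.asarray(b[4], float)) > color_dist:
        return False
    gap = max(a[0], b[0]) - min(a[0] + a[2], b[0] + b[2])
    return gap < gap_factor * min(ha, hb)

def group_candidates(cands):
    """Transitively merge pairwise-similar candidates (union-find)."""
    parent = list(range(len(cands)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(cands)):
        for j in range(i + 1, len(cands)):
            if similar(cands[i], cands[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, c in enumerate(cands):
        groups.setdefault(find(i), []).append(c)
    return list(groups.values())

# Usage: lines = group_candidates([(10, 5, 8, 12, (200, 30, 30)), ...])
```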

Discussions

In this section, we discuss the connections between our method and some related work.

Experimental results

In this section, we compare our method with several state-of-the-art methods on a variety of public databases, including the ICDAR 2005 [44], [45], ICDAR 2011 [17], and the two street view text databases, i.e., the SVT 2010 database [5] and the SVT 2011 database [6]. The performance of these methods is quantitatively measured by precision (P), recall (R), and F-measure (F). They are computed using the same definitions as those in [44], [45] at the image level. The overall performance values are
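
For reference, the snippet below shows only how precision, recall, and F-measure combine from match counts; it does not reproduce the rectangle-matching protocol of [44], [45], and the example numbers are arbitrary.

```python
def prf(matched_detections, num_detections, matched_ground_truth, num_ground_truth):
    """Precision, recall and F-measure from match counts (F = 2PR / (P + R))."""
    p = matched_detections / num_detections if num_detections else 0.0
    r = matched_ground_truth / num_ground_truth if num_ground_truth else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

# Example: prf(80, 100, 70, 100) -> (0.8, 0.7, 0.7466...)
```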

Conclusions

In this paper, we propose a novel hybrid approach for text detection in natural scene images. We apply learning-based PDEs to provide text region candidates and devise a simple connected component based method to locate the texts accurately in each text region candidate. Experimental results show the robustness and superiority of our method compared to many state-of-the-art approaches. In the future, we plan to develop even better operators via learning-based PDEs for robust text region

Acknowledgment

Zhenyu Zhao is supported by National Natural Science Foundation of China (NSFC) (Grant no. 61473302). Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (Grant no. 2015CB352502), NSFC (Grant nos. 61272341 and 61231002), and Microsoft Research Asia Collaborative Research Program.

References (54)

  • Kai Wang, Boris Babenko, Serge Belongie, End-to-end scene text recognition, in: ICCV, IEEE, 2011, pp....
  • Kwang In Kim et al., Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm, IEEE Trans. Pattern Anal. Mach. Intell. (2003)
  • Xiangrong Chen, Alan L Yuille, Detecting and reading text in natural scenes, in: CVPR, IEEE,...
  • Shehzad Muhammad Hanif, Lionel Prevost, Text detection and localization in complex scene images using constrained...
  • Chucai Yi et al., Text string detection from natural scenes by structure-based partition and grouping, IEEE Trans. Image Process. (2011)
  • Yi-Feng Pan et al., A hybrid approach to detect and localize texts in natural scene images, IEEE Trans. Image Process. (2011)
  • Risheng Liu, Junjie Cao, Zhouchen Lin, Shiguang Shan, Adaptive partial differential equation learning for visual...
  • Wayne Niblack, An Introduction to Digital Image Processing (1986)
  • Nobuyuki Otsu, A threshold selection method from gray-level histograms, Automatica (1975)
  • Asif Shahab, Faisal Shafait, Andreas Dengel, ICDAR 2011 robust reading competition challenge 2: reading text in scene...
  • Huiping Li et al., Automatic text detection and tracking in digital video, IEEE Trans. Image Process. (2000)
  • Jung-Jin Lee, Pyoung-Hean Lee, Seong-Whan Lee, Alan L Yuille, Christof Koch, Adaboost for text detection in natural...
  • Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David J. Wu, Andrew Y. Ng, Text...
  • Palaiahnakote Shivakumara et al., A Laplacian approach to multi-oriented text detection in video, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • Luka Neumann, Jiri Matas, Scene text localization and recognition with oriented stroke detection, in: ICCV, 2013, pp....
  • XuCheng Yin et al., Robust text detection in natural scene images, IEEE Trans. Pattern Anal. Mach. Intell. (2014)

Zhenyu Zhao received the B.S. degree in mathematics from University of Science and Technology in 2009, and the M.S. degree in system science from the National University of Defense Technology in 2011. He is currently pursuing the Ph.D. degree in applied mathematics at the National University of Defense Technology. His research interests include computer vision, pattern recognition and machine learning.

Cong Fang received the bachelor's degree in electronic science and technology (optoelectronic technology) from Tianjin University in 2014. He is currently pursuing the Ph.D. degree with the School of Electronics Engineering and Computer Science, Peking University. His research interests include computer vision, pattern recognition, machine learning and optimization.

Zhouchen Lin received the Ph.D. degree in applied mathematics from Peking University in 2000. Currently, he is a professor at the Key Laboratory of Machine Perception (MOE), School of Electronics Engineering and Computer Science, Peking University. He is also a chair professor at Northeast Normal University. He was a guest professor at Shanghai Jiaotong University, Beijing Jiaotong University, and Southeast University. He was also a guest researcher at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include computer vision, image processing, machine learning, pattern recognition, and numerical optimization. He is an associate editor of IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision, and a senior member of the IEEE.

Yi Wu is a professor in the Department of Mathematics and System Science at the National University of Defense Technology in Changsha, China. He earned bachelor's and master's degrees in applied mathematics at the National University of Defense Technology in 1981 and 1988. He worked as a visiting researcher at New York State University in 1999. His research interests include applied mathematics, statistics, and data processing.
