A text reading algorithm for natural images

https://doi.org/10.1016/j.imavis.2013.01.003

Abstract

Reading text in natural images has once again attracted the attention of many researchers during the last few years, due to the increasing availability of cheap image-capturing devices in low-cost products like mobile phones. Since text can be found in almost any environment, the applicability of text-reading systems is extensive. In this paper we present a robust method to read text in natural images. It is composed of two main separate stages. First, text is located in the image using a set of simple and fast-to-compute features, based on geometric and gradient properties, that are highly discriminative between character and non-character objects. The second stage carries out the recognition of the previously detected text: it uses gradient features to recognize single characters and Dynamic Programming (DP) to correct misspelled words. Experimental results obtained on several challenging datasets show that the proposed system exceeds state-of-the-art performance, both in terms of localization and recognition.

Highlights

  • A text detection and recognition method for natural images is proposed.
  • Text location is based on simple and fast-to-compute features.
  • Text recognition is based on simple gradient features.
  • Misspelled words are corrected using Dynamic Programming.
  • State-of-the-art performance is improved.

Introduction

Automatic text recognition has traditionally focused on analyzing scanned documents. In recent years, however, digital cameras have been embedded in low-cost consumer products such as mobile phones and tablet PCs, and user applications related to digital image processing have become very popular. Nevertheless, automatic text recognition in natural images remains one of the most challenging problems in computer vision. Since textual information can be found in almost any environment, both indoors and outdoors, the range of applications of automatic text-reading systems is wide: support for visually impaired people, automatic geocoding of businesses, robot navigation in indoor and outdoor environments, image spam filtering, driver assistance and translation services for tourists, among others.

Up to now, most works have focused on specific subsets of the problem, such as extracting text from CD cover images [1] or segmenting text in web images [2]. This is due to the wide variety of text appearance: different fonts, thicknesses, colors, sizes and textures, as well as geometric distortions, partial occlusions, varying lighting conditions, image resolutions and languages. In this paper, we propose a system to read text in natural images in any kind of scenario, both indoors and outdoors; we only constrain the problem to machine-printed text in English. To benchmark the performance of the proposed system, results have been obtained on several datasets that include images from different scenarios and situations. These datasets were released for the robust reading competitions held at the ICDAR (International Conference on Document Analysis and Recognition) 2003, 2005 and 2011 conferences, and most researchers in this field use them as benchmarks.

This paper makes a number of contributions. First, we propose a segmentation method that combines Maximally Stable Extremal Regions (MSER) [3] with a locally adaptive thresholding method. Second, we carry out a thorough study of different features for describing text, whose main results are reported here, and propose a set of fast-to-compute features that discriminate between character and non-character objects. We also propose a restoration stage, based on position and size features, that brings back erroneously rejected characters; the results section will show the importance of this stage. In addition, a new feature is proposed for recognizing single characters. Finally, misspelled words are corrected using DP, with substitution costs based on the confusion matrix of the character recognizer.

The remainder of the paper is organized as follows. Section 2 overviews the main state-of-the-art methods. Section 3 presents a short general description of the proposed method. Section 4 describes the study carried out to obtain a set of features that allow us to distinguish characters from non-characters. Section 5 describes the text location algorithm, while Section 6 explains the recognition approach. Experimental results are extensively described in Section 7. Section 8 concludes the paper, highlighting the main conclusions and future work.

Section snippets

Related work

Automatic text location and recognition has long been one of the main challenges in computer vision. Research on text location in particular has been so extensive that an exhaustive overview of all implemented methods is impractical; in this section we focus on similar works from the last decade.

Yao et al. [4] use locally adaptive thresholding to segment an image. Then, certain geometric features are extracted from connected…

General overview of the system

Fig. 1 shows the flowchart of the proposed framework, which is made up of two main blocks. The text location block aims at precisely locating the text in the image and discarding those parts of the image that do not contain text; it is thoroughly explained in Section 5. The recognition block aims at recognizing the text detected in the previous stage; it is detailed in Section 6.

Text features analysis

In order to obtain a set of distinctive features capable of distinguishing character objects from non-character objects, we have analyzed certain text features on the ICDAR 2003 Robust Reading Competition dataset. This dataset contains a total of 509 realistic images with complex backgrounds, captured in a wide variety of situations, with different cameras, at different resolutions and under different lighting conditions. The dataset is divided into two sections: a training set that…
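As the conclusions note, these per-object features turn out to be well modeled by Gaussian distributions, which makes it straightforward to turn training statistics into acceptance thresholds. The following minimal Python sketch illustrates the idea for a single hypothetical feature (bounding-box aspect ratio); the feature choice, the toy data and the k = 2.5 tolerance are illustrative assumptions, not the paper's exact values.

    import numpy as np

    def fit_gaussian(samples):
        # Fit a 1-D Gaussian to a feature measured on known character objects.
        mu = float(np.mean(samples))
        sigma = float(np.std(samples, ddof=1))
        return mu, sigma

    def acceptance_interval(mu, sigma, k=2.5):
        # Accept candidates within k standard deviations of the character mean
        # (k is an illustrative tolerance, not the paper's threshold).
        return mu - k * sigma, mu + k * sigma

    # Toy aspect-ratio (width / height) samples from character bounding boxes.
    char_aspect_ratios = np.array([0.45, 0.60, 0.52, 0.70, 0.38, 0.55])
    mu, sigma = fit_gaussian(char_aspect_ratios)
    lo, hi = acceptance_interval(mu, sigma)

    candidate = 0.58
    is_character_like = lo <= candidate <= hi
    print(f"interval=({lo:.2f}, {hi:.2f}), candidate kept: {is_character_like}")

Because the fit depends only on the mean and spread of the training statistics, the same procedure applies to any dataset, which is what the conclusions argue.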

Text location

The flowchart of our text location algorithm is shown in Fig. 5. We find letter candidates with a segmentation method based on MSER and a locally adaptive thresholding method. The resulting candidates are filtered using certain constraints based on the prior features presented in the previous section. Character candidates are grouped into lines, and each line is classified as text or non-text using an HOG-based classifier. Finally, words within a text line are separated, giving segmented word…
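For concreteness, the candidate-extraction and geometric-filtering step can be sketched as below, assuming OpenCV's MSER implementation. The geometric limits used here are illustrative placeholders, not the thresholds derived in the feature study.

    import cv2

    def letter_candidates(gray):
        # Extract extremal regions as letter candidates (MSER part of the
        # segmentation; the paper also combines this with locally adaptive
        # thresholding, omitted here for brevity).
        mser = cv2.MSER_create()
        regions, bboxes = mser.detectRegions(gray)
        candidates = []
        for pts, (x, y, w, h) in zip(regions, bboxes):
            aspect = w / float(h)
            fill = len(pts) / float(w * h)  # occupancy of the bounding box
            # Simple geometric constraints to reject obvious non-characters;
            # the numeric limits are illustrative.
            if 0.1 < aspect < 2.0 and 0.1 < fill < 0.9 and h > 8:
                candidates.append((int(x), int(y), int(w), int(h)))
        return candidates

    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError("scene.jpg")
    boxes = letter_candidates(img)

The surviving boxes would then be grouped into lines and passed to the HOG-based text/non-text classifier described above.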

Text recognition

The flowchart of the text recognition algorithm is shown in Fig. 12. Single characters are recognized using a classification approach based on K-Nearest Neighbors (KNN) and gradient direction features. Then, a unigram language model is applied to correct misspelled words.
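The word-correction idea, described in the abstract as DP with substitution costs taken from the character recognizer's confusion matrix, can be illustrated with a weighted edit distance against a lexicon. This is a minimal sketch under assumed toy values: the lexicon, the confusion matrix and the cost mapping are illustrative, not the paper's.

    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    IDX = {c: i for i, c in enumerate(ALPHABET)}

    # Toy confusion matrix: conf[i, j] ~ P(recognizer outputs j | true char i).
    conf = np.full((26, 26), 0.01)
    np.fill_diagonal(conf, 0.75)
    conf[IDX["l"], IDX["i"]] = conf[IDX["i"], IDX["l"]] = 0.20  # often confused

    def sub_cost(a, b):
        # Frequently confused pairs are cheap to substitute (assumed mapping).
        return 0.0 if a == b else 1.0 - conf[IDX[a], IDX[b]]

    def weighted_edit_distance(recognized, word, ins_del_cost=1.0):
        # Classic DP over the edit lattice with confusion-aware substitutions.
        n, m = len(recognized), len(word)
        d = np.zeros((n + 1, m + 1))
        d[:, 0] = np.arange(n + 1) * ins_del_cost
        d[0, :] = np.arange(m + 1) * ins_del_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i, j] = min(d[i - 1, j] + ins_del_cost,   # deletion
                              d[i, j - 1] + ins_del_cost,   # insertion
                              d[i - 1, j - 1]
                              + sub_cost(recognized[i - 1], word[j - 1]))
        return d[n, m]

    def correct(recognized, lexicon):
        # Return the lexicon word with the lowest weighted edit distance.
        return min(lexicon, key=lambda w: weighted_edit_distance(recognized, w))

    print(correct("helio", ["hello", "helix", "hero"]))  # -> "hello"

Because 'i' and 'l' are cheap to swap under the toy confusion matrix, "helio" maps to "hello" rather than to the equally close (in unweighted edit distance) "helix".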

Experimental results

We evaluate the proposed method by running it on several public datasets and comparing it to the state of the art. The chosen datasets have been used as benchmarks by most researchers working in the field of text location and recognition during the last decade. A Robust Reading Competition was organized at ICDAR 2003 [29], [27]. The competition was divided into three sub-problems: text location, character recognition and word recognition. Here, we show our results for the three problems. The…

Conclusions and future work

We have presented a method to localize and recognize text in natural images. The text location is a CC-based approach that extracts and filters basic letter candidates using a series of simple and fast-to-compute features. These features have been analyzed on a challenging training dataset that contains different types of text in a huge variety of situations, and they have proved to follow a Gaussian distribution. This means that they can be used with any dataset, independently of text size, color…

Acknowledgments

This work has been financed with funds from the Ministerio de Economía y Competitividad through the project ADD-Gaze (TRA2011-29001-C04-01), as well as from the Comunidad de Madrid through the project Robocity2030 (CAM-S-0505/DPI000176).

References (39)

  • D. Karatzas et al., Colour text segmentation in web images based on human perception, Image Vision Comput. (2007)
  • H. Bay et al., Speeded-up robust features (SURF), Comput. Vis. Image Underst. (2008)
  • S. Escalera et al., Text detection in urban scenes
  • J. Matas et al., Robust wide baseline stereo from maximally stable extremal regions
  • J.-L. Yao, Y.-Q. Wang, L.-B. Weng, Y.-P. Yang, Locating text based on connected component and SVM, in: Wavelet Analysis...
  • L. Neumann et al., Real-time scene text localization and recognition
  • B. Epshtein et al., Detecting text in natural scenes with stroke width transform
  • H. Chen et al., Robust text detection in natural images with edge-enhanced maximally stable extremal regions
  • Y.-F. Pan et al., A hybrid approach to detect and localize texts in natural scene images, IEEE Trans. Image Process. (2011)
  • N. Dalal et al., Histograms of oriented gradients for human detection
  • J. Sochman et al., WaldBoost — learning for time constrained sequential detection
  • W. Niblack, An introduction to digital image processing (1986)
  • J.D. Lafferty et al., Conditional random fields: probabilistic models for segmenting and labeling sequence data
  • X. Chen et al., Detecting and reading text in natural scenes
  • S.M. Hanif et al., Text detection and localization in complex scene images using constrained AdaBoost algorithm
  • S. Belongie et al., Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
  • A.C. Berg et al., Shape matching and object recognition using low distortion correspondences
  • D.G. Lowe, Object recognition from local scale-invariant features
  • S. Lazebnik et al., A sparse texture representation using local affine regions, IEEE Trans. Pattern Anal. Mach. Intell. (2005)

    This paper has been recommended for acceptance by Enrique Dunn.
