A text reading algorithm for natural images

https://doi.org/10.1016/j.imavis.2013.01.003

Abstract

Reading text in natural images has once again attracted the attention of many researchers during the last few years, due to the increasing availability of cheap image-capturing devices in low-cost products like mobile phones. Since text can be found in almost any environment, the applicability of text-reading systems is extensive. In this paper we present a robust method to read text in natural images. It is composed of two main separate stages. First, text is located in the image using a set of simple and fast-to-compute features, based on geometric and gradient properties, that are highly discriminative between character and non-character objects. The second stage carries out the recognition of the previously detected text: it uses gradient features to recognize single characters and Dynamic Programming (DP) to correct misspelled words. Experimental results obtained on several challenging datasets show that the proposed system exceeds state-of-the-art performance, both in terms of localization and recognition.

Highlights

  • A text detection and recognition method for natural images is proposed.
  • Text location is based on simple and fast-to-compute features.
  • Text recognition is based on simple gradient features.
  • Misspelled words are corrected using Dynamic Programming.
  • State-of-the-art performance is improved.

Introduction

Automatic text recognition has traditionally focused on analyzing scanned documents. In recent years, however, digital cameras have been embedded in low-cost consumer products such as mobile phones and tablet PCs, and user applications related to digital image processing have become very popular. Nevertheless, automatic text recognition in natural images remains one of the most challenging problems in computer vision. Since textual information can be found in almost any environment, both indoors and outdoors, the range of applications of automatic text-reading systems is wide: support for visually impaired people, automatic geocoding of businesses, robot navigation in indoor and outdoor environments, image spam filtering, driver assistance and translation services for tourists, among others.

Up to now, most works have focused on specific subsets of the problem, such as extracting text from CD cover images [1] or segmenting text in web images [2]. This is due to the wide variety of text appearance: different fonts, thicknesses, colors, sizes and textures, as well as geometric distortions, partial occlusions, varying lighting conditions, image resolutions and languages. In this paper, we propose a system to read text in natural images in any kind of scenario, both indoors and outdoors; we only constrain the problem to machine-printed text in English. To benchmark the performance of the proposed system, results have been obtained on several datasets that include images from different scenarios and situations. These datasets were released for the robust reading competitions held at the ICDAR (International Conference on Document Analysis and Recognition) 2003, 2005 and 2011 conferences, and most researchers in this field use them as benchmarks.

This paper makes a number of contributions. First, we propose a segmentation method that combines Maximally Stable Extremal Regions (MSER) [3] with a locally adaptive thresholding method. Second, we carry out a thorough study of different features for describing text, whose main results are reported here, and propose a set of fast-to-compute features that discriminate between character and non-character objects. We also propose a restoration stage, based on position and size features, that brings back erroneously rejected characters; the results section will show the importance of this stage. In addition, a new feature is proposed for recognizing single characters. Finally, misspelled words are corrected using DP, with substitution costs based on the confusion matrix of the character recognizer.

The remainder of the paper is organized as follows. Section 2 overviews the main state-of-the-art methods. Section 3 presents a short general description of the proposed method. Section 4 describes the study carried out to obtain a set of features that allow us to distinguish characters from non-characters. Section 5 describes the text location algorithm, while Section 6 explains the recognition approach. Experimental results are extensively described in Section 7. Section 8 concludes the paper, highlighting the main conclusions and future work.

Section snippets

Related work

Automatic text location and recognition has long been one of the main challenges in computer vision. Research on text location in particular has been so extensive that an exhaustive overview of all implemented methods is impractical; in this section we focus on similar works from the last decade.

Yao et al. [4] use locally adaptive thresholding to segment an image. Then, certain geometric features are extracted from connected…

General overview of the system

Fig. 1 shows the flowchart of the proposed framework, which is made up of two main blocks. The text location block aims at precisely locating the text in the image and discarding those parts of the image that do not contain text; it is thoroughly explained in Section 5. The recognition block aims at recognizing the text detected in the previous stage; it is detailed in Section 6.

Text features analysis

In order to obtain a set of distinctive features capable of distinguishing character objects from non-character objects, we have analyzed certain text features on the ICDAR 2003 Robust Reading Competition dataset. This dataset contains a total of 509 realistic images with complex backgrounds, captured in a wide variety of situations, with different cameras, at different resolutions and under different lighting conditions. The dataset is divided into two sections: a training set that…
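As the conclusions note, these per-object features turn out to be well modeled by Gaussian distributions, which makes it straightforward to turn training statistics into acceptance thresholds. The following minimal Python sketch illustrates the idea for a single hypothetical feature (bounding-box aspect ratio); the feature choice, the toy data and the k = 2.5 tolerance are illustrative assumptions, not the paper's exact values.

    import numpy as np

    def fit_gaussian(samples):
        # Fit a 1-D Gaussian to a feature measured on known character objects.
        mu = float(np.mean(samples))
        sigma = float(np.std(samples, ddof=1))
        return mu, sigma

    def acceptance_interval(mu, sigma, k=2.5):
        # Accept candidates within k standard deviations of the character mean
        # (k is an illustrative tolerance, not the paper's threshold).
        return mu - k * sigma, mu + k * sigma

    # Toy aspect-ratio (width / height) samples from character bounding boxes.
    char_aspect_ratios = np.array([0.45, 0.60, 0.52, 0.70, 0.38, 0.55])
    mu, sigma = fit_gaussian(char_aspect_ratios)
    lo, hi = acceptance_interval(mu, sigma)

    candidate = 0.58
    is_character_like = lo <= candidate <= hi
    print(f"interval=({lo:.2f}, {hi:.2f}), candidate kept: {is_character_like}")

Because the fit depends only on the mean and spread of the training statistics, the same procedure applies to any dataset, which is what the conclusions argue.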

Text location

The flowchart of our text location algorithm is shown in Fig. 5. We find letter candidates with a segmentation method based on MSER and a locally adaptive thresholding method. The resulting candidates are filtered using certain constraints based on the prior features presented in the previous section. Character candidates are grouped into lines, and each line is classified as text or non-text using an HOG-based classifier. Finally, words within a text line are separated, giving segmented word…
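For concreteness, the candidate-extraction and geometric-filtering step can be sketched as below, assuming OpenCV's MSER implementation. The geometric limits used here are illustrative placeholders, not the thresholds derived in the feature study.

    import cv2

    def letter_candidates(gray):
        # Extract extremal regions as letter candidates (MSER part of the
        # segmentation; the paper also combines this with locally adaptive
        # thresholding, omitted here for brevity).
        mser = cv2.MSER_create()
        regions, bboxes = mser.detectRegions(gray)
        candidates = []
        for pts, (x, y, w, h) in zip(regions, bboxes):
            aspect = w / float(h)
            fill = len(pts) / float(w * h)  # occupancy of the bounding box
            # Simple geometric constraints to reject obvious non-characters;
            # the numeric limits are illustrative.
            if 0.1 < aspect < 2.0 and 0.1 < fill < 0.9 and h > 8:
                candidates.append((int(x), int(y), int(w), int(h)))
        return candidates

    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError("scene.jpg")
    boxes = letter_candidates(img)

The surviving boxes would then be grouped into lines and passed to the HOG-based text/non-text classifier described above.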

Text recognition

The flowchart of the text recognition algorithm is shown in Fig. 12. Single characters are recognized using a classification approach based on K-Nearest Neighbors (KNN) and gradient direction features. Then, a unigram language model is applied to correct misspelled words.
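The word-correction idea, described in the abstract as DP with substitution costs taken from the character recognizer's confusion matrix, can be illustrated with a weighted edit distance against a lexicon. This is a minimal sketch under assumed toy values: the lexicon, the confusion matrix and the cost mapping are illustrative, not the paper's.

    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    IDX = {c: i for i, c in enumerate(ALPHABET)}

    # Toy confusion matrix: conf[i, j] ~ P(recognizer outputs j | true char i).
    conf = np.full((26, 26), 0.01)
    np.fill_diagonal(conf, 0.75)
    conf[IDX["l"], IDX["i"]] = conf[IDX["i"], IDX["l"]] = 0.20  # often confused

    def sub_cost(a, b):
        # Frequently confused pairs are cheap to substitute (assumed mapping).
        return 0.0 if a == b else 1.0 - conf[IDX[a], IDX[b]]

    def weighted_edit_distance(recognized, word, ins_del_cost=1.0):
        # Classic DP over the edit lattice with confusion-aware substitutions.
        n, m = len(recognized), len(word)
        d = np.zeros((n + 1, m + 1))
        d[:, 0] = np.arange(n + 1) * ins_del_cost
        d[0, :] = np.arange(m + 1) * ins_del_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i, j] = min(d[i - 1, j] + ins_del_cost,   # deletion
                              d[i, j - 1] + ins_del_cost,   # insertion
                              d[i - 1, j - 1]
                              + sub_cost(recognized[i - 1], word[j - 1]))
        return d[n, m]

    def correct(recognized, lexicon):
        # Return the lexicon word with the lowest weighted edit distance.
        return min(lexicon, key=lambda w: weighted_edit_distance(recognized, w))

    print(correct("helio", ["hello", "helix", "hero"]))  # -> "hello"

Because 'i' and 'l' are cheap to swap under the toy confusion matrix, "helio" maps to "hello" rather than to the equally close (in unweighted edit distance) "helix".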

Experimental results

We evaluate the proposed method by running it on several public datasets and comparing it to the state of the art. The chosen datasets have been used as benchmarks by most researchers working in the field of text location and recognition during the last decade. A Robust Reading Competition was organized at ICDAR 2003 [29], [27]. The competition was divided into three sub-problems: text location, character recognition and word recognition. Here, we show our results for the three problems. The…

Conclusions and future work

We have presented a method to localize and recognize text in natural images. The text location is a CC-based approach that extracts and filters basic letter candidates using a series of simple and fast-to-compute features. These features have been analyzed on a challenging training dataset that contains different types of text in a huge variety of situations, and they have proved to follow a Gaussian distribution. This means that they can be used with any dataset, independently of text size, color…

Acknowledgments

This work has been financed with funds from the Ministerio de Economía y Competitividad through the project ADD-Gaze (TRA2011-29001-C04-01), as well as from the Comunidad de Madrid through the project Robocity2030 (CAM-S-0505/DPI000176).

References (39)

  • D. Karatzas et al., Colour text segmentation in web images based on human perception, Image Vision Comput. (2007)
  • H. Bay et al., Speeded-up robust features (SURF), Comput. Vis. Image Underst. (2008)
  • S. Escalera et al., Text detection in urban scenes
  • J. Matas et al., Robust wide baseline stereo from maximally stable extremal regions
  • J.-L. Yao, Y.-Q. Wang, L.-B. Weng, Y.-P. Yang, Locating text based on connected component and SVM, in: Wavelet Analysis...
  • L. Neumann et al., Real-time scene text localization and recognition
  • B. Epshtein et al., Detecting text in natural scenes with stroke width transform
  • H. Chen et al., Robust text detection in natural images with edge-enhanced maximally stable extremal regions
  • Y.-F. Pan et al., A hybrid approach to detect and localize texts in natural scene images, IEEE Trans. Image Process. (2011)
  • N. Dalal et al., Histograms of oriented gradients for human detection
  • J. Sochman et al., WaldBoost — learning for time constrained sequential detection
  • W. Niblack, An introduction to digital image processing (1986)
  • J.D. Lafferty et al., Conditional random fields: probabilistic models for segmenting and labeling sequence data
  • X. Chen et al., Detecting and reading text in natural scenes
  • S.M. Hanif et al., Text detection and localization in complex scene images using constrained AdaBoost algorithm
  • S. Belongie et al., Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
  • A.C. Berg et al., Shape matching and object recognition using low distortion correspondences
  • D.G. Lowe, Object recognition from local scale-invariant features
  • S. Lazebnik et al., A sparse texture representation using local affine regions, IEEE Trans. Pattern Anal. Mach. Intell. (2005)

    This paper has been recommended for acceptance by Enrique Dunn.
