Elsevier

Pattern Recognition Letters

Volume 25, Issue 6, 19 April 2004, Pages 679-699

Hybrid approach to efficient text extraction in complex color images

https://doi.org/10.1016/j.patrec.2004.01.017

Abstract

Texture-based methods and connected component (CC) methods have been widely used for text localization. However, these two primary methods have their own strengths and weaknesses. This paper proposes a hybrid approach that combines the two methods for text localization in complex images. An automatically constructed MLP-based texture classifier increases the recall rate for complex images with much less user intervention and no explicit feature extraction. CC-based filtering based on geometry and shape information enhances the precision rate without degrading overall performance. Finally, time-consuming texture analysis of less relevant pixels is avoided by using CAMShift. Our experiments show that the proposed hybrid approach leads to text localization that is not only robust but also efficient.

Introduction

In the area of content-based image indexing, many research results have been reported that use semantic contents such as faces, human bodies, objects, events, and their relations. Because texts within an image are very useful for describing its contents and can be extracted more easily than other semantic contents, researchers have pursued text-based image indexing using various image processing techniques (Lienhart and Stuber, 1996; Kim, 1996; Li et al., 2000; Zhong et al., 2000; Jain and Yu, 1998; Kim et al., 2000).

In text-based image indexing, text localization is important as a prerequisite stage for optical character recognition (OCR). It can also be used in many other applications, such as page segmentation, address block location, and license plate location. Therefore, many approaches to text localization have been proposed (Lienhart and Stuber, 1996; Kim, 1996; Li et al., 2000; Zhong et al., 1995, Zhong et al., 2000; Jain and Yu, 1998; Kim et al., 2000; Jeong et al., 1999; Yassin et al., 2000; Messelodi and Modena, 1999; Wu et al., 1999; Strouthopoulos and Papamarkos, 1998; Jung, 2001; Li and Doermann, 2000; Wernicle and Lienhart, 2000; Gargi et al., 1998). However, it is still considered a very difficult problem because of variations in text size, style, and orientation as well as the complex backgrounds of images.

There are two primary methods for text localization: connected component (CC)-based methods and texture-based methods. In general, the CC-based methods group small components into successively larger components and then analyze the geometrical arrangement of the components that belong to texts (Lienhart and Stuber, 1996; Jain and Yu, 1998). Lienhart and Stuber (1996) regarded text regions as CCs that are of the same or similar color and size, and then used motion information to enhance the text localization results in a video sequence. Jain and Yu (1998) segmented a video frame into sub-images of different colors and then checked whether they contain text components satisfying some predefined conditions.

The CC-based methods are very popular for text localization thanks to their simplicity in implementation. However, they are not appropriate for low-resolution and noisy video documents because they depend on the effectiveness of the segmentation method, which should guarantee that a character is segmented into a few connected components separated from other objects and the background.
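As an illustration of the grouping step that CC-based methods rely on, the sketch below labels the 4-connected foreground regions of a binary mask (e.g., the pixels of one quantized color) with a breadth-first search. The function name and the choice of 4-connectivity are ours, not taken from the cited works.

```python
from collections import deque
import numpy as np

def connected_components(mask):
    """Return the 4-connected foreground regions of a binary mask,
    each as a list of (row, col) pixel coordinates."""
    h, w = mask.shape
    seen = np.zeros((h, w), bool)
    comps = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                # breadth-first flood fill from an unvisited foreground pixel
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                comp = []
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                comps.append(comp)
    return comps
```

A real pipeline would run this once per color layer and then pass the components' bounding boxes on to the geometric analysis stage.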

Unlike the CC-based methods, the texture-based methods regard text regions as textured objects and apply Gabor filters, wavelets, the FFT, spatial variance, or other texture analyzers to exploit the textural properties (Li et al., 2000; Zhong et al., 1995, Zhong et al., 2000; Jeong et al., 1999; Wu et al., 1999; Strouthopoulos and Papamarkos, 1998; Jung, 2001; Li and Doermann, 2000; Jain and Karu, 1996). Li et al. (2000) used an artificial neural network classifier after feature extraction based on wavelet decomposition. Zhong et al. (2000) extracted text regions by analyzing texture properties directly in the DCT compressed domain. Zhong et al. (1995) computed local spatial variations in a gray-scale image and located text regions with high variance. Jain and Karu (1996) proposed an example-based learning technique for the automatic generation of a texture classifier. The technique was used to separate text, graphics, and halftone image regions in document images. Although very effective for text localization, the texture-based methods have some shortcomings: (1) difficulties in manually designing a texture classifier for various text conditions, (2) locality of the texture information, and (3) expensive computation in the texture classification stage.
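As a concrete example of the simplest texture analyzer mentioned above, the sketch below computes the local spatial variance in the spirit of Zhong et al. (1995) and flags high-variance pixels as candidate text. The window size and threshold are illustrative placeholders, not values from that paper.

```python
import numpy as np

def local_variance_map(gray, win=7, thresh=300.0):
    """Flag pixels whose win x win neighborhood has high gray-level
    variance as candidate text (border pixels are left unflagged)."""
    h, w = gray.shape
    r = win // 2
    var_map = np.zeros((h, w))
    for y in range(r, h - r):
        for x in range(r, w - r):
            # variance of the gray values inside the local window
            var_map[y, x] = gray[y - r:y + r + 1, x - r:x + r + 1].var()
    return var_map > thresh
```

The brute-force double loop makes the cost of texture classification visible: every pixel pays for a full window scan, which is exactly the expense the hybrid approach tries to avoid for less relevant pixels.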


Outline of the proposed algorithm

As discussed in the previous section, the CC-based methods and the texture-based methods each have their own strengths and weaknesses. We propose to merge the two primary methods sequentially. We can increase the recall rate with multi-layer perceptrons (MLPs), which automatically generate a texture classifier that discriminates between text regions and non-text regions on three color bands. We can also increase the precision rate with CC analysis. Fig. 1 gives an overall structure

Texture-based text localization

We use MLPs to make a texture classifier that discriminates between text pixels and non-text ones. An input image is scanned by the MLPs, which receive the color values of a given pixel and its neighbors within a small window for each Red, Green, and Blue color band. The MLPs' outputs are combined into a text probability image (TPI), where each pixel's value is in the range [0,1] and represents the probability that the corresponding input pixel is a part of text. If a pixel has a larger value
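The scanning scheme described above can be sketched as follows. The classifier here is a toy stand-in for the trained MLPs (its "weights" are hypothetical, not learned); only the sliding-window mechanics and the [0, 1] text probability image (TPI) output mirror the text.

```python
import numpy as np

def text_probability_image(rgb, classify, win=5):
    """Scan every pixel: feed the win x win R,G,B neighborhood to a
    classifier and record its text probability, yielding the TPI."""
    h, w, _ = rgb.shape
    r = win // 2
    # replicate edge pixels so border windows are well defined
    padded = np.pad(rgb, ((r, r), (r, r), (0, 0)), mode='edge')
    tpi = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            window = padded[y:y + win, x:x + win, :].reshape(-1) / 255.0
            tpi[y, x] = classify(window)
    return tpi

def toy_classifier(v):
    """Toy stand-in for the trained MLPs: a logistic of the window's
    intensity spread (the coefficients are hypothetical, not learned)."""
    z = 4.0 * (v.std() - 0.2)
    return 1.0 / (1.0 + np.exp(-z))
```

In the actual system, one trained MLP per color band would replace `toy_classifier`, and their outputs would be combined into a single TPI.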

Connected component-based text localization

Although we use the bootstrap method to make the texture classification MLPs learn the precise boundary between the text class and the non-text class, the detection result from the MLPs includes many false alarms because we want the MLPs to detect as many texts as possible. As shown in Fig. 4, texture-based text detection algorithms tend to give false alarms for high-contrast or high-frequency regions and for regions whose textural properties are similar to those of characters. In Fig. 4(a), detected text
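A minimal sketch of the kind of geometry-based CC filtering described here might look as follows; the thresholds are illustrative placeholders, not the values used in the paper.

```python
def filter_text_components(boxes, img_h, img_w):
    """Discard components whose bounding-box geometry is implausible
    for characters or words. boxes are (x, y, w, h) tuples."""
    kept = []
    for (x, y, w, h) in boxes:
        # too small, or spanning nearly the whole frame
        if h < 4 or h > 0.8 * img_h or w > 0.95 * img_w:
            continue
        # extreme aspect ratios rarely correspond to text
        if not (0.1 <= w / h <= 10.0):
            continue
        kept.append((x, y, w, h))
    return kept
```

Filtering of this kind raises precision cheaply because it only inspects component geometry, leaving the recall gained by the permissive MLP stage largely intact.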

Region marking

After the texture and CC analyses, we perform two different region marking processes depending on the image type. Although much work has been done on text detection using texture and CC analysis, post-processing such as region marking and text extraction has rarely been addressed. For gray or color document images, we perform an XY recursive cut based on the assumption that skew correction has been done in advance. For video images, which usually have lower text presence rates than document images, we use
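The XY recursive cut mentioned above can be sketched as follows, assuming a deskewed binary text mask: blocks are split recursively at blank rows or columns of the projection profile, and the minimum gap width is an illustrative parameter.

```python
import numpy as np

def _find_gap(proj, min_gap):
    """Return (start, end) of the first empty run of length >= min_gap
    in a boolean projection profile (profile is assumed margin-trimmed)."""
    run = 0
    for i, filled in enumerate(proj):
        if filled:
            if run >= min_gap:
                return i - run, i
            run = 0
        else:
            run += 1
    return None

def xy_cut(mask, x0=0, y0=0, min_gap=2):
    """Recursively split a binary text mask at blank rows/columns and
    return leaf block bounding boxes as (x, y, w, h)."""
    rows = mask.any(axis=1)
    if not rows.any():
        return []
    cols = mask.any(axis=0)
    # trim empty margins before looking for interior gaps
    r0, r1 = np.argmax(rows), len(rows) - np.argmax(rows[::-1])
    c0, c1 = np.argmax(cols), len(cols) - np.argmax(cols[::-1])
    sub = mask[r0:r1, c0:c1]
    # try a horizontal cut (blank rows) first, then a vertical cut
    gap = _find_gap(sub.any(axis=1), min_gap)
    if gap is not None:
        g0, g1 = gap
        return (xy_cut(sub[:g0], x0 + c0, y0 + r0, min_gap) +
                xy_cut(sub[g1:], x0 + c0, y0 + r0 + g1, min_gap))
    gap = _find_gap(sub.any(axis=0), min_gap)
    if gap is not None:
        g0, g1 = gap
        return (xy_cut(sub[:, :g0], x0 + c0, y0 + r0, min_gap) +
                xy_cut(sub[:, g1:], x0 + c0 + g1, y0 + r0, min_gap))
    return [(x0 + c0, y0 + r0, c1 - c0, r1 - r0)]
```

Because the recursion only alternates between row and column projections, its correctness depends on the deskewing assumption stated in the text: skewed text lines leave no clean blank rows to cut on.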

Database

The proposed text localization method has been tested with several types of images: captured broadcast news, scanned images, web images, and video clips downloaded from the web site of Movie Content Analysis (MoCA) Project (Lienhart and Stuber, 1996). Table 1 summarizes the image database. Fifty video frames randomly selected from the database are used in the initial MLP training and bootstrap processes.

Quantitative evaluation of text localization is an open issue. In the previous works such

Conclusions

This paper presents an efficient text localization technique based on the integration of texture-based and CC-based methods. Detection of texts under various conditions can be performed automatically using MLPs without any explicit feature extraction stage. However, the main drawback of such texture-based methods lies in their locality property, i.e., the outside of the specified window is not considered at all. This observation has motivated us to use a hybrid approach of the texture-based

Acknowledgements

This work was supported by the Soongsil University Research Fund.

References (22)

  • Gargi, U., Antani, S., Kasturi, R., 1998. Indexing text events in digital video database. In: Int. Conf. on Pattern...