Hybrid approach to efficient text extraction in complex color images
Introduction
In the area of content-based image indexing, much research has exploited semantic contents such as faces, human bodies, objects, events, and their relations. Because text within an image is highly descriptive of the image's contents and is easier to extract than other semantic contents, researchers have pursued text-based image indexing using various image processing techniques (Lienhart and Stuber, 1996; Kim, 1996; Li et al., 2000; Zhong et al., 2000; Jain and Yu, 1998; Kim et al., 2000).
In text-based image indexing, text localization is an important prerequisite stage for optical character recognition (OCR). It is also used in many other applications, such as page segmentation, address block location, and license plate location. Many approaches to text localization have therefore been proposed (Lienhart and Stuber, 1996; Kim, 1996; Li et al., 2000; Zhong et al., 1995; Zhong et al., 2000; Jain and Yu, 1998; Kim et al., 2000; Jeong et al., 1999; Yassin et al., 2000; Messelodi and Modena, 1999; Wu et al., 1999; Strouthopoulos and Papamarkos, 1998; Jung, 2001; Li and Doermann, 2000; Wernicle and Lienhart, 2000; Gargi et al., 1998). However, the problem is still considered very difficult because text varies in size, style, and orientation, and often appears over complex backgrounds.
There are two primary methods for text localization: connected component (CC)-based methods and texture-based methods. In general, the CC-based methods group small components into successively larger components and then analyze the geometrical arrangement of the components that belong to texts (Lienhart and Stuber, 1996; Jain and Yu, 1998). Lienhart and Stuber (1996) regarded text regions as CCs that are of the same or similar color and size, and then used motion information to enhance the text localization results in a video sequence. Jain and Yu (1998) segmented a video frame into sub-images of different colors and then checked whether they contain text components satisfying some predefined conditions.
The CC-based methods are very popular for text localization because they are simple to implement. However, they are not well suited to low-resolution, noisy video documents, since they depend on the effectiveness of the segmentation method, which must guarantee that each character is segmented into a few connected components separated from other objects and from the background.
Unlike the CC-based methods, the texture-based methods regard text regions as textured objects and apply Gabor filters, wavelets, the FFT, spatial variance, or other texture analyzers to exploit these textural properties (Li et al., 2000; Zhong et al., 1995; Zhong et al., 2000; Jeong et al., 1999; Wu et al., 1999; Strouthopoulos and Papamarkos, 1998; Jung, 2001; Li and Doermann, 2000; Jain and Karu, 1996). Li et al. (2000) used an artificial neural network classifier after wavelet-based feature extraction. Zhong et al. (2000) extracted text regions by analyzing texture properties directly in the DCT compressed domain. Zhong et al. (1995) computed local spatial variance in a gray-scale image and located text regions with high variance. Jain and Karu (1996) proposed an example-based learning technique for automatically generating a texture classifier, which was used to separate text, graphics, and halftone image regions in document images. Although very effective for text localization, the texture-based methods have some shortcomings: (1) the difficulty of manually designing a texture classifier for various text conditions, (2) the locality of the texture information, and (3) the expensive computation of the texture classification stage.
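As a concrete illustration of the variance-based idea in Zhong et al. (1995), local spatial variance can be computed in a sliding window and thresholded; the following is a generic sketch, not the cited implementation, and the window size and threshold are illustrative assumptions:

```python
import numpy as np

def local_variance_map(gray, win=7):
    """Local spatial variance over a win x win neighborhood (naive sliding window)."""
    h, w = gray.shape
    r = win // 2
    padded = np.pad(gray.astype(np.float64), r, mode="edge")
    var_map = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            var_map[y, x] = padded[y:y + win, x:x + win].var()
    return var_map

def variance_text_mask(gray, win=7, thresh=200.0):
    """Mark pixels whose local variance exceeds a threshold as candidate text."""
    return local_variance_map(gray, win) > thresh
```

High-contrast character strokes produce large local variance, so they survive the threshold, while flat background regions do not; this also hints at shortcoming (1) above, since the threshold must be tuned by hand.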
Section snippets
Outline of the proposed algorithm
As discussed in the previous section, the CC-based methods and the texture-based methods each have their own strengths and weaknesses. We propose to merge the two sequentially: multi-layer perceptrons (MLPs), which automatically generate a texture classifier discriminating text regions from non-text regions on three color bands, increase the recall rate, and a subsequent CC analysis increases the precision rate. Fig. 1 gives an overall structure
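The sequential combination described above can be sketched as a two-stage pipeline; both stage callables below are hypothetical stand-ins for the trained MLP texture classifier and the CC pruning step described in the following sections:

```python
def hybrid_text_localization(image, texture_classifier, cc_filter):
    """Two-stage hybrid sketch: texture stage for recall, CC stage for precision.

    `texture_classifier` maps the image to a binary candidate mask (e.g. a
    thresholded MLP probability map); `cc_filter` prunes connected components
    that violate geometric constraints. Both are assumed callables, not the
    paper's exact interfaces.
    """
    candidate_mask = texture_classifier(image)   # stage 1: high recall
    text_regions = cc_filter(candidate_mask)     # stage 2: high precision
    return text_regions
```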
Texture-based text localization
We use MLPs to build a texture classifier that discriminates between text pixels and non-text pixels. The MLPs scan the input image, receiving the color values of a given pixel and its neighbors within a small window for each of the red, green, and blue color bands. The MLPs' outputs are combined into a text probability image (TPI), in which each pixel's value lies in the range [0,1] and represents the probability that the corresponding input pixel is part of text. If a pixel has a larger value
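The per-band scanning and TPI construction can be sketched as follows. The per-band classifiers are stand-ins for the trained MLPs (any callable mapping a flattened patch to a score in [0, 1]); averaging the three band scores is one plausible combination rule, not necessarily the paper's:

```python
import numpy as np

def text_probability_image(rgb, band_classifiers, win=5):
    """Scan each color band with a classifier and combine scores into a TPI.

    `band_classifiers` is a list of three callables (stand-ins for the trained
    R, G, B MLPs); each maps a flattened win x win patch to a score in [0, 1].
    Band scores are averaged, so every TPI pixel stays in [0, 1].
    """
    h, w, _ = rgb.shape
    r = win // 2
    padded = np.pad(rgb.astype(np.float64) / 255.0,
                    ((r, r), (r, r), (0, 0)), mode="edge")
    tpi = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            scores = [clf(padded[y:y + win, x:x + win, b].ravel())
                      for b, clf in enumerate(band_classifiers)]
            tpi[y, x] = sum(scores) / len(scores)
    return tpi
```

Thresholding the TPI then yields the binary candidate mask passed on to the CC analysis stage.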
Connected component-based text localization
Although we use the bootstrap method to make the texture classification MLPs learn a precise boundary between the text and non-text classes, the MLPs' detection results include many false alarms because we want them to detect as many texts as possible. As shown in Fig. 4, texture-based text detection algorithms tend to raise false alarms on high-contrast or high-frequency regions and on regions whose textural properties are similar to those of characters. In Fig. 4(a), detected text
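A typical way to prune such false alarms is to label the connected components of the candidate mask and discard those whose geometry is implausible for characters. The sketch below uses generic size and aspect-ratio constraints; the thresholds are illustrative assumptions, not the paper's values:

```python
def connected_components(mask):
    """4-connected component labeling via iterative flood fill on a 2D 0/1 grid."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    comps = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and labels[sy][sx] == 0:
                comps.append([])
                labels[sy][sx] = len(comps)
                stack = [(sy, sx)]
                while stack:
                    y, x = stack.pop()
                    comps[-1].append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = len(comps)
                            stack.append((ny, nx))
    return comps

def filter_text_components(mask, min_area=4, max_aspect=8.0):
    """Keep only components whose area and bounding-box aspect ratio are
    plausible for characters (thresholds are illustrative, not the paper's)."""
    kept = []
    for comp in connected_components(mask):
        ys = [y for y, _ in comp]
        xs = [x for _, x in comp]
        bh = max(ys) - min(ys) + 1
        bw = max(xs) - min(xs) + 1
        if len(comp) >= min_area and max(bh, bw) / min(bh, bw) <= max_aspect:
            kept.append(comp)
    return kept
```

Isolated noise pixels fail the area test and long thin edges fail the aspect test, which is how this stage recovers precision after the deliberately permissive texture stage.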
Region marking
After the texture and CC analyses, we perform one of two region marking processes depending on the image type. Although much work has been done on text detection using texture and CC analysis, post-processing such as region marking and text extraction has rarely been addressed. For gray or color document images, we perform a recursive X–Y cut, assuming that skew correction has been done in advance. For video images, which usually have lower text presence rates than document images, we use
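For the document-image case, the recursive X–Y cut can be sketched as follows. This is a generic projection-profile implementation under the skew-corrected assumption stated above, not the paper's exact procedure; `min_gap` (the smallest white gap worth cutting at) is an illustrative parameter:

```python
def _zero_runs(profile, min_gap):
    """Maximal runs of zeros in a projection profile with length >= min_gap."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_gap:
                runs.append((start, i))
            start = None
    if start is not None and len(profile) - start >= min_gap:
        runs.append((start, len(profile)))
    return runs

def xy_cut(mask, y0, y1, x0, x1, min_gap=2, out=None):
    """Recursive X-Y cut of the binary mask inside the box (y0:y1, x0:x1).

    The box is split at interior zero-runs (white gaps) of its row projection,
    then of its column projection; boxes with no cuttable gap become leaves
    and are returned as marked regions (y0, y1, x0, x1).
    """
    if out is None:
        out = []
    for axis in (0, 1):
        if axis == 0:
            prof = [sum(mask[y][x0:x1]) for y in range(y0, y1)]
        else:
            prof = [sum(mask[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
        gaps = [g for g in _zero_runs(prof, min_gap) if g[0] > 0 and g[1] < len(prof)]
        if gaps:
            cuts = [0] + [p for g in gaps for p in g] + [len(prof)]
            for s, e in zip(cuts[::2], cuts[1::2]):
                if axis == 0:
                    xy_cut(mask, y0 + s, y0 + e, x0, x1, min_gap, out)
                else:
                    xy_cut(mask, y0, y1, x0 + s, x0 + e, min_gap, out)
            return out
    out.append((y0, y1, x0, x1))
    return out
```

Alternating row and column cuts is what makes the procedure recursive: each split region is re-projected until no gap wide enough to cut at remains.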
Database
The proposed text localization method has been tested with several types of images: captured broadcast news, scanned images, web images, and video clips downloaded from the web site of Movie Content Analysis (MoCA) Project (Lienhart and Stuber, 1996). Table 1 summarizes the image database. Fifty video frames randomly selected from the database are used in the initial MLP training and bootstrap processes.
Quantitative evaluation of text localization is an open issue. In the previous works such
Conclusions
This paper presents an efficient text localization technique based on the integration of texture-based and CC-based methods. Texts in various conditions can be detected automatically using MLPs without any explicit feature extraction stage. However, the main drawback of such texture-based methods lies in their locality, i.e., nothing outside the specified window is considered at all. This observation has motivated us to use a hybrid approach of the texture-based
Acknowledgements
This work was supported by the Soongsil University Research Fund.
References

- Antani, S., Gargi, U., Crandall, D., Gandhi, T., Kasturi, R., 1999. Extraction of text in video. Technical Report, ...
- Bradski, G.R., 1998. Computer vision face tracking for use in a perceptual user interface. Intel Technol. J.
- Bradski, G.R., Pisarevsky, V., 2000. Intel's computer vision library: Application in calibration, stereo, segmentation, ...
- Cheng, Y., 1995. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Machine Intell.
- Jain, A.K., Yu, B., 1998. Automatic text location in images and video frames. Pattern Recognition.
- Jung, K., 2001. Neural network-based text location in color images. Pattern Recognition Lett.
- Kim, 1996. Efficient automatic text location method and content-based indexing and structuring of video database. J. Visual Comm. Image Representat.
- Messelodi, S., Modena, C.M., 1999. Automatic identification and skew estimation of text lines in real scene images. Pattern Recognition.
- Strouthopoulos, C., Papamarkos, N., 1998. Text identification for document image analysis using a neural network. Image Vision Comput.
- Zhong, Y., Karu, K., Jain, A.K., 1995. Locating text in complex color images. Pattern Recognition.