A knowledge-based system for extracting text-lines from mixed and overlapping text/graphics compound document images

https://doi.org/10.1016/j.eswa.2011.07.040

Abstract

This paper presents a new knowledge-based system for extracting and identifying text-lines from various real-life mixed text/graphics compound document images. The proposed system first decomposes the document image into distinct object planes to separate homogeneous objects, including textual regions of interest, non-text objects such as graphics and pictures, and background textures. A knowledge-based text extraction and identification method obtains the text-lines with different characteristics in each plane. The proposed system offers high flexibility and expandability: new rules can simply be added to cope with various types of real-life complex document images. Experimental and comparative results demonstrate the effectiveness of the proposed knowledge-based system and its advantages in extracting text-lines with a large variety of illumination levels, sizes, and font styles from various types of mixed and overlapping text/graphics complex compound document images.

Highlights

► We propose a knowledge-based system for extracting and identifying text-lines from compound document images.
► The document image is decomposed into distinct object planes to separate homogeneous objects.
► A knowledge-based text extraction and identification method obtains the text-lines with different characteristics in each plane.
► The proposed system offers high flexibility and expandability by adding new rules to cope with various complex document images.

Introduction

Despite the recent adoption of electronic documents and books, paper-based published documents and books remain widespread. Because paper-based publications are less convenient than electronic publications for archiving, modification, and retrieval, practical applications of document image analysis for paper-based documents and books have recently attracted considerable attention. Examples of these applications include text information extraction and analysis, optical character recognition, document retrieval, compression, and archiving (Doermann, 1998, O'Gorman and Kasturi, 1995). Among these, textual information extraction is the most essential task in document image analysis. Researchers have therefore presented several studies on textual information extraction and analysis from monochromatic document images (Fisher et al., 1990, Fletcher and Kasturi, 1988, Lee et al., 2000, Niyogi and Srihari, 1996, Shih et al., 1992). Most of these methods rely on prior publication-specific knowledge of printed text-lines on monochromatic document images with regular typesetting and layouts. Recent advances in multimedia publishing and mixed text/graphics printing technology have produced an increasing number of real-life paper publications that print stylistically varied text-lines together with graphical, pictorial, and other non-text decorative objects, often on colorful, textured backgrounds. Conventional text extraction methods do not perform well when extracting text-lines from such real-life mixed and overlapping text/graphics compound document images. Extracting text-lines from mixed text/graphics compound document images is much more complicated than extracting them from monochromatic document images, because text-lines in these images are often printed in various colors or illumination levels and superimposed on graphics, pictures, or other textured backgrounds. Therefore, a system that can efficiently locate and extract text-lines printed in the pictorial and textured regions of complex compound documents remains an open research topic in document image analysis.

Researchers have developed various methods of extracting text regions from mixed text/graphics compound document images. Some of these methods are based on the fact that most textual regions show distinctive texture features that differ from those of non-text background regions (Hasan and Karam, 2000, Jain and Bhattacharjee, 1992, Wu et al., 1999, Yuan and Tan, 2001). Such methods adopt texture detection filters to extract the texture features of possible text regions, and use these features to extract text from document images. Jain and Bhattacharjee's (1992) method extracts the texture features of text regions by applying Gabor filters, and segments the text regions of interest based on these texture features. However, the limitation of this method is that its text-line extraction performance is sensitive to variations in font size and style. Wu et al.'s (1999) Textfinder system uses nine second-order Gaussian derivative filters to obtain texture feature vectors for each pixel at three different scales, and applies K-means clustering to these texture feature vectors to classify the corresponding pixels into text regions. Hasan and Karam (2000) introduced a morphological texture extraction scheme that recursively applies morphological dilation and erosion operations to the extracted closure edge textures to locate text regions. Texture information is useful for detecting the presence of textual objects in a specific region, and texture-based extraction methods can identify most textual regions in mixed text/graphics document images. However, most of these methods fail to provide consistent accuracy in locating text-lines, which in turn reduces the performance of subsequent document analysis processes. Moreover, texture feature extraction is very time-consuming for practical document image processing applications. When textual objects border or overlap graphical objects, non-text texture patterns, or backgrounds with similar texture features, these non-text objects may be identified as textual objects and smear the text-lines in the extracted regions.
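To make the texture-based idea concrete, the following is a minimal sketch, in the spirit of the Gabor-filter and K-means approaches cited above but not taken from any of them: per-pixel Gabor texture energy at several scales and orientations is clustered so that candidate text regions separate from background. The frequencies, orientations, and cluster count are illustrative assumptions.

```python
# Illustrative sketch of texture-based text detection in the spirit of
# Jain and Bhattacharjee (1992) and Wu et al. (1999); all parameter values
# (frequencies, orientations, cluster count) are assumptions, not the
# settings used in the cited papers.
import numpy as np
from skimage import color, io
from skimage.filters import gabor
from sklearn.cluster import KMeans

def texture_cluster_map(image_rgb, frequencies=(0.1, 0.2, 0.4), n_clusters=3):
    """Cluster pixels by multi-scale Gabor energy; text tends to form one cluster."""
    gray = color.rgb2gray(image_rgb)
    features = []
    for f in frequencies:                                         # three scales
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):    # four orientations
            real, imag = gabor(gray, frequency=f, theta=theta)
            features.append(np.hypot(real, imag))                 # local texture energy
    x = np.stack(features, axis=-1).reshape(-1, len(features))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(x)
    return labels.reshape(gray.shape)

# labels = texture_cluster_map(io.imread("page.png"))  # inspect clusters for text regions
```

As the paragraph above notes, such per-pixel filtering is expensive, which is one reason texture-only approaches struggle in practical document processing pipelines.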

Researchers have recently proposed several color-segmentation-based methods for text extraction from color document images. Jain and Yu (1998) used bit-dropping quantization and a single-link color-clustering algorithm to decompose a color document into a set of foreground images in the RGB color space. Strouthopoulos et al.'s (2002) adaptive color reduction technique utilizes an unsupervised neural network classifier and a tree-search procedure to determine prototype colors. Other methods determine prototype colors in alternative color spaces to find textual objects of interest. Yang and Ozawa (1999) used the HSI color space to segment homogeneous color regions to extract bibliographic information from book covers, while Sobottka, Kronenberg, Perroud, and Bunke (2000) presented a hybrid method combining top-down and bottom-up analysis techniques to extract text-lines from color journal and book covers. Hase, Shinokawa, Yoneda, and Suen (2001) applied a histogram-based approach to select prototype colors in the CIELab color space, and adopted a multi-stage relaxation approach to label and classify extracted homogeneous connected components into character strings. However, most of these methods have difficulty extracting text-lines that are embedded in complex backgrounds or touch other pictorial and graphical objects. This is because the prototype colors are determined from a global view, which makes it difficult to select prototype colors that differentiate textual objects from nearby pictorial objects and complex backgrounds without sufficient contrast.
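As a rough illustration of the colour-quantization idea behind Jain and Yu's (1998) approach, the sketch below drops the low-order bits of each RGB channel so that similar colours collapse into a small set of prototype bins. The number of retained bits is an assumption, and the single-link clustering and foreground-image construction of the original method are omitted.

```python
# Minimal sketch of bit-dropping colour quantization; keep_bits is an
# illustrative choice, and this is not the full decomposition of
# Jain and Yu (1998).
import numpy as np

def bit_drop_quantize(image_rgb: np.ndarray, keep_bits: int = 2):
    """Quantize an HxWx3 uint8 image by keeping only the top keep_bits per channel."""
    drop = 8 - keep_bits
    quantized = (image_rgb >> drop) << drop                 # zero the low-order bits
    # Pack the reduced channels into one prototype-colour index per pixel.
    r, g, b = (quantized[..., c].astype(np.int32) >> drop for c in range(3))
    index = (r << (2 * keep_bits)) | (g << keep_bits) | b
    return quantized, index

# Each distinct index value corresponds to one candidate foreground layer on
# which connected-component analysis can then search for text-lines.
```

The weakness noted above follows directly from this kind of global quantization: text whose colour is close to an adjacent picture or background falls into the same prototype bin and cannot be separated.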

Moreover, few of the methods above can cope with the various types of real-life complex compound document images. The extensibility and flexibility that a knowledge-based system provides are well suited to a large variety of practical applications. As a result, many researchers have recently applied knowledge-based systems to image processing (Avci and Avci, 2009, Cho et al., 2009, Cucchiara et al., 2000, Kang and Bae, 1997, Lee et al., 2000, Levine and Nazif, 1985, Niyogi and Srihari, 1996, Subasic et al., 2009). Levine and Nazif (1985) proposed an efficient three-level knowledge-based model for low-level scene image segmentation. For thresholding-based object segmentation in images, Kang and Bae (1997) developed an adaptive image thresholding method that integrates fuzzy inference with the logical level technique to extract character objects with linearity features. Avci and Avci (2009) proposed a fuzzy-entropy-based expert system that selects an optimal threshold for segmenting foreground objects. Subasic et al. (2009) presented an expert-system-based face segmentation system that integrates a low-level image segmentation module and a multi-stage rule-based labeling system. Cucchiara et al. (2000) and Cho et al. (2009) successfully applied knowledge-based systems to real-time video-based traffic monitoring applications. In previous studies on document image analysis, Fisher et al. (1990), Niyogi and Srihari (1996), and Lee et al. (2000) applied the concepts of knowledge-based systems to the structural and geometric analysis of typical document images such as newspaper and journal images. However, their methods are only applicable to publication-specific monochromatic documents with regular and ordered layouts, and cannot easily process the various types of mixed and overlapping text/graphics compound document images.

Levine and Nazif (1985), Niyogi and Srihari (1996), and Lee et al. (2000) applied a three-level rule-based model, consisting of knowledge, control, and strategy rules, to low-level scene image segmentation and to the structural analysis of newspaper and journal images. This rule-based reasoning model provides a feasible framework for applications in the image analysis domain, and offers high flexibility for further improving and extending a system by updating the knowledge rules in its inference mechanism. The knowledge-based system proposed in this study adopts this three-level rule-based reasoning model for text-line extraction and identification in real-life complex compound document images. Knowledge rules encode the geometric and statistical features of text-lines, such as colors, illumination levels, sizes, and font styles, and form two rule sets: text region extraction rules and text-line identification rules.
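The sketch below shows, with hypothetical feature names and thresholds, how knowledge rules over geometric and statistical text-line features can sit under control and strategy levels. It illustrates the three-level structure only; it is not the rule base of the cited systems or of this paper.

```python
# Illustrative three-level rule organisation (knowledge / control / strategy).
# Feature names ("height", "baseline_variance") and thresholds are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]     # tests a candidate's features
    action: Callable[[Dict], None]        # records evidence on the candidate

# Knowledge level: rules encoding text-line features such as size and alignment.
knowledge_rules: List[Rule] = [
    Rule("plausible_height",
         lambda c: 6 <= c["height"] <= 120,
         lambda c: c.update(votes=c.get("votes", 0) + 1)),
    Rule("horizontal_alignment",
         lambda c: c["baseline_variance"] < 2.5,
         lambda c: c.update(votes=c.get("votes", 0) + 1)),
]

def control_pass(candidates: List[Dict], rules: List[Rule]) -> None:
    """Control level: decides which knowledge rules fire on which candidates."""
    for cand in candidates:
        for rule in rules:
            if rule.condition(cand):
                rule.action(cand)

def strategy(planes: List[List[Dict]], min_votes: int = 2) -> List[Dict]:
    """Strategy level: schedules the rule passes over every object plane."""
    accepted: List[Dict] = []
    for candidates in planes:
        control_pass(candidates, knowledge_rules)
        accepted += [c for c in candidates if c.get("votes", 0) >= min_votes]
    return accepted
```

Keeping the knowledge rules in a plain list is what gives this organisation its flexibility: new rules for new document types can be appended without touching the control or strategy code.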

This study proposes a novel knowledge-based system for extracting text-lines from various types of mixed and overlapping text/graphics compound document images that contain text-lines with different illumination levels, sizes, and font styles. These text-lines can be superimposed on various background objects with uneven, gradational, and sharp variations in contrast, illumination, and texture, such as figures, photographs, pictures, or other background textures. The system first applies the multi-plane segmentation technique to decompose the document image into distinct object planes, extracting and separating homogeneous objects including textual regions of interest, non-text objects such as graphics and pictures, and background textures (Chen & Wu, 2009). Because this multi-plane segmentation technique processes document images regionally and adaptively based on their local features, the proposed method can easily handle text-lines that border or overlap pictorial objects and backgrounds with uneven, gradational, and sharp variations in contrast, illumination, and texture. The system then applies a knowledge-based text extraction and identification procedure to the resulting planes to detect, extract, and identify text-lines with various characteristics in each plane. This procedure consists of two processing phases: text region extraction and text-line identification. Knowledge rules encode the geometric and statistical features of text-lines, such as different illumination levels, sizes, and font styles, and form two rule sets, text region extraction rules and text-line identification rules, according to the phase in which they are applied. To drive text-line extraction and identification on real-life complex compound document images, the inference engine of the proposed system is built on hierarchically structured control and strategy rules. The proposed system offers high flexibility and expandability: new rules can simply be added to cope with new types of real-life complex document images. Experimental results demonstrate that the proposed knowledge-based approach can accurately extract text-lines with different illumination levels, sizes, and font styles from various complex compound document images.
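Read as a pipeline, the processing flow described above can be summarised by the skeleton below. The stage functions are passed in as parameters because their internals (the multi-plane segmentation of Chen & Wu (2009) and the two rule sets) are not reproduced here; the names and signatures are assumptions for illustration only.

```python
# Skeleton of the overall flow only; segment_planes, extract_regions, and
# identify_lines are hypothetical stand-ins for the actual stages.
from typing import Callable, List, Sequence
import numpy as np

def extract_text_lines(
    document_image: np.ndarray,
    segment_planes: Callable[[np.ndarray], Sequence[np.ndarray]],  # multi-plane segmentation
    extract_regions: Callable[[np.ndarray], List[dict]],           # phase 1: text region extraction rules
    identify_lines: Callable[[List[dict]], List[dict]],            # phase 2: text-line identification rules
) -> List[dict]:
    """Chain the stages: plane decomposition, then the two rule-driven phases per plane."""
    text_lines: List[dict] = []
    for plane in segment_planes(document_image):
        regions = extract_regions(plane)                # candidate text regions in this plane
        text_lines.extend(identify_lines(regions))      # confirmed text-lines from this plane
    return text_lines
```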

Section snippets

Multi-plane segmentation approach

Complex compound document images often contain text-lines with different illumination levels, sizes, and font styles, printed on varying or inhomogeneous background objects with uneven, gradational, and sharp variations in contrast, illumination, and texture. Examples of these backgrounds include illustrations, photographs, pictures, or other background patterns. A critical problem in text extraction is that no global segmentation technique works well for such document images.
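To illustrate why regional, locally adaptive processing helps where a single global threshold fails, the sketch below assigns pixels to a small number of planes by clustering grey levels block by block. This is only an illustration of local adaptivity; it is not the multi-plane segmentation technique of Chen and Wu (2009), and the block size and plane count are assumptions.

```python
# Illustration of block-wise adaptive decomposition; block size and number
# of planes are arbitrary choices, and this is not Chen & Wu's (2009) method.
import numpy as np
from sklearn.cluster import KMeans

def block_adaptive_planes(gray: np.ndarray, block: int = 64, n_planes: int = 3) -> np.ndarray:
    """Assign every pixel a plane index via per-block K-means on grey level."""
    labels = np.zeros(gray.shape, dtype=np.int32)
    h, w = gray.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = gray[y:y + block, x:x + block]
            k = min(n_planes, np.unique(tile).size)          # never more clusters than values
            if k < 2:                                        # flat tile: leave as plane 0
                continue
            km = KMeans(n_clusters=k, n_init=5, random_state=0)
            tile_labels = km.fit_predict(tile.reshape(-1, 1).astype(np.float64))
            order = np.argsort(km.cluster_centers_.ravel())  # darkest cluster gets index 0
            remap = np.empty(k, dtype=np.int32)
            remap[order] = np.arange(k)
            labels[y:y + block, x:x + block] = remap[tile_labels].reshape(tile.shape)
    return labels
```

Because each block is thresholded against its own local statistics, text that is darker than its immediate surroundings can be separated even when its absolute grey level matches the background of another part of the page, which is exactly the situation a global threshold cannot handle.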

Knowledge-based text-line extraction and identification

After the multi-plane segmentation process, the entire image is decomposed into various object planes. Each object plane may contain various objects of interest, such as textual regions, graphical and pictorial objects, background textures, or other objects. Here, each individual object plane Pq is binarized by setting its object pixels to black and all other non-object pixels to white. This creates a “binarized plane,” denoted as BPq, for each plane Pq. Performing a text-line
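For concreteness, the binarisation of a plane Pq into BPq can be written as below, assuming each plane is available as a boolean mask of its object pixels; the mask representation is an assumption of this sketch, not something the snippet above specifies.

```python
# Minimal sketch of forming the binarized plane BP_q from an object-pixel
# mask of plane P_q; the boolean-mask input is an assumption of this sketch.
import numpy as np

def binarize_plane(object_mask: np.ndarray) -> np.ndarray:
    """Object pixels -> black (0), all other pixels -> white (255)."""
    bp_q = np.full(object_mask.shape, 255, dtype=np.uint8)   # white background
    bp_q[object_mask] = 0                                     # black object pixels
    return bp_q
```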

Experimental results

This section evaluates the performance of the proposed knowledge-based text-line extraction technique and compares it to Jain and Yu’s color-based method (Jain & Yu, 1998). The document image database used in this study consists of 50 real-life complex mixed and overlapping text/graphics compound document images. These images contain text-lines printed in various colors or illumination levels, font styles, and sizes, including sparse and dense textual regions, adjoining or overlapping pictorial,

Conclusions

This study presents a new knowledge-based system for extracting text-lines from various types of mixed and overlapping text/graphics complex compound document images. Text-lines in such complex compound document images may appear in different colors, illumination levels, sizes, or font styles, and may be printed on and overlap various background objects with uneven, gradational, and sharp variations in contrast, illumination, and texture. Examples of these background objects include figures,

Acknowledgements

This work was supported by the National Science Council of R.O.C. under Contract Nos. NSC-99-2221-E-027-100, NSC-99-2221-E-468-022, and NSC-100-2219-E-027-006.

References (26)

  • K. Suzuki et al. (2003). Linear-time connected-component labeling based on sequential local operations. Computer Vision and Image Understanding.
  • R. Cucchiara et al. (2000). Image analysis and rule-based reasoning for a traffic monitoring system. IEEE Transactions on Intelligent Transportation Systems.
  • Fisher, J. L., Hinds, S. C., & D’Amato, D. P. (1990). Rule-based system for document image segmentation. In:...