Elsevier

Applied Soft Computing

Volume 30, May 2015, Pages 705-721

Rough-fuzzy clustering and multiresolution image analysis for text-graphics segmentation

https://doi.org/10.1016/j.asoc.2015.01.049

Highlights

  • A new method is proposed for text-graphics segmentation.

  • An M-band wavelet packet is used to extract scale-space features from the document image.

  • An unsupervised feature selection method is proposed to select relevant and non-redundant features.

  • Rough-fuzzy clustering is used to address the uncertainty problem of document segmentation.

  • The approach is invariant under font size of text, scanning resolution, and type of layout.

Abstract

This paper presents a segmentation method, integrating judiciously the merits of rough-fuzzy computing and multiresolution image analysis, for documents having both text and graphics regions. It assumes that the text and non-text (graphics) regions of a given document have different textural properties. The M-band wavelet packet analysis and rough-fuzzy-possibilistic c-means are used for the text-graphics segmentation problem. The M-band wavelet packet is used to extract the scale-space features; it offers a huge range of possible scale-space features for a document image and is able to zoom into narrow band high frequency components. A scale-space feature vector is thus derived, taken at different scales for each pixel in an image. However, the decomposition scheme employing the M-band wavelet packet leads to a large number of redundant features. In this regard, an unsupervised feature selection method is introduced to select a set of relevant and non-redundant features for the text-graphics segmentation problem. Finally, the rough-fuzzy-possibilistic c-means algorithm is used to address the uncertainty problem of document segmentation. The whole approach is invariant under the font size, line orientation, and script of the text. The performance of the proposed technique, along with a comparison with related approaches, is demonstrated on a set of real life document images.

Introduction

With the advances in information technology, automated processing of documents has become an imperative need. As the world moves closer to the concept of the paperless office, more and more communication and storage of documents are performed digitally. In this background, there is a great demand for software that automatically extracts, analyzes, and stores information from physical documents for later retrieval. However, documents in digitized form require a large amount of storage space, even after being compressed using advanced techniques. Text-graphics segmentation partitions a document image into distinct regions corresponding to the text and non-text parts. In effect, it facilitates efficient searching and storage of the text parts of the documents.

Many techniques have been proposed to segment a document image into text and non-text regions [1], [2]. The most popular among them are top-down and bottom-up approaches. The top-down techniques use the difference in contrast between the foreground and background to split the document into columns, paragraphs, text lines, and possibly words. Projection profiles [3], [4] are popular top-down approaches that work by identifying the white spaces through vertical and horizontal projections. On the other hand, bottom-up methods, which are similarity based document segmentation approaches, tend to cluster pixels with similar intensities to obtain higher level descriptions. The run length smoothing algorithm [5], [6] is an example of a bottom-up approach, which applies region growing to detect text regions. The Docstrum proposed in [7] is another bottom-up method, which groups connected components of the same type using nearest neighbor information, starting from the pixel level to obtain higher level descriptions of the document such as words, text lines, paragraphs, and so on. Each of these methods assumes rectangular blocks of text and graphics, and is sensitive to different textural properties such as font size, text line orientation, and inter-character spacing. Hence, these methods are not effective when no a priori knowledge about the content and attributes of the document image is available.
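To make the projection-profile idea concrete, here is a minimal sketch in Python. It is not taken from [3] or [4]; the function names, the toy binary image (1 = ink, 0 = background), and the rule of treating an all-zero row or column as white space are assumptions made for illustration only.

```python
def projection_profiles(image):
    """Return (row_sums, col_sums) of a binary image given as a list of rows."""
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return row_sums, col_sums


def blank_runs(profile):
    """Return (start, end) index pairs of maximal runs where the profile is zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value == 0 and start is None:
            start = i                      # a white-space run begins
        elif value != 0 and start is not None:
            runs.append((start, i - 1))    # the run ended just before i
            start = None
    if start is not None:                  # run reaches the end of the profile
        runs.append((start, len(profile) - 1))
    return runs


# Hypothetical 4x5 page: two ink blocks separated by a blank row and a blank column.
page = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1],
]
rows, cols = projection_profiles(page)
row_gaps = blank_runs(rows)   # candidate horizontal cuts (between text lines)
col_gaps = blank_runs(cols)   # candidate vertical cuts (between columns)
```

A top-down splitter would cut the page at these gaps and recurse on each block. The sketch also shows why such methods depend on rectangular layouts: a skewed or non-rectangular region leaves no all-zero row or column to cut along.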

Nicolas et al. [8] extracted the text from a document image using a 2-D conditional random field model that integrates contextual knowledge and a machine learning technique. Another approach, for the binarization of a document image, was proposed based on a Bayesian framework using a Markov random field model of the image [9]. Junga et al. [10] achieved text segmentation by applying a region growing procedure based on the response of a stroke filter and then improved the segmentation by using an OCR feedback procedure. Chen and Wu [11] developed a document segmentation approach that integrates multiplane segmentation and a multilevel thresholding method. Su et al. [12] proposed a method to segment a degraded document image by combining a contrast map and the Canny edge detector, followed by thresholding.

Recently, wavelet techniques have become powerful tools in the document image analysis domain. They are particularly good at describing a scene in terms of the scale of the textures in it. Li and Gray [13] used features based on distribution characteristics of wavelet coefficients in high frequency bands to segment document images into four classes, namely, background, photograph, text, and graph. While Kundu and Acharyya [14] proposed a text-graphics segmentation method based on wavelet scale-space features followed by the k-means clustering algorithm, Deng et al. [15] used a cubic B-spline wavelet for feature extraction and k-means for text-graphics segmentation. On the other hand, Lee et al. [16] proposed an algorithm based on local energy estimation in the wavelet packet domain and the k-means algorithm. Kumar et al. [17] used globally matched wavelet filters and a Markov random field model to segment document images into text, background, and picture components. Haneda and Bouman [18] combined cost optimized segmentation and connected component classification into a multiscale framework in order to improve the text-graphics segmentation accuracy for compression.
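Several of the methods above end with k-means clustering of per-pixel feature vectors. The following is a minimal, generic sketch of that clustering step in plain Python, not the implementation of any cited work; the two-dimensional toy features and the choice of initial centroids are assumptions for illustration.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means (hard c-means) on a list of feature vectors."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(list(old))  # keep an empty cluster's centroid
        if new_centroids == centroids:           # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical per-pixel features, e.g. local energies in two subbands:
# high values stand in for "text-like" texture, low values for smooth regions.
features = [[0.9, 0.8], [1.0, 0.9], [0.8, 1.0],
            [0.1, 0.2], [0.2, 0.1], [0.0, 0.1]]
labels, centers = kmeans(features, [features[0], features[3]])
```

Each pixel receives exactly one label here. That hard assignment is the limitation that fuzzy and rough-fuzzy variants of c-means aim to relax for pixels lying between text and graphics regions.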

In this background, this paper presents a text-graphics segmentation method, integrating judiciously the merits of rough-fuzzy clustering and multiresolution image analysis. The proposed method decomposes a composite image into multiresolution multidirectional feature subbands using wavelet analysis. According to Chang and Kuo [19], most of the significant textural features are prevalent in the intermediate frequency subbands [20]. In this regard, the M-band wavelet packet transform is used in the proposed method for feature extraction. It recursively decomposes both the high frequency and low frequency bands at each scale. However, the complete decomposition tree, obtained by decomposing all the subbands at each scale, is usually not required. Hence, an appropriate method of selecting the significant and relevant features is required. Subsequently, features are computed from this set of selected bases by using nonlinear energy estimation followed by a smoothing filter. The use of M-band wavelet packet decomposition gives rise to a large number of features, which incurs redundancy. Therefore, selection of the appropriate features using a basis selection algorithm is required. Since no a priori knowledge about the image is available, an unsupervised approach is proposed for feature selection. Finally, the rough-fuzzy-possibilistic c-means (RFPCM) algorithm [21] is used to segment the document image using the feature vectors. It adds the concept of fuzzy membership (both probabilistic and possibilistic) of fuzzy sets, and lower and upper approximations of rough sets, into the k-means or c-means algorithm. While the membership of fuzzy sets enables efficient handling of overlapping partitions, the rough sets deal with uncertainty, vagueness, and incompleteness in cluster definition.
Due to the integration of both probabilistic and possibilistic memberships, the RFPCM avoids the noise sensitivity of fuzzy c-means [22] and the coincident clusters of possibilistic c-means [23]. Also, the concept of a crisp lower bound and a fuzzy boundary of a cluster, introduced in rough-fuzzy-possibilistic c-means, enables efficient selection of cluster prototypes. The performance of the proposed method, along with a comparison with other related approaches, is demonstrated both qualitatively and quantitatively on a set of real life document images.
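The interplay of fuzzy memberships with a crisp lower approximation and a fuzzy boundary can be illustrated with a simplified sketch. This is not the RFPCM algorithm of [21]: the possibilistic membership and the prototype update are omitted, and the threshold `delta`, the function names, and the toy data are assumptions. The sketch computes standard probabilistic fuzzy c-means memberships and then places an object in the lower approximation of its best cluster only when that membership exceeds the second best by more than `delta`.

```python
def fcm_memberships(point, centroids, m=2.0):
    """Probabilistic fuzzy c-means memberships of one point; they sum to 1."""
    d2 = [sum((x - c) ** 2 for x, c in zip(point, cen)) for cen in centroids]
    if 0.0 in d2:  # the point coincides with at least one centroid
        hits = d2.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in d2]
    power = 1.0 / (m - 1.0)  # exponent applied to ratios of squared distances
    return [1.0 / sum((d2[i] / d2[k]) ** power for k in range(len(d2)))
            for i in range(len(d2))]


def rough_regions(memberships, delta):
    """Split objects into crisp lower approximations and fuzzy boundaries."""
    n_clusters = len(memberships[0])
    lower = [[] for _ in range(n_clusters)]
    boundary = [[] for _ in range(n_clusters)]
    for idx, u in enumerate(memberships):
        order = sorted(range(n_clusters), key=lambda j: u[j], reverse=True)
        best, second = order[0], order[1]
        if u[best] - u[second] > delta:
            lower[best].append(idx)        # definitely belongs to one cluster
        else:
            boundary[best].append(idx)     # possibly belongs to either cluster
            boundary[second].append(idx)
    return lower, boundary


centroids = [[0.0, 0.0], [1.0, 1.0]]                # hypothetical prototypes
points = [[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]]       # last point is ambiguous
U = [fcm_memberships(p, centroids) for p in points]
lower, boundary = rough_regions(U, delta=0.3)
```

In a rough-fuzzy scheme, objects in the lower approximation would typically influence a cluster prototype more strongly than boundary objects. The weighting actually used by RFPCM is defined in [21] and is not reproduced here.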

The structure of the rest of this paper is as follows: Section 2 describes the proposed text-graphics segmentation method based on rough-fuzzy-possibilistic c-means and M-band wavelet packet. An unsupervised feature selection algorithm is introduced to select relevant and non-redundant features for segmentation. Section 3 presents the experimental results on several document images and a comparison among different methods. Finally, concluding remarks are given in Section 4.

Proposed text-graphics segmentation method

The proposed text-graphics segmentation algorithm based on M-band wavelet packet and rough-fuzzy-possibilistic c-means is illustrated in Fig. 1. The algorithm proceeds as follows:

  1. The input image is decomposed using M-band wavelet packet into m subbands based on energy estimation of each subband with respect to two threshold values.

  2. These outputs are subjected to a nonlinear operation followed by a smoothing operation.

  3. The unsupervised feature selection algorithm is applied to the feature
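The first two steps can be sketched on a one-dimensional signal, under loud assumptions: a 2-band Haar filter pair stands in for the M-band filter bank, the full packet tree is built without the energy-threshold pruning mentioned in step 1 (the thresholds are not given in this excerpt), and the absolute value followed by a moving average stands in for the unspecified nonlinear and smoothing operations. All function names are hypothetical.

```python
def haar_split(signal):
    """One 2-band Haar analysis step: return (low, high) half-length bands."""
    s = 2 ** 0.5
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return low, high


def wavelet_packet(signal, levels):
    """Full packet tree: split every band (low and high) at each level."""
    bands = [list(signal)]
    for _ in range(levels):
        bands = [child for band in bands for child in haar_split(band)]
    return bands


def local_energy(band, window=3):
    """Nonlinearity (absolute value) followed by moving-average smoothing."""
    rectified = [abs(x) for x in band]
    half = window // 2
    smoothed = []
    for i in range(len(rectified)):
        neighbours = rectified[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


# Toy signal (length is a power of two): a flat half and a rapidly oscillating half.
signal = [4.0, 4.0, 4.0, 4.0, 1.0, -1.0, 1.0, -1.0]
bands = wavelet_packet(signal, levels=2)      # 2 levels -> 4 subbands
features = [local_energy(b) for b in bands]   # one feature sequence per subband
```

Unlike the ordinary wavelet transform, the packet tree also splits the high frequency band, which is what allows it to isolate narrow high frequency components. With M bands per split and two image dimensions the number of subbands grows much faster, which is why a feature selection step follows.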

Experimental results

The proposed text-graphics segmentation method judiciously integrates the merits of rough-fuzzy-possibilistic c-means (RFPCM) [21] and M-band wavelet packet analysis. In this section, the performance of the proposed method is extensively compared with that of different clustering algorithms and several feature extraction techniques. The clustering algorithms compared are k-means or hard c-means (HCM) [33], fuzzy c-means (FCM) [22], [34], possibilistic c-means (PCM) [23], fuzzy-possibilistic c

Conclusion

In this paper, a new methodology is presented, integrating judiciously the merits of rough-fuzzy-possibilistic c-means algorithm and multiresolution image analysis, for segmenting the text part from the graphics region based on textural cues. The rough-fuzzy-possibilistic c-means combines c-means algorithm, rough sets, and probabilistic and possibilistic memberships of fuzzy sets. This formulation is geared towards maximizing the utility of both rough sets and fuzzy sets with respect to

Acknowledgements

This work is partially supported by the Indian National Science Academy, New Delhi, India (grant no. SP/YSP/68/2012). The authors would like to thank the anonymous referees and Prof. M.K. Kundu of the Indian Statistical Institute, Kolkata, for providing helpful comments and valuable criticisms on the original version of the manuscript, which have greatly improved the presentation of the paper.

References (35)

  • L. O’Gorman, The document spectrum for page layout analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1993)

  • S. Nicolas et al., Document Image Segmentation Using a 2D Conditional Random Field Model

  • T. Lelore et al., Document Image Binarisation Using Markov Field Model

  • C. Junga et al., A new approach for text segmentation using a stroke filter, Signal Process. (2008)

  • B. Su et al., Robust document image binarization technique for degraded document images, IEEE Trans. Image Process. (2013)

  • J. Li et al., Context-based multiscale classification of document images using wavelet coefficient distributions, IEEE Trans. Image Process. (2000)

  • M. Acharyya et al., Document image segmentation using wavelet scale-space features, IEEE Trans. Circuits Syst. Video Technol. (2002)