Elsevier

Pattern Recognition

Volume 42, Issue 12, December 2009, Pages 3146-3157
Handwritten Chinese text line segmentation by clustering with distance metric learning

https://doi.org/10.1016/j.patcog.2008.12.013

Abstract

Separating text lines in unconstrained handwritten documents remains a challenge because handwritten text lines are often non-uniformly skewed and curved, and the space between lines is not obvious. In this paper, we propose a novel text line segmentation algorithm based on minimal spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components (CCs) of the document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting the edges using a new hypervolume reduction criterion and a straightness measure. By learning the distance metric in a supervised manner on a dataset of pairs of CCs, the proposed algorithm is made robust to various documents with multi-skewed and curved text lines. In experiments on a database of 803 unconstrained handwritten Chinese document images containing a total of 8,169 lines, the proposed algorithm achieved a line detection correct rate of 98.02% and compared favorably with other competitive algorithms.

Introduction

Text line segmentation from document images is one of the major problems in document image analysis. It provides crucial information for the tasks of text block segmentation, character segmentation and recognition, and text string recognition. Whereas the difficulty of machine-printed document analysis mainly lies in the complex layout structure and degraded image quality, handwritten document analysis is difficult mainly due to the irregularity of layout and character shapes originating from the variability of writing styles. For unconstrained handwritten documents, text line segmentation and character segmentation-recognition remain unsolved, although enormous efforts have been devoted to them and great advances have been made.

Text line segmentation of handwritten documents is much more difficult than that of printed documents. Whereas printed documents have approximately straight and parallel text lines, the lines in handwritten documents are often non-uniformly skewed and curved. Moreover, the spaces between handwritten text lines are often not obvious compared to the spaces between within-line characters, and some text lines may interfere with each other. Therefore, many text line detection techniques, such as projection analysis [1], [2], [3], [4], [5], [6], [7] and K-nearest neighbor grouping of connected components (CCs) [12], [13], [14], are not able to segment handwritten text lines successfully. Fig. 1 shows an example of an unconstrained handwritten Chinese document with segmentation results by the X–Y cut algorithm [1], the stroke skew correction algorithm [6], the Docstrum algorithm [12] and the piece-wise projection algorithm [5]. In this case, the X–Y cut algorithm and the stroke skew correction algorithm succeed in detecting the text lines but fail to locate their boundaries. The Docstrum algorithm locates the boundaries of text lines very well, but fails to detect some lines correctly (the first and fourth lines in Fig. 1(c)) because of the anomalous size of characters. Although the piece-wise projection algorithm overcomes the aforementioned errors, it fails to segment some small-size CCs (the first and eighth lines in Fig. 1(d)).

Many efforts have been devoted to the difficult problem of handwritten text line segmentation [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. The methods can be roughly categorized into three classes: top-down, bottom-up, and hybrid. Top-down methods partition the document image recursively into text regions, text lines, and words/characters under the assumption of straight lines. Bottom-up methods group small units of the image (pixels, CCs, characters, words, etc.) into text lines and then text regions. Bottom-up grouping can be viewed as a clustering process, which aggregates image components according to proximity and does not rely on the assumption of straight lines. Hybrid methods combine bottom-up grouping and top-down partitioning in different ways. All three approaches have their disadvantages. Top-down methods do not perform well on curved and overlapping text lines. The performance of bottom-up grouping relies on heuristic rules or artificial parameters, such as the between-component distance metric for clustering. Hybrid methods, in turn, are computationally complicated, and the design of a robust combination scheme is non-trivial.

In this paper, we propose an effective bottom-up method for text line segmentation in unconstrained handwritten Chinese documents. Our approach is based on minimal spanning tree (MST) clustering of CCs, and the distance metric between CCs is designed by supervised learning. The number of clusters, namely the number of text lines, is automatically decided by a new hypervolume reduction criterion. Except for some empirical parameters in pre-processing of CCs and in post-processing of text lines, the clustering algorithm itself has no artificial parameters. An experimental comparison of clustering with the learned metric against clustering with an artificially designed metric shows that supervised metric learning largely improves the accuracy of text line segmentation. The proposed method was also compared with other state-of-the-art methods in experiments on a large database of handwritten Chinese documents, and its superiority was demonstrated. By customizing the between-component features and training with documents of specific languages, we suggest that the proposed method is also applicable to documents in other languages.
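To make the idea of a learned between-component distance concrete, the following is a minimal sketch and not the authors' method. It assumes two hypothetical pair features (height-normalized horizontal and vertical centroid offsets of CC bounding boxes) and swaps in plain logistic regression as the learner; the function names `pair_features`, `train_weights`, and `learned_distance` are illustrative only. The predicted probability that a pair lies in different lines is used as the distance.

```python
import math

def pair_features(a, b):
    """Hypothetical features for two CC bounding boxes given as (x0, y0, x1, y1)."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    h = ((a[3] - a[1]) + (b[3] - b[1])) / 2 or 1.0  # mean height; avoids dividing by 0
    return [abs(ax - bx) / h, abs(ay - by) / h]

def score(a, b, w, bias):
    """Linear score of a pair; larger means more likely between-line."""
    return bias + sum(wi * fi for wi, fi in zip(w, pair_features(a, b)))

def train_weights(pairs, labels, lr=0.1, epochs=500):
    """Logistic regression by stochastic gradient descent.
    Label 1 = pair from different lines, label 0 = pair from the same line."""
    w, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (a, b), y in zip(pairs, labels):
            f = pair_features(a, b)
            p = 1.0 / (1.0 + math.exp(-score(a, b, w, bias)))
            g = p - y  # gradient of the log loss with respect to the score
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            bias -= lr * g
    return w, bias

def learned_distance(a, b, w, bias):
    """Distance in (0, 1): estimated probability that a and b are in different lines."""
    return 1.0 / (1.0 + math.exp(-score(a, b, w, bias)))

# Toy training set (made up for illustration): side-by-side vs. stacked boxes.
pairs = [
    ((0, 0, 10, 10), (12, 0, 22, 10)),   # same line
    ((0, 1, 10, 11), (13, 0, 23, 10)),   # same line
    ((0, 0, 10, 10), (0, 14, 10, 24)),   # different lines
    ((0, 0, 10, 10), (2, 15, 12, 25)),   # different lines
]
labels = [0, 0, 1, 1]
w, bias = train_weights(pairs, labels)

d_within = learned_distance((0, 0, 10, 10), (11, 0, 21, 10), w, bias)
d_between = learned_distance((0, 0, 10, 10), (1, 13, 11, 23), w, bias)
print(d_within < d_between)
```

The point of the sketch is only that a metric fitted to labeled CC pairs can rank a wide horizontal gap as closer than a small vertical gap, which a plain Euclidean distance would not do.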

The rest of this paper is organized as follows. In Section 2, we give a brief review of related work. An overall description of our clustering-based text line segmentation method is given in Section 3, and the distance metric learning scheme is elaborated in Section 4. In Section 5, we present the hypervolume reduction criterion and the straightness measure for text line grouping. Experimental results are presented in Section 6 and concluding remarks are given in Section 7.

Previous works

The structure of a document image is a hierarchy of text regions, text lines, words, characters and CCs. Text lines can be extracted by either top-down region partitioning, bottom-up components aggregation, or a hybrid scheme. Some representative segmentation methods are reviewed below.

The X–Y cut algorithm [1], [2] is a typical projection-based top-down segmentation method. It uses horizontal and vertical projection histograms alternately along the X and Y axes so as to partition the document

Clustering based text line segmentation

In this section, we describe the rationale of our approach and the MST clustering algorithm. The distance metric learning and text line grouping techniques are elaborated in Sections 4 and 5, respectively. The performance of MST clustering relies on the metric of distance between image components. After clustering, the resulting tree is carefully cut into subtrees, each corresponding to a text line.
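The tree-building step can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: each CC is reduced to a centroid, Prim's algorithm is used to build the MST on the complete graph, and `anisotropic` is a hand-set stand-in metric (with a made-up vertical weight `wy`) where the paper would use the learned one.

```python
import math

def minimum_spanning_tree(points, dist):
    """Prim's algorithm on the complete graph over `points`.
    Returns a list of (i, j, d) edges, where i and j index into `points`."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax distances of the remaining nodes through u.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def anisotropic(p, q, wy=3.0):
    """Hypothetical hand-set metric that penalizes vertical offset more than horizontal."""
    return math.hypot(p[0] - q[0], wy * (p[1] - q[1]))

# Toy centroids (made up): indices 0-2 form one line, 3-5 the line below it.
centroids = [(0, 0), (10, 1), (20, 0), (0, 15), (10, 16), (20, 15)]
tree = minimum_spanning_tree(centroids, anisotropic)
for i, j, d in tree:
    print(i, j, round(d, 2))
```

In this toy layout the tree contains exactly one edge joining the two lines, and it happens to be the longest one. On real handwriting the between-line edges are not reliably the longest, which is why the paper cuts edges with a hypervolume reduction criterion and a straightness measure; neither is reproduced here.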

Distance metric learning

As many clustering algorithms rely critically on the distance metric between pairs of input units, some recent studies have contributed to metric learning from data [32], [33], [34]. To improve the performance of fuzzy c-means clustering, an evolutionary algorithm was used to optimize the scales of the dimensions of the input data set [32]. Domeniconi [33] proposed a variant of the k-means algorithm in which individual Euclidean metric weights were learned for each cluster. Xing et al. [34] combined

Text line grouping

Although the learned distance metric encourages the components in the same text line to be connected in a subtree, there are still some components from different lines connected. Since between-line edges are not obvious because their lengths (distances between components) are not necessarily longer than the within-line edge lengths, to correctly recognize and cut the between-line edges is non-trivial. Although several algorithms [36], [37], [38], [39] on this problem have been proposed, they do

Experimental results

We evaluated the performance of our algorithm on a large database of unconstrained handwritten Chinese documents and compared it with some existing reference algorithms. Below, we briefly describe the database and evaluation methodology, outline the reference algorithms, and then present the experimental results.

Conclusion

We propose a new method for text line segmentation in unconstrained handwritten Chinese document images based on minimum spanning tree (MST) clustering with distance metric learning. This bottom-up method is able to segment multi-skewed, curved and slightly overlapping text lines. Except for some empirical parameters (which are easy to determine and do not influence the performance critically) in pre-processing of connected components (CCs) and post-processing of text lines, this algorithm has no

Acknowledgments

The authors would like to thank Tonghua Su for authorizing us to use the HIT-MW database, Zhenglong Li for discussions on distance metric learning, and Gang Liu and Yi Li for their suggestions on the experiments. This research was supported by the National Natural Science Foundation of China (NSFC) under grant nos. 60775004 and 60825301.

References (48)

  • T. Su, T. Zhang, H. Huang, Y. Zhou, Skew detection for Chinese handwriting by horizontal stroke histogram, in:...
  • C. Weliwitage, A.L. Harvey, A.B. Jennings, Handwritten document offline text line segmentation, in: Proceedings of...
  • M. Liwicki, E. Indermuehle, H. Bunke, On-line handwritten text line detection using dynamic programming, in:...
  • U. Pal et al., Multioriented and curved text lines extraction from Indian documents, IEEE Transactions on Systems, Man and Cybernetics, Part B (2004)
  • U. Pal, P.P. Roy, Text line extraction from India document, in: Proceeding of Fifth International Conference on...
  • L. O’Gorman, The document spectrum for page layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (1993)
  • F. Kimura, Y. Miyake, M. Shridhar, Handwritten ZIP code recognition using lexicon free word recognition algorithm, in:...
  • L. Likforman-Sulem, C. Faure, Extracting lines on handwritten document by perceptual grouping, in: Advances in...
  • S. Nicola, T. Paquet, L. Heutte, Text line segmentation in handwritten document using a production system, in:...
  • I.S.I. Abuhaiba, S. Datta, M.J.J. Holt, Line extraction and stroke ordering of text pages, in: Proceeding of the Third...
  • A. Simon et al., A fast algorithm for bottom-up document layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (1997)
  • Y. Pu, Z. Shi, A natural learning algorithm based on Hough transform for text lines extraction in handwritten document,...
  • L. Likforman-Sulem, A. Hanimyan, C. Faure, A Hough based algorithm for extracting text lines in handwritten documents,...
  • G. Louloudis, B. Gatos, I. Pratikakis, K. Halatis, A block-based Hough transform mapping for text line detection in...

    About the Author—FEI YIN received the B.S. degree in Computer Science from Xidian University, Xi’an, China, the M.E. degree in Pattern Recognition and Intelligent Systems from Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively. He is currently pursuing a Ph.D. degree in Pattern Recognition and Intelligent Systems at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include document image analysis, handwritten character recognition and computer vision.

    About the Author—CHENG-LIN LIU received the B.S. degree in Electronic Engineering from Wuhan University, Wuhan, China, the M.E. degree in Electronic Engineering from Beijing Polytechnic University, Beijing, China, the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. From 2005, he has been a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, and is now the Deputy Director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has published over 80 technical papers at international journals and conferences. He won the IAPR/ICDAR Young Investigator Award of 2005.
