Handwritten Chinese text line segmentation by clustering with distance metric learning

doi:10.1016/j.patcog.2008.12.013

Pattern Recognition

Volume 42, Issue 12, December 2009, Pages 3146-3157

https://doi.org/10.1016/j.patcog.2008.12.013

Abstract

Separating text lines in unconstrained handwritten documents remains a challenge because handwritten text lines are often non-uniformly skewed and curved, and the space between lines is not obvious. In this paper, we propose a novel text line segmentation algorithm based on minimal spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components (CCs) of the document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting the edges using a new hypervolume reduction criterion and a straightness measure. By learning the distance metric through supervised learning on a dataset of CC pairs, the proposed algorithm is made robust enough to handle various documents with multi-skewed and curved text lines. In experiments on a database of 803 unconstrained handwritten Chinese document images containing a total of 8,169 lines, the proposed algorithm achieved a line detection correct rate of 98.02% and compared favorably to other competitive algorithms.

Introduction

Text line segmentation from document images is one of the major problems in document image analysis. It provides crucial information for the tasks of text block segmentation, character segmentation and recognition, and text string recognition. Whereas the difficulty of machine-printed document analysis lies mainly in the complex layout structure and degraded image quality, handwritten document analysis is difficult mainly because of the irregularity of layout and character shapes originating from the variability of writing styles. For unconstrained handwritten documents, text line segmentation and character segmentation-recognition remain unsolved, although enormous effort has been devoted to them and great advances have been made.

Text line segmentation of handwritten documents is much more difficult than that of printed documents. Whereas printed documents have approximately straight and parallel text lines, the lines in handwritten documents are often non-uniformly skewed and curved. Moreover, the spaces between handwritten text lines are often not obvious compared to the spaces between within-line characters, and some text lines may interfere with each other. Therefore, many text line detection techniques, such as projection analysis [1], [2], [3], [4], [5], [6], [7] and K-nearest neighbor connected components (CCs) grouping [12], [13], [14], are not able to segment handwritten text lines successfully. Fig. 1 shows an example of an unconstrained handwritten Chinese document with segmentation results from the X–Y cut algorithm [1], the stroke skew correction algorithm [6], the Docstrum algorithm [12] and the piece-wise projection algorithm [5]. In this case, the X–Y cut algorithm and the stroke skew correction algorithm succeed in detecting the text lines but fail to locate their boundaries. The Docstrum algorithm locates the boundaries of text lines very well, but fails to detect some lines (the first and fourth lines in Fig. 1(c)) correctly because of the anomalous size of characters. Although the piece-wise projection algorithm overcomes the aforementioned errors, it fails to segment some small-size CCs (the first and eighth lines in Fig. 1(d)).

Much effort has been devoted to the difficult problem of handwritten text line segmentation [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. The methods can be roughly categorized into three classes: top-down, bottom-up, and hybrid. Top-down methods partition the document image recursively into text regions, text lines, and words/characters under the assumption of straight lines. Bottom-up methods group small units of the image (pixels, CCs, characters, words, etc.) into text lines and then text regions. Bottom-up grouping can be viewed as a clustering process, which aggregates image components according to proximity and does not rely on the assumption of straight lines. Hybrid methods combine bottom-up grouping and top-down partitioning in different ways. All three approaches have their disadvantages. Top-down methods do not perform well on curved and overlapping text lines. The performance of bottom-up grouping relies on heuristic rules or artificial parameters, such as the between-component distance metric for clustering. Hybrid methods, in turn, are computationally complicated, and the design of a robust combination scheme is non-trivial.
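To make the straight-line assumption of top-down methods concrete, the following is a minimal sketch of horizontal projection-profile splitting. It is an illustration only, not any of the cited algorithms: the function name `split_lines_by_projection` and the toy binary images (lists of rows, 1 = ink) are assumptions introduced here.

```python
from typing import List, Tuple


def split_lines_by_projection(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges (bottom exclusive) of consecutive rows that contain ink."""
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]
    lines = []
    start = None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                      # a run of inked rows begins
        elif count == 0 and start is not None:
            lines.append((start, y))       # an empty row closes the run
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


# Two straight, well-separated lines: the profile has a clear gap.
straight = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

# Two skewed lines (hypothetical toy data): every row contains some ink,
# so the profile has no gap and the lines are merged.
skewed = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

print(split_lines_by_projection(straight))  # two row ranges
print(split_lines_by_projection(skewed))    # a single merged range
```

On the straight image the profile separates the two lines, whereas on the skewed image no empty row exists and the whole page is returned as one range, which is the failure mode on skewed and curved lines described above.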

In this paper, we propose an effective bottom-up method for text line segmentation in unconstrained handwritten Chinese documents. Our approach is based on minimal spanning tree (MST) clustering of CCs, and the distance metric between CCs is designed by supervised learning. The number of clusters, namely the number of text lines, is decided automatically by a new hypervolume reduction criterion. Except for some empirical parameters in the pre-processing of CCs and the post-processing of text lines, the clustering algorithm itself has no artificial parameters. An experimental comparison of clustering with a learned metric against clustering with an artificially designed metric shows that supervised metric learning greatly improves the accuracy of text line segmentation. The proposed method was also compared with other state-of-the-art methods in experiments on a large database of handwritten Chinese documents, and its superiority was demonstrated. We suggest that, by customizing the between-component features and training on documents of specific languages, the proposed method is also applicable to documents in other languages.

The rest of this paper is organized as follows. In Section 2, we give a brief review of related work. An overall description of our clustering-based text line segmentation method is given in Section 3, and the distance metric learning scheme is elaborated in Section 4. In Section 5, we present the hypervolume reduction criterion and the straightness measure for text line grouping. Experimental results are presented in Section 6 and concluding remarks are given in Section 7.

Section snippets

Previous works

The structure of a document image is a hierarchy of text regions, text lines, words, characters and CCs. Text lines can be extracted by either top-down region partitioning, bottom-up components aggregation, or a hybrid scheme. Some representative segmentation methods are reviewed below.

The X–Y cut algorithm [1], [2] is a typical projection-based top-down segmentation method. It uses horizontal and vertical projection histograms alternately along the X and Y axis so as to partition the document

Clustering based text line segmentation

In this section, we describe the rationale of our approach and the MST clustering algorithm. The distance metric learning and text line grouping techniques are elaborated in Section 4 (Distance metric learning) and Section 5 (Text line grouping), respectively. The performance of MST clustering relies on the metric of distance between image components. After clustering, the resulting tree is carefully cut into subtrees, each corresponding to a text line.
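The general idea of building a tree over components and cutting it into subtrees can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each CC is reduced to a hypothetical centroid, `placeholder_distance` is a hand-set stand-in for the learned metric, and a fixed length threshold replaces the hypervolume reduction criterion and straightness measure used in the paper.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Edge = Tuple[float, int, int]  # (distance, node_a, node_b)


def minimum_spanning_tree(points: List[Point],
                          dist: Callable[[Point, Point], float]) -> List[Edge]:
    """Prim's algorithm on the complete graph over the given points."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges: List[Edge] = []
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connection.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def cut_into_clusters(n: int, edges: List[Edge],
                      keep: Callable[[Edge], bool]) -> List[List[int]]:
    """Drop the edges rejected by `keep` and return the remaining subtrees."""
    adj: List[List[int]] = [[] for _ in range(n)]
    for edge in edges:
        if keep(edge):
            _, a, b = edge
            adj[a].append(b)
            adj[b].append(a)
    seen = [False] * n
    clusters = []
    for s in range(n):
        if seen[s]:
            continue
        stack, comp = [s], []
        seen[s] = True
        while stack:                      # depth-first traversal of one subtree
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
        clusters.append(sorted(comp))
    return clusters


def placeholder_distance(p: Point, q: Point) -> float:
    # ASSUMPTION: hand-set weights that penalize vertical offsets,
    # standing in for a metric learned from labeled CC pairs.
    return math.hypot(p[0] - q[0], 3.0 * (p[1] - q[1]))


# Hypothetical CC centroids: two horizontal text lines of three components each.
centroids = [(0, 0), (1, 0), (2, 0), (0, 5), (1, 5), (2, 5)]
edges = minimum_spanning_tree(centroids, placeholder_distance)
# ASSUMPTION: a fixed threshold replaces the paper's edge-cutting criteria.
clusters = cut_into_clusters(len(centroids), edges, keep=lambda e: e[0] < 5.0)
print(clusters)
```

With these toy centroids the tree contains a single long edge joining the two rows, and removing it leaves one subtree per text line. In real documents between-line edges are not reliably the longest ones, which is why a fixed threshold is only a stand-in here.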

Distance metric learning

As many clustering algorithms rely critically on the distance metric between pairs of input units, some recent studies have contributed to metric learning from data [32], [33], [34]. For improving the performance of fuzzy c-means clustering, an evolutionary algorithm was used to optimize the scales of the dimensions of input data set [32]. Domeniconi [33] proposed a variant of k-means algorithm in which individual Euclidean metric weights were learned for each cluster. Xing et al. [34] combined

Text line grouping

Although the learned distance metric encourages the components in the same text line to be connected in a subtree, there are still some components from different lines connected. Since between-line edges are not obvious because their lengths (distances between components) are not necessarily longer than the within-line edge lengths, to correctly recognize and cut the between-line edges is non-trivial. Although several algorithms [36], [37], [38], [39] on this problem have been proposed, they do

Experimental results

We evaluated the performance of our algorithm on a large database of unconstrained handwritten Chinese documents and compared it with some existing reference algorithms. In the following, we briefly describe the database and evaluation methodology, outline the reference algorithms, and then present the experimental results.

Conclusion

We propose a new method for text line segmentation in unconstrained handwritten Chinese document images based on minimum spanning tree (MST) clustering with distance metric learning. This bottom-up method is able to segment multi-skewed, curved and slightly overlapping text lines. Except for some empirical parameters (which are easy to determine and do not influence the performance critically) in the pre-processing of connected components (CCs) and the post-processing of text lines, this algorithm has no

Acknowledgments

The authors would like to thank Tonghua Su for authorizing us to use the HIT-MW database, Zhenglong Li for discussions on distance metric learning, and Gang Liu and Yi Li for their suggestions on the experiments. This research was supported by the National Natural Science Foundation of China (NSFC) under grant nos. 60775004 and 60825301.

About the Author: FEI YIN received the B.S. degree in Computer Science from Xidian University, Xi’an, China, and the M.E. degree in Pattern Recognition and Intelligent Systems from Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively. He is currently pursuing a Ph.D. degree in Pattern Recognition and Intelligent Systems at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research

References (48)

S. Basu et al., Text line extraction from multi-skewed handwritten documents, Pattern Recognition (2007)
K. Kise et al., Segmentation of page images using the area Voronoi diagram, Computer Vision and Image Understanding (1998)
B. Yu et al., A robust and fast skew detection algorithm for generic document, Pattern Recognition (1996)
F. Chang et al., A linear-time component-labeling algorithm using contour tracing technique, Computer Vision and Image Understanding (2004)
L. Yujian, A clustering algorithm based on maximal θ-distant subtrees, Pattern Recognition (2007)
G. Nagy et al., A prototype document image analysis system for technical journals, Computer (1992)
J. He, A.C. Downton, User-assisted archive document analysis for digital library construction, in: Proceedings of the...
A. Zahour, B. Taconet, P. Mercy, S. Ramdane, Arabic handwritten text-line extraction, in: Proceedings of the Sixth...
U. Pal, S. Datta, Segmentation of Bangla unconstrained handwritten text, in: Proceedings of the Seventh International...
M. Arivazhagan, H. Srinivasan, S. Srihari, A statistical approach to line segmentation in handwritten documents, in:...
T. Su, T. Zhang, H. Huang, Y. Zhou, Skew detection for Chinese handwriting by horizontal stroke histogram, in:...
C. Weliwitage, A.L. Harvey, A.B. Jennings, Handwritten document offline text line segmentation, in: Proceedings of...
M. Liwicki, E. Indermuehle, H. Bunke, On-line handwritten text line detection using dynamic programming, in:...
U. Pal et al., Multioriented and curved text lines extraction from Indian documents, IEEE Transactions on Systems, Man and Cybernetics, Part B (2004)
U. Pal, P.P. Roy, Text line extraction from India document, in: Proceeding of Fifth International Conference on...
L. O’Gorman, The document spectrum for page layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (1993)
F. Kimura, Y. Miyake, M. Shridhar, Handwritten ZIP code recognition using lexicon free word recognition algorithm, in:...
L. Likforman-Sulem, C. Faure, Extracting lines on handwritten document by perceptual grouping, in: Advances in...
S. Nicola, T. Paquet, L. Heutte, Text line segmentation in handwritten document using a production system, in:...
I.S.I. Abuhaiba, S. Datta, M.J.J. Holt, Line extraction and stroke ordering of text pages, in: Proceeding of the Third...
A. Simon et al., A fast algorithm for bottom-up document layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (1997)
Y. Pu, Z. Shi, A natural learning algorithm based on Hough transform for text lines extraction in handwritten document,...
L. Likforman-Sulem, A. Hanimyan, C. Faure, A Hough based algorithm for extracting text lines in handwritten documents,...
G. Louloudis, B. Gatos, I. Pratikakis, K. Halatis, A block-based Hough transform mapping for text line detection in...

About the Author: CHENG-LIN LIU received the B.S. degree in Electronic Engineering from Wuhan University, Wuhan, China, the M.E. degree in Electronic Engineering from Beijing Polytechnic University, Beijing, China, and the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. Since 2005, he has been a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, and is now the Deputy Director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially their applications to character recognition and document analysis. He has published over 80 technical papers in international journals and conferences. He won the IAPR/ICDAR Young Investigator Award of 2005.
