A robust approach to text line grouping in online handwritten Japanese documents

doi:10.1016/j.patcog.2008.10.019

Pattern Recognition

Volume 42, Issue 9, September 2009, Pages 2077-2088

Abstract

In this paper, we present an effective approach for grouping text lines in online handwritten Japanese documents by combining temporal and spatial information. With decision functions optimized by supervised learning, the approach has few artificial parameters and utilizes little prior knowledge. First, the strokes in the document are grouped into text line strings according to off-stroke distances. Each text line string, which may contain multiple lines, is segmented by optimizing a cost function trained by the minimum classification error (MCE) method. At the temporal merge stage, over-segmented text lines (caused by stroke classification errors) are merged with a support vector machine (SVM) classifier for making merge/non-merge decisions. Last, a spatial merge module corrects the segmentation errors caused by delayed strokes. Misclassified text/non-text strokes (stroke type classification precedes text line grouping) can be corrected at the temporal merge stage. To evaluate the performance of text line grouping, we provide a set of performance metrics for evaluating from multiple aspects. In experiments on a large number of free form documents in the Tokyo University of Agriculture and Technology (TUAT) Kondate database, the proposed approach achieves the entity detection metric (EDM) rate of 0.8992 and the edit-distance rate (EDR) of 0.1114. For grouping of pure text strokes, the performance reaches EDM of 0.9591 and EDR of 0.0669.

Introduction

With the increasing use of tablet PCs, electronic whiteboards, and digital pens on paper, users can draw various heterogeneous structures such as text, drawings and table forms freely on a large writing area. Such freely handwritten ink documents bring new challenges to automatic analysis and recognition. An ink document (online handwritten document) comprises a sequence of strokes, which are to be grouped into various structural units (drawings, text lines, words, characters, etc). Text lines are the most salient structures in ink documents and their reliable extraction is the pre-condition to further processing tasks such as character recognition, text editing and retrieval. This paper addresses the problem of text line grouping in freeform online handwritten Japanese documents.

Earlier ink document analysis systems [1], [2] assumed horizontal text lines and simply separated the text lines according to the projection profile. Liwicki et al. [3] proposed a text line detection method based on dynamic programming (DP) with a cost function involving hand-tuned parameters. This system was designed for ink pages with parallel text lines, and the online information was used merely in post-processing. It is important to point out that the assumption of text line regularity is often violated in ink documents because of the arbitrary line directions and inter-line distances [4]. Nevertheless, compared to offline documents, online handwriting has the advantage that the temporal order of strokes is available, which provides useful cues for text line grouping besides spatial information. Higher grouping accuracy is achievable by utilizing both the temporal and spatial information.

The system developed by Shilman et al. [4] over-segments the stroke sequence by DP with a cost function reflecting the confidence that a given set of strokes belongs to one word, and the text lines are grouped by merging pairs of stroke clusters in aggressive steps. Although the system utilizes good heuristics in merging, it lacks a principled framework to guarantee the optimality.

Ye et al. [5] proposed a global optimization method integrating the likelihood of the resulting lines and the consistency of their configuration for text line grouping. The initial segmentation by DP minimizes a cost function involving the number of text lines as a constraint, which can effectively avoid over-segmentation. Then, a local gradient-descent algorithm is used to evaluate the splitting and merging hypotheses iteratively to minimize the global cost function. The global optimization method is superior to heuristic rules because it incorporates both the prior knowledge and the information from data in a principled formulation. However, without a training procedure, the parameters in the cost function have to be tuned manually. In their enhanced system [6], a high-confidence-first (HCF) method was proposed to group text lines and writing regions, with an AdaBoost classifier for making merge/non-merge decisions. The classifier was trained from ink data and provides a confidence value for each merge.

For Japanese documents, Nakagawa et al. [7] describe a system in which the stroke sequence is segmented into text lines according to off-stroke (pen-up) distances and changes in writing direction, based on the fact that off-strokes within a text line are mostly shorter than those between text lines and that text lines are usually straight. However, due to the variability of character size and spacing, this segmentation method does not perform reliably.

To better utilize the temporal and spatial information in online handwriting, we propose a text line grouping approach with very few artificial parameters and little prior knowledge. We only assume that the text lines are approximately straight (curved lines will be segmented as piecewise linear text lines by the proposed approach), while the writing directions can be arbitrary and need not be parallel to each other. Unlike the projection based methods [1], [2], we do not estimate the inter-line distance, so the text lines can be arbitrarily close to each other. The above merits also favor the extensibility of the proposed method to text line grouping in non-Japanese ink documents.

The proposed text line grouping process proceeds in several stages. Initially, strokes are grouped into text line strings according to off-stroke distances. A discriminant function is trained under the string-level minimum classification error (MCE) criterion [8], [9] to separate the over-merged text lines with the beam search strategy. To correct the stroke classification errors and merge the misclassified strokes into text lines in documents of mixed text/drawing (in this case, a text/non-text classification procedure precedes text line grouping), a support vector machine (SVM) classifier is trained to make merge/non-merge decisions. Last, the errors caused by delayed strokes are amended by a spatial merge step.
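The first stage can be illustrated with a minimal sketch. It assumes each stroke is a list of (x, y) points in writing order and splits the sequence wherever the pen-up distance between consecutive strokes exceeds a threshold. The function name `group_by_off_stroke` and the `threshold` parameter are hypothetical; the text does not state how the threshold is chosen, so the value used here is only for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]


def group_by_off_stroke(strokes: List[Stroke], threshold: float) -> List[List[int]]:
    """Split a temporally ordered stroke sequence into groups of stroke indices.

    A new group starts whenever the off-stroke (pen-up) distance from the end
    of the previous stroke to the start of the current one exceeds `threshold`.
    The threshold is an assumed, illustrative parameter.
    """
    if not strokes:
        return []
    groups: List[List[int]] = [[0]]
    for i in range(1, len(strokes)):
        end_x, end_y = strokes[i - 1][-1]    # pen-up point of previous stroke
        start_x, start_y = strokes[i][0]     # pen-down point of current stroke
        off_stroke = math.hypot(start_x - end_x, start_y - end_y)
        if off_stroke > threshold:
            groups.append([i])               # long jump: start a new string
        else:
            groups[-1].append(i)             # short jump: same string
    return groups
```

Each resulting group may still contain several text lines, which is why the later segmentation and merge stages are needed.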

Partitioning a stroke sequence into text lines based on optimization has been tried in previous works [3], [4], [5], but without a training procedure, the parameters in the cost function have to be tuned manually. We take into account the similarity of this problem to character string recognition integrating character segmentation and classification [10], [11] and use string-level MCE training [8], [9], which has been applied to numeral string recognition [12], for training the parameters of the cost function. The merge/non-merge decision by an SVM classifier in the temporal merge stage is similar to that of Ref. [6], which uses an AdaBoost classifier. With our temporal merge module, text/non-text stroke classification errors can be corrected.

As for performance evaluation of text line grouping algorithms, there has not been a unified criterion. In Ref. [5], the recall metric, which is defined as the number of correct lines divided by the number of labeled lines in each page, is used to measure the accuracy of the system. In Refs. [4] and [6], the edit-distance metrics are employed to evaluate the system performance. The edit-distance between the text line detection result and the ground-truth is defined as the minimum number of split or merge operations needed to correct all errors, and the error rate is defined as the total number of edit operations divided by the total number of labeled text lines [6]. The system proposed by Liwicki et al. [3] adopts the stroke classification rate, defined as the number of correctly assigned strokes divided by the total number of strokes, and the document classification rate, defined as the number of correctly processed documents divided by the total number of documents. Each of the above metrics evaluates the system from a certain aspect, but an overall evaluation requires a systematic methodology. Inspired by the performance evaluation methods for graphics recognition systems [13], which are also used in the ICDAR page segmentation competitions and offline handwriting segmentation contest [14], we give an extensive set of evaluation metrics.
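Two of the ratios quoted above can be written down directly. The sketch below assumes a text line is represented as a collection of stroke indices and treats a result line as correct only when its stroke set exactly equals a labeled line; that matching rule and the function names `line_recall` and `edit_error_rate` are assumptions made for illustration, not definitions taken from the cited works. The number of edit operations is taken as an input rather than computed.

```python
from typing import Iterable, List


def line_recall(result_lines: List[Iterable[int]],
                labeled_lines: List[Iterable[int]]) -> float:
    """Number of correct lines divided by the number of labeled lines.

    Assumption: a labeled line counts as correct only if some result line
    contains exactly the same set of stroke indices.
    """
    result_sets = {frozenset(line) for line in result_lines}
    correct = sum(1 for line in labeled_lines if frozenset(line) in result_sets)
    return correct / len(labeled_lines)


def edit_error_rate(num_edit_operations: int, num_labeled_lines: int) -> float:
    """Total split/merge operations divided by total labeled text lines."""
    return num_edit_operations / num_labeled_lines


# Example: the second labeled line was over-segmented into two result lines,
# so one merge operation would repair the result.
labeled = [[0, 1, 2], [3, 4], [5]]
result = [[0, 1, 2], [3], [4], [5]]
recall = line_recall(result, labeled)       # two of three labeled lines found
error_rate = edit_error_rate(1, len(labeled))
```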

To demonstrate the effectiveness of the proposed approach, we have experimented on the Tokyo University of Agriculture and Technology (TUAT) Kondate database [7] with two settings: one with perfect text/non-text separation (stroke type labels given) and the other with a stroke classification module. The results show that the proposed text line grouping method is robust in both cases.

This paper is an extension of a conference paper [15]. The extension covers several respects: more detailed descriptions of the techniques, improved features in MCE training, extensive performance metrics, and experimental results and discussions. The rest of this paper is organized as follows: Section 2 gives an overview of our ink document analysis system. Section 3 details the text line grouping approach. Section 4 describes the performance evaluation metrics. Section 5 presents the experimental results and Section 6 offers our concluding remarks.

System overview

Text line grouping is one of the key parts of our online handwritten document analysis system (Fig. 1). After text/non-text separation by stroke classification, the text strokes are grouped into text lines. Some misclassified strokes can be corrected in the process of text line grouping. Last, each text line is recognized using a character string recognition algorithm [11].

The flows between the three parts in Fig. 1 are bidirectional. The stroke classification module may leave behind

Text line grouping

Before the grouping process, the average character size, which will be used in the following steps, is estimated from the text strokes in the ink document. Based on the fact that most Japanese characters are square blocks, we first calculate the bounding box of each stroke and sort the longer sides of the boxes. We then take the average of the larger half and discard the smaller half; otherwise the character size would be under-estimated [7].
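The estimation procedure described above can be sketched as follows, assuming each stroke is a list of (x, y) points. The function name `estimate_char_size` is hypothetical, and the handling of an odd number of strokes (the median falls into the larger half) is an assumption, since the text does not specify it.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]


def estimate_char_size(strokes: List[Stroke]) -> float:
    """Average the larger half of the longer bounding-box sides of all strokes."""
    if not strokes:
        raise ValueError("at least one text stroke is required")
    longer_sides = []
    for stroke in strokes:
        xs = [x for x, _ in stroke]
        ys = [y for _, y in stroke]
        width = max(xs) - min(xs)
        height = max(ys) - min(ys)
        longer_sides.append(max(width, height))  # longer side of the bounding box
    longer_sides.sort()
    # Keep the larger half; with an odd count the median is kept (assumption).
    larger_half = longer_sides[len(longer_sides) // 2:]
    return sum(larger_half) / len(larger_half)
```

Discarding the smaller half keeps short strokes, such as dots and small radicals, from pulling the estimate below the true character size.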

Performance evaluation

Systematic evaluation metrics are needed to measure the overall performance of text line grouping algorithms. We first define matches and errors between the text lines detected by the algorithm (result lines) and those in the ground-truth (ground-truthed or labeled lines), based on which a set of evaluation metrics is calculated. Some of the metrics were originally proposed to evaluate graphics recognition systems [13] and further used in the ICDAR page segmentation competitions and handwriting

Experiments

To evaluate the performance of the proposed text line grouping approach, we have experimented on the TUAT HANDS-Kondate_t_bf-2001-11 (in brief, Kondate) database of online freeform handwritten Japanese documents, collected from 100 people without any writing constraints [7]. We have also compared our temporal segmentation method with a previous method proposed in Ref. [7].

Conclusion

We presented a robust text line grouping approach for analyzing freeform online handwritten Japanese documents. After the relatively coarse pre-segmentation, we use a linear discriminant function trained under the string-level MCE criterion to separate over-merged text lines. Then the HCF method is employed for both temporal and spatial merge. To evaluate the performance, we give a set of metrics from multiple aspects. The experiments on the TUAT Kondate database demonstrate the effectiveness

Acknowledgments

This work was partially supported by the Central Research Laboratory of Hitachi Ltd., Tokyo, Japan. The authors thank the Nakagawa Laboratory of Tokyo University of Agriculture and Technology (TUAT) for providing the Kondate database.

About the Author—XIANG-DONG ZHOU received the B.S. degree in Applied Mathematics and the M.S. degree in Management Science and Engineering both from National University of Defense Technology, Changsha, China, in 1998 and 2003, respectively. Currently, he is working toward the Ph.D. degree in Pattern Recognition and Intelligent System at National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include character

References (18)

C.-L. Liu et al., Evaluation of prototype learning algorithms for nearest neighbor classifier in application to handwritten character recognition, Pattern Recognition (2001)
A.K. Jain, A.M. Namboodiri, J. Subrahmonia, Structure in on-line documents, in: Proceedings of the Sixth International...
E.H. Ratzlaff, Inter-line distance estimation and text line extraction for unconstrained online handwriting, in:...
M. Liwicki, E. Indermuhle, H. Bunke, On-line handwritten text line detection using dynamic programming, in: Proceedings...
M. Shilman, Z. Wei, S. Raghupathy, P. Simard, D. Jones, Discerning structure from freeform handwritten notes, in:...
M. Ye, H. Sutanto, S. Raghupathy, C.Y. Li, M. Shilman, Grouping text lines in freeform handwritten notes, in:...
M. Ye, P. Viola, S. Raghupathy, H. Sutanto, C. Li, Learning to group text lines and regions in freeform handwritten...
M. Nakagawa, M. Onuma, On-line handwritten Japanese text recognition free from constrains on line direction and...
B.-H. Juang et al., Minimum classification error rate methods for speech recognition, IEEE Trans. Speech Audio Process. (1997)

About the Author—DA-HAN WANG received the B.S. degree in Automation Science and Electrical Engineering from Beihang University, Beijing, China, in 2006, and in the same year he became a member of the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. He is currently working toward the Ph.D. degree in NLPR, CASIA. His research interest is character recognition.

About the Author—CHENG-LIN LIU received the B.S. degree in Electronic Engineering from Wuhan University, Wuhan, China, the M.E. degree in Electronic Engineering from Beijing Polytechnic University, Beijing, China, the Ph.D. degree in Pattern Recognition and Artificial Intelligence from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. From 2005, he has been a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, and is now the deputy director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has published over 70 technical papers at international journals and conferences. He won the IAPR/ICDAR Young Investigator Award of 2005.