
Pattern Recognition

Volume 47, Issue 3, March 2014, Pages 1202-1216

Unsupervised language model adaptation for handwritten Chinese text recognition

https://doi.org/10.1016/j.patcog.2013.09.015

Highlights

  • We propose an unsupervised language model (LM) adaptation framework for handwritten Chinese text recognition.

  • We use a two-pass recognition strategy with a pre-defined multi-domain LM set.

  • The adaptive LM is dynamically generated by one of three methods: model selection, model combination, or model reconstruction.

  • We compress the LM set by split vector quantization and principal component analysis.

Abstract

This paper presents an effective approach for unsupervised language model adaptation (LMA) using multiple models in offline recognition of unconstrained handwritten Chinese texts. The domain of the document to recognize is variable and usually unknown a priori, so we use a two-pass recognition strategy with a pre-defined multi-domain language model set. We propose three methods to dynamically generate an adaptive language model that matches the text output by the first-pass recognition: model selection, model combination, and model reconstruction. In model selection, we use the language model with minimum perplexity on the first-pass recognized text. In model combination, we learn the combination weights by minimizing the sum of squared error with both L2-norm and L1-norm regularization. In model reconstruction, we use a group of orthogonal bases to reconstruct a language model, with the coefficients learned to match the document to recognize. Moreover, we reduce the storage size of multiple language models using two compression methods: split vector quantization (SVQ) and principal component analysis (PCA). Comprehensive experiments on two public Chinese handwriting databases, CASIA-HWDB and HIT-MW, show that the proposed unsupervised LMA approach improves the recognition performance substantially, particularly for ancient-domain documents, whose recognition accuracy is improved by 7 percent. Meanwhile, the combination of the two compression methods greatly reduces the storage size of the language models with little loss of recognition accuracy.

Introduction

Handwritten Chinese character recognition has attracted much attention since the 1970s and has achieved tremendous advances [1], [2]. However, the recognition of unconstrained handwritten Chinese texts has been reported only in recent years, and the reported accuracies are quite low (e.g., a character-level correct rate of 39 percent in [3]). Besides the divergence of writing styles, handwritten text recognition is difficult due to the weak lexical constraint: the number of sentence classes is infinite. Although our recent work, which integrates multiple contexts including the linguistic context, achieved a high correct rate of 91 percent [4], many recognition errors remain due to insufficient modeling of the linguistic context; in particular, the language model often mismatches the domain of the handwritten text. To deal with this mismatch problem, we investigate language model adaptation for handwritten Chinese text recognition (HCTR) in this paper, particularly unsupervised adaptation for the scenario where no prior information about the domain of the text is available.

A language model (LM) provides a principled way to quantify the uncertainties associated with natural language, but this uncertainty varies across texts of different domains. The modeling of diverse domains was originally studied in speech recognition [5], leading to research on language model adaptation. Researchers often combine a generic LM with a domain-specific LM that is more relevant to the recognition task. However, it is usually difficult to obtain enough domain-specific data to learn the domain-specific model [5], and there is growing interest in collecting texts from the Internet to supplement sparse domain-specific resources [6]. For Chinese resources, Sogou Labs provides a large set of resources covering diverse domains extracted from the Internet, which can be used to train a multi-domain LM set via the SRILM toolkit [7]. This LM set can be adaptively applied in HCTR according to the domain of each document.

Language model adaptation (LMA) is difficult when the domain of the recognition task is unknown a priori; this case can only be handled by unsupervised approaches. Many efforts have been made in this direction in speech recognition. One method is to calculate the probability of a word given a document (here, the history of recognized text) as a uni-gram model by latent topic analysis [5], [8], [9], [10], and then interpolate it with the generic LM. Another popular method is to use the recognized text directly to estimate an adaptive n-gram model [11], [12], which is interpolated with the generic LM for the next-pass recognition. This method is usually based on a multi-pass (e.g., two-pass) recognition framework. In handwriting recognition, unsupervised LMA has been reported only in recent years, and the adaptive LMs used are very simple. For example, Xiu and Baird [13] applied the multi-pass recognition strategy iteratively to adapt a word lexicon, and Lee and Smith [14] further iteratively modified uni-gram probabilities for English whole-book recognition, where the texts are very long.
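To make the interpolation idea concrete, the following minimal Python sketch (not from the paper) blends a uni-gram model estimated from the first-pass recognized text with a generic LM; the smoothing constant, the generic_lm interface, and the interpolation weight are illustrative assumptions.

    # Illustrative sketch: interpolate a uni-gram estimated from the first-pass
    # recognized text with a generic LM (as in cache/topic-based adaptation).
    from collections import Counter

    def adapted_unigram(recognized_chars, vocab, smooth=1.0):
        """Uni-gram probabilities from recognized text, with additive smoothing."""
        counts = Counter(recognized_chars)
        total = len(recognized_chars) + smooth * len(vocab)
        return {c: (counts[c] + smooth) / total for c in vocab}

    def interpolated_prob(c, history, generic_lm, unigram, lam=0.8):
        """P(c|history) = lam * P_generic(c|history) + (1 - lam) * P_adapted(c)."""
        return lam * generic_lm(c, history) + (1.0 - lam) * unigram[c]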

Another difficulty of LMA is that the transcripts of handwritten texts are usually short. There are usually only a few hundred characters in a handwritten text (e.g., Fig. 1), impeding the direct adaptation of the lexicon or n-gram probabilities. In such situations, interpolation of models from a pre-defined multi-domain LM set is usually used in speech recognition, and the interpolation weights are learned by maximum a posteriori (MAP) estimation from held-out data similar to the task domain (supervised LMA, e.g., [5], [15]), from the recognized texts (unsupervised LMA, e.g., [16]), or from both (e.g., [17]). This solution, however, raises the problem of the large storage size of the LM set. To overcome this problem, the size of each LM is usually reduced by entropy-based pruning [18] and quantization-based compression [19].
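As a rough illustration of quantization-based compression (a generic sketch, not the SVQ/PCA scheme detailed in Section 6), the log-probabilities of an LM can be mapped onto a small codebook so that each value is stored as an 8-bit index; all names below are illustrative.

    import numpy as np

    def quantize_logprobs(logprobs, bits=8):
        """Quantize LM log-probabilities onto a uniform codebook of 2**bits levels.

        Storage drops from 32/64-bit floats to `bits`-bit indices plus a tiny
        codebook (assumes logprobs.max() > logprobs.min()).
        """
        levels = 2 ** bits
        lo, hi = float(logprobs.min()), float(logprobs.max())
        codebook = np.linspace(lo, hi, levels)
        indices = np.round((logprobs - lo) / (hi - lo) * (levels - 1))
        return codebook, indices.astype(np.uint8)

    def dequantize(codebook, indices):
        """Recover approximate log-probabilities from the codebook and indices."""
        return codebook[indices]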

In this paper, we propose an unsupervised LMA framework for HCTR. Since no prior information about the document to recognize (the test document) is available, we use a two-pass recognition strategy. Considering that handwritten texts are diverse and short, we dynamically generate an adaptive LM to match the first-pass recognized text of each document using a pre-defined multi-domain LM set. We propose three methods to generate adaptive LMs, namely model selection, model combination, and model reconstruction. The model selection method selects the best LM according to the minimum perplexity criterion. In model combination, we estimate the combination weights by minimizing the sum of squared error (MSE) with both L2-norm and L1-norm regularization. In model reconstruction, the adaptive LM is constructed from a group of orthogonal bases. To make the adaptation approach practical, we also consider the reduction of computational cost and storage space. To speed up the two-pass recognition, we store the candidate character classes of each candidate character pattern in the first-pass recognition to avoid repeated classification in the second pass. To reduce the storage size of the LM set, we compress the LMs using split vector quantization (SVQ) and principal component analysis (PCA). Finally, we evaluate the recognition performance on two public Chinese handwriting databases, CASIA-HWDB [20] and HIT-MW [21], and show that the proposed unsupervised LMA methods yield large performance improvements at a computational cost comparable to that of the baseline system.
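A minimal sketch of the model selection step might look as follows; the lm.order and lm.logprob interface is a hypothetical stand-in for a character-level n-gram model.

    import math

    def perplexity(lm, chars):
        """Per-character perplexity; lm.logprob(c, history) returns ln P(c|history)."""
        logp = 0.0
        for i, c in enumerate(chars):
            history = chars[max(0, i - lm.order + 1):i]
            logp += lm.logprob(c, history)
        return math.exp(-logp / len(chars))

    def select_lm(lm_set, first_pass_text):
        """Choose the domain LM with minimum perplexity on the recognized text."""
        return min(lm_set, key=lambda lm: perplexity(lm, first_pass_text))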

Unlike previous works on unsupervised LMA in speech recognition (e.g., [17]), we regard language model combination as a linear regression problem. We learn the combination weights by minimizing an error cost function (i.e., MSE) and further encourage model sparsity by adding L1-norm regularization, which is fundamentally different from the MAP or MBR (minimum Bayes' risk) based frameworks [17]. The MAP estimation is based on perplexity, and MBR takes the acoustic model into account under a supervised framework; both obtain a locally optimal solution by the Baum–Welch algorithm, whereas the MSE-based method minimizes the error loss on the recognized text and yields a globally optimal solution. Another contribution of this paper is a new LMA method based on model reconstruction, obtained by applying PCA to a pre-defined LM set. This idea is motivated by the technique developed by Sirovich and Kirby [22] for efficiently representing images of human faces using PCA, which has also been used successfully in image processing, for example in active shape models [23]. More importantly, the focus of this work is to investigate the role of LMA in Chinese handwriting recognition, which, to the best of our knowledge, has not been investigated in depth in the handwriting recognition field. A preliminary conference version of this paper was presented in [24]; this extended version provides more detailed descriptions, presents additional LMA methods, and reports a significantly extended experimental validation. By customizing the handwriting recognition algorithm, the proposed approach can also be applied to the recognition of documents in other languages (such as English and Arabic).
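The regression view can be sketched as follows: each column of X holds the probabilities that one domain LM assigns to the n-gram events of the first-pass text, y holds the corresponding target probabilities, and the weights are learned with L2 (closed-form ridge) or L1 (ISTA/proximal-gradient) regularization. The construction of X and y here is a simplifying assumption for illustration; the paper's exact formulation is given in Section 5.

    import numpy as np

    def combine_l2(X, y, lam=1e-3):
        """Ridge solution of argmin_w ||y - Xw||^2 + lam * ||w||^2."""
        k = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

    def combine_l1(X, y, lam=1e-3, n_iter=500):
        """Sparse weights for argmin_w 0.5*||y - Xw||^2 + lam*||w||_1 via ISTA."""
        step = 1.0 / (np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            w = w - step * (X.T @ (X @ w - y))                            # gradient step
            w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)      # soft-thresholding
        return w

    # The adaptive LM probabilities are then the weighted mixture X @ w
    # (the weights may be renormalized to sum to one before use).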

The rest of this paper is organized as follows: Section 2 reviews related works; Section 3 gives an overview of our HCTR system with unsupervised LMA; Section 4 briefly describes the statistical language models used in this paper; Section 5 describes in detail the unsupervised LMA methods of model selection, model combination, and model reconstruction; Section 6 introduces the LM compression methods; Section 7 presents the experimental results; and Section 8 offers concluding remarks.

Section snippets

Related works

The large variability of domains across different handwritten texts makes accurate language modeling a challenge. Language model adaptation (LMA) is a process that adapts the language model to match the domain of each recognition task. However, LMA has rarely been studied in handwriting recognition. Recently, Xiu and Baird [13] adapted a word lexicon from the previously recognized text for English whole-book recognition, and Lee and Smith [14] further modified word uni-gram probabilities in

System overview

The baseline handwriting recognition system without LMA was introduced in our previous paper [4]; it is based on the integrated segmentation-and-recognition framework with character over-segmentation. Since no prior information about each handwritten text is available, we use a two-pass recognition strategy for LMA. In the first pass, a generic LM is used to obtain a preliminary recognized text; this text is then used to build an adaptive LM for the second-pass recognition.
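At a high level, the two-pass procedure can be sketched as follows; the function names are placeholders for the recognizer components of the baseline system [4] and the adaptation methods of Section 5.

    def recognize_document(page_image, generic_lm, lm_set):
        """Two-pass recognition with unsupervised LM adaptation (illustrative sketch)."""
        # First pass: over-segment the text lines, classify candidate character
        # patterns, and decode with the generic LM. The candidate classes are
        # cached so that classification need not be repeated in the second pass.
        candidates = over_segment_and_classify(page_image)   # placeholder
        first_pass_text = decode(candidates, generic_lm)     # placeholder

        # Build an adaptive LM from the first-pass text, using model selection,
        # model combination, or model reconstruction over the pre-defined LM set.
        adaptive_lm = adapt_lm(lm_set, first_pass_text)      # placeholder

        # Second pass: rescore the cached candidates with the adaptive LM.
        return decode(candidates, adaptive_lm)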


Statistical language model

In the path evaluation function Eq. (1), the linguistic context score log P(C) plays a very important role; it is usually given by a statistical language model. The most popular language model is the n-gram model [37], where n is called the order of the model. Such a model characterizes the statistical dependency among n characters or words. Considering model complexity, the order n is usually 2 or 3, corresponding to the bi-gram and tri-gram models, respectively. In this paper, we evaluate

Language model adaptation

This section presents three language model adaptation (LMA) methods, which are necessary when the generic language model (LM) does not match the handwritten text well. Because the domain of the handwritten text is variable and unknown a priori, we use a two-pass recognition strategy for unsupervised adaptation of the language model, described as follows:

Two-pass recognition for unsupervised LMA

  (1) Use a generic language model (LM0) to recognize a document and obtain a preliminary transcript C.

  (2)

Language model compression

The above LMA methods depend on an LM set comprising K+1 LMs, which poses a storage challenge. Although we pruned each LM to a moderate size using entropy-based pruning [18], the storage size of K+1 models is still considerable for practical applications. Each LM contains two parts, namely the n-gram table and the probability value of each n-gram. The n-gram table is not considered in the following, because it is fixed in every LM. In this section, we introduce two methods to compress the storage

Experimental results

We evaluated the performance of our unsupervised LMA approaches on two databases: a large database of unconstrained Chinese handwriting, CASIA-HWDB [20], and a smaller data set, HIT-MW [21], both of which are freely available for research. All experiments were run on a desktop computer with a 3.10 GHz CPU, with programs implemented in Microsoft Visual C++ 2005.

Conclusion

This paper presents an unsupervised language model adaptation (LMA) framework for handwritten Chinese text recognition. Based on a two-pass recognition strategy, we propose three methods to dynamically generate an adaptive language model (LM) that matches the test document using a pre-defined multi-domain LM set, namely model selection, model combination, and model reconstruction. The experimental results show that the model combination of three selected LMs performs best, considering the tradeoff

Conflict of interest statement

None declared.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 60933010 and 61305005.


References (47)

  • A. Sethy et al., An iterative relative entropy minimization-based data selection approach for n-gram model adaptation, IEEE Transactions on Audio, Speech, and Language Processing (2009)
  • A. Stolcke, SRILM—an extensible language modeling toolkit, in: Proceedings of the 7th ICSLP, 2002, pp....
  • J.R. Bellegarda, Exploiting latent semantic information in statistical language modeling, Proceedings of the IEEE (2000)
  • D. Mrva, P.C. Woodland, Unsupervised language model adaptation for Mandarin broadcast conversation transcription, in:...
  • Y.-C. Tam, T. Schultz, Correlated latent semantic model for unsupervised LM adaptation, in: Proceedings of the 32nd...
  • M. Bacchiani, B. Roark, Unsupervised language model adaptation, in: Proceedings of the 28th ICASSP, 2003, pp....
  • G. Tur, A. Stolcke, Unsupervised language model adaptation for meeting recognition, in: Proceedings of the 32nd ICASSP,...
  • P. Xiu et al., Whole-book recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2012)
  • D.-S. Lee, R. Smith, Improving book OCR by adaptive language and image models, in: Proceedings of the 10th...
  • P.C. Woodland, T. Hain, G.L. Moore, T.R. Niesler, D. Povey, A. Tuerk, E.W.D. Whittaker, The 1998 HTK broadcast news...
  • C. Allauzen, M. Riley, Bayesian language model interpolation for mobile speech input, in: Proceedings of the 12th...
  • A. Stolcke, Entropy-based pruning of backoff language models, in: Proceedings of the DARPA Broadcast News Workshop,...
  • E.W.D. Whittaker, B. Raj, Quantization-based language model compression, in: Proceedings of the 7th Eurospeech, 2001,...

    Qiu-Feng Wang is an Assistant Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. He received the B.S. degree in computer science from Nanjing University of Science and Technology, Nanjing, China, and the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2006 and 2012, respectively. His research interests include sequential pattern recognition, handwritten text recognition and language models.

    Fei Yin is an Assistant Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. He received the B.S. degree in Computer Science from Xidian University of Posts and Telecommunications, Xi'an, China, the M.E. degree in Pattern Recognition and Intelligent Systems from Huazhong University of Science and Technology, Wuhan, China, the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1999, 2002 and 2010, respectively. His research interests include document image analysis, handwritten character recognition and image processing. He has published over 20 papers at international journals and conferences.

    Cheng-Lin Liu is a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, Beijing, China, and is now the deputy director of the laboratory. He received the B.S. degree in electronic engineering from Wuhan University, Wuhan, China, the M.E. degree in electronic engineering from Beijing Polytechnic University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has published over 150 technical papers at prestigious international journals and conferences. He is a fellow of the IAPR, and a senior member of the IEEE.
