Elsevier

Information Sciences

Volume 179, Issues 1–2, 2 January 2009, Pages 169-179

Minimum tag error for discriminative training of conditional random fields

https://doi.org/10.1016/j.ins.2008.09.018

Abstract

This paper proposes a new criterion, called minimum tag error (MTE), for discriminative training of conditional random fields (CRFs). The new criterion, a smoothed approximation to the sentence labeling error, aims to maximize the average tagging accuracy of all possible labelings of a sentence, weighted by their probabilities. Corpora from the second international Chinese word segmentation bakeoff (Bakeoff 2005) are used to test the effectiveness of this new training criterion. The experimental results demonstrate that the proposed minimum tag error criterion reliably improves the initial performance of supervised conditional random fields. In particular, the recall rate on out-of-vocabulary words (Roov) is significantly improved compared with that obtained using standard conditional random fields. Furthermore, the new training method is robust across all of the segmentation datasets.

Introduction

Conditional random fields (CRFs) [14] have recently become popular models for sequence labeling tasks because they offer several advantages over traditional generative models such as hidden Markov models (HMMs). Because CRFs are defined as conditional models of label sequences given observation sequences, they can make use of flexible, overlapping features and avoid the label bias problem. In recent years, CRFs have been successfully applied to many tasks, such as gene identification [11], spoken language understanding (SLU) [10], part-of-speech (POS) tagging [14], named entity recognition (NER) [3], [17], shallow parsing [23], and especially Chinese word segmentation [6], [21].

Unlike English and other Western languages, the Chinese language is based on characters rather than words, and there are no blank spaces between words in Chinese sentences. Word segmentation is the first step in Chinese language processing tasks such as information retrieval (IR) [2], text mining (TM) [4], [31], question answering (QA) [18], and machine translation (MT) [15], [26]. The goal of Chinese word segmentation is to segment a Chinese sentence into a sequence of meaningful words. Early methods for Chinese word segmentation mainly focused on dictionary-based approaches, which matched input sentences against a given dictionary. However, the "word" in Chinese is not a well-defined concept, and no generally accepted lexicon exists. Furthermore, different tasks may call for different segmentation granularities. In computer applications, "segmentation units" receive more attention than "words" [29]. For example, the Chinese term for "parallel computer" may be segmented as two segmentation units ("parallel"/"computer") in an information retrieval task, but may be regarded as one unit in a keyword extraction task.1 Moreover, new words come into being all the time. Because of such factors, statistically based methods are the mainstream approach to Chinese word segmentation, especially supervised machine learning methods such as HMM, maximum entropy (ME) [30], and CRFs [14]. Unlike dictionary-based methods, machine learning methods rely on statistical models learned automatically from corpora, making them more adaptive and robust across different corpora. In the recent SIGHAN Bakeoff competitions [5], [16], CRFs were widely used and outperformed other machine learning methods such as HMM, support vector machines (SVM) [7], and ME. However, little work has been done on training criteria for CRFs. In this paper, a new training criterion is presented which achieves further improvement in the performance of CRFs without adding to the commonly used unigram and bigram features.

CRF parameters were first estimated using the maximum log-likelihood (ML) criterion [14]. However, the ML criterion is prone to overfitting because CRFs are often trained with a very large number of overlapping features. The maximum a posteriori (MAP) criterion was then proposed in [23] to reduce overfitting. Large-margin methods have also been applied to parameter optimization [1], [25], [27]. Furthermore, the minimum classification error (MCE) criterion, long studied in the speech and pattern recognition communities, was adapted to CRF parameter estimation [24]. Gross et al. [8] proposed a training procedure that maximizes per-label predictive accuracy; the procedure is similar to MCE except that it is based on a pointwise rather than a sequential loss function.

These training criteria have achieved excellent performance on various tasks. For sequence labeling, the ideal CRF model is one that labels new sequences with the highest possible accuracy. However, it is difficult to find parameters which provide the best possible accuracy on training data: sequence tagging accuracy, measured by the number of correctly assigned labels, is a discontinuous function of the parameters, so gradient-based optimization methods cannot be applied to it directly. Therefore, other optimization methods, such as those mentioned above, are used instead.

This paper presents a new discriminative training criterion called minimum tag error (MTE), which is in the same spirit as MCE but has a different objective function that applies more naturally to sequence labeling. The MTE criterion is a smoothed approximation to the tag accuracy measured on the output of a sequence labeling system given the training data, and it can be optimized directly by gradient-based methods without the hand-chosen smoothing function required by the MCE criterion. The effectiveness of the new criterion is tested on Chinese word segmentation because word segmentation is a prerequisite step for Chinese information processing. The experimental results presented here show that the proposed criterion reliably improves on the initial results yielded by the MAP-trained model. Furthermore, the new approach recognizes out-of-vocabulary (OOV) words (i.e., words in the test corpus which do not occur in the training corpus) better than the standard MAP training method.
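Concretely, by analogy with the minimum phone error criterion discussed below, the MTE objective can be sketched as an expected tag accuracy (the notation here is illustrative, not quoted from the paper's own derivation):

```latex
F_{\mathrm{MTE}}(\lambda)
  \;=\; \sum_{r=1}^{R} \sum_{\mathbf{y}}
        P_{\lambda}(\mathbf{y} \mid X_{r})\,
        \mathrm{TagAcc}\!\left(\mathbf{y}, \mathbf{y}^{\mathrm{ref}}_{r}\right)
```

where the inner sum ranges over all candidate label sequences for the r-th training sentence and TagAcc counts the positions at which a hypothesis agrees with the reference labeling. Because each term is a smooth function of the parameters λ, the gradient is well defined and standard gradient-based optimizers apply.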

The remainder of this paper is structured as follows: Section 2 reviews standard conditional random fields. The main focus is Section 3, which introduces the MTE training method. Section 4 describes the experiments, Section 5 discusses the results, and Section 6 concludes the paper. Acknowledgements follow at the end.

Section snippets

Conditional random fields

Let X = 〈X1, X2, …, XR〉 be the observation (input) data sequences to be labeled, and let Y = 〈Y1, Y2, …, YR〉 be the set of corresponding label sequences, where R is the number of data sequences. All components of Yi (i = 1, 2, …, R) are assumed to range over a finite tag set T. For example, X might consist of unsegmented Chinese sentences, and Y might range over the boundary tags of these sentences, with T a set of boundary tags such as the commonly used BIO tags ("B" means the beginning of a word, "I" indicates a
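Although the section is truncated here, the definitions above lead to the standard linear-chain CRF distribution of Lafferty et al. [14]: for a single observation sequence x of length n and a tag sequence y ∈ T^n,

```latex
p_{\lambda}(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z_{\lambda}(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{n} \sum_{k} \lambda_{k}\,
                f_{k}(y_{t-1}, y_{t}, \mathbf{x}, t) \Big),
\qquad
Z_{\lambda}(\mathbf{x})
  = \sum_{\mathbf{y}' \in T^{n}}
    \exp\!\Big( \sum_{t=1}^{n} \sum_{k} \lambda_{k}\,
                f_{k}(y'_{t-1}, y'_{t}, \mathbf{x}, t) \Big)
```

where the f_k are the (possibly overlapping) feature functions, the λ_k are their weights, and Z_λ(x) is the normalizing partition function.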

Motivation

Since the maximum mutual information (MMI) criterion was first successfully applied to automatic speech recognition (ASR), there has been growing interest in a class of error-minimizing discriminative training criteria. The minimum phone error (MPE) criterion [19], [20] is one of the most attractive discriminative training techniques. In contrast to the traditional MMI criterion, which directly maximizes the posterior probability of the training utterances, the MPE training approach tries to

Experimental data

Corpora from the second international Chinese word segmentation bakeoff were used to verify the effectiveness of the MTE training method. Performance values are reported in terms of three major metrics [5], [16]: the F-score as given by F-score = 2PR/(P + R) (where P is the word precision, and R the word recall), the recall on OOV words (Roov) and the recall on in-vocabulary words (Riv).
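These metrics can be computed by matching the character spans of the gold and predicted segmentations; a word counts as correct only when both of its boundaries agree with the reference. The following sketch (function names hypothetical) also computes Roov as the recall restricted to gold words absent from the training vocabulary:

```python
def word_spans(words):
    """Pair each word with its (start, end) character span in the sentence."""
    pairs, pos = [], 0
    for w in words:
        pairs.append(((pos, pos + len(w)), w))
        pos += len(w)
    return pairs

def segmentation_scores(gold_words, pred_words, training_vocab):
    """Word precision, recall, F-score, and OOV recall for one sentence."""
    gold = word_spans(gold_words)
    pred_spans = {span for span, _ in word_spans(pred_words)}
    n_correct = sum(1 for span, _ in gold if span in pred_spans)
    p = n_correct / len(pred_spans)
    r = n_correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    # Roov: recall over gold words not seen in the training vocabulary.
    oov = [(span, w) for span, w in gold if w not in training_vocab]
    r_oov = (sum(1 for span, _ in oov if span in pred_spans) / len(oov)
             if oov else 0.0)
    return p, r, f, r_oov
```

In practice the counts are accumulated over the whole test corpus before the ratios are taken; the per-sentence version above is kept minimal for clarity.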

First, the PKU and MSR corpora were used to perform the segmentation evaluation on a closed track. Second, the

Discussions

The main contribution of this paper is a new criterion that integrates sentence tagging accuracy directly into the training process. The new criterion is inspired by the MPE criterion, which has been successfully applied in the field of speech recognition. To verify the effectiveness of the proposed MTE criterion, experiments were conducted on Chinese word segmentation tasks. Because many researchers have used the CRF model for Chinese word segmentation using Bakeoff 2005 datasets in

Conclusions

A new criterion, minimum tag error, is proposed in this paper for discriminative training of CRFs. The new criterion is a smoothed approximation to the weighted average accuracy of all possible labeling sequences in the lattice, and it can be optimized directly by gradient-based methods. Unlike the MAP training method, which maximizes the posterior probability of the correct sequence on the training dataset, the new criterion tries to make more accurate transcriptions more likely. Corpora

Acknowledgements

Thanks to Taku Kudo for providing the CRF toolkit package, to SIGHAN Bakeoff 2005 for providing the data, and to the reviewers for their useful suggestions.

References (36)

  • T. Emerson, The second international Chinese word segmentation bakeoff, in: Proceedings of the Fourth SIGHAN Workshop...
  • C.-L. Goh et al., Chinese word segmentation by classification of characters, Computational Linguistics and Chinese Language Processing (2005).
  • S.S. Gross, O. Russakovsky, C.B. Do, S. Batzoglou, Training conditional random fields for maximum labelwise accuracy,...
  • A. Gunawardana, M. Mahajan, A. Acero, J.C. Platt, Hidden conditional random fields for phone classification, in:...
  • R. Klinger et al., Identifying gene specific variations in biomedical text, Journal of Bioinformatics and Computational Biology (2007).
  • T. Kudo, CRF++: yet another CRF toolkit, 2007....
  • J.W. Kuo et al., An empirical study of word error minimization approaches for Mandarin large vocabulary speech recognition, International Journal of Computational Linguistics and Chinese Language Processing (2006).
  • J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling...