A derivation of minimum classification error from the theoretical classification risk using Parzen estimation
Introduction
Minimum classification error (MCE) is a member of a broad family of approaches to pattern classifier design known as generalized probabilistic descent (GPD) (Katagiri et al., 1990; Katagiri et al., 1991a; Katagiri et al., 1998). The GPD family uses general discriminant functions to formulate different approaches to pattern classifier design, including MCE, discriminative feature extraction (DFE) (Biem et al., 2001) and minimum spotting error (MSPE) (Komori and Katagiri, 1993). GPD was proposed around the same time as a number of similar, though on the whole less general, approaches to discriminative training (Applebaum and Hanson, 1989; Franco and Serralheiro, 1990; Gish, 1992; Hampshire, 1993; Ljolje et al., 1990).
In essence, the MCE loss function is a smooth approximation of the recognition error rate, suitable for use in gradient-based optimization (Juang and Katagiri, 1992a; McDermott, 1997). Use of the MCE criterion function in the design of classification systems is directly aimed at minimizing classification error, rather than at learning the true data probability distributions, the target of maximum likelihood estimation (MLE) via Baum-Welch or Viterbi training.
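As a concrete illustration (using notation that is standard in the MCE literature rather than symbols reproduced from this article): for a training token x of class k, let d_k(x; Λ) denote a misclassification measure that is positive when the parameters Λ misrecognize x and negative otherwise. The empirical 0–1 error rate over N training tokens,

\[
\bar{L}_{0/1}(\Lambda) \;=\; \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}\bigl(d_{k_n}(x_n;\Lambda) > 0\bigr),
\]

is piecewise constant in Λ and has zero gradient almost everywhere. MCE replaces the indicator with a smooth, monotonically increasing surrogate, typically the sigmoid

\[
\ell(d) \;=\; \frac{1}{1 + e^{-\alpha d}}, \qquad \alpha > 0,
\]

which approaches the 0–1 step as α grows and yields an overall cost that gradient-based (GPD) updates can optimize.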
Many studies have confirmed the effectiveness of MCE for speech recognition (e.g., Chou et al., 1993; McDermott et al., 2000). MCE is particularly effective compared to MLE when the number of parameters is small (McDermott and Katagiri, 1994). On the other hand, MCE training is significantly more time-consuming than MLE (whether via Viterbi or Baum-Welch training), since it involves a recognition pass for each training utterance. From a practical point of view, however, smaller systems are clearly desirable, even if they require a longer design time.
It should be noted that another popular approach to discriminative training is maximum mutual information (MMI) (Bahl et al., 1986; Brown, 1987; Nadas et al., 1988; Gopalakrishnan et al., 1988; Gopalakrishnan et al., 1991; Normandin, 1991). MMI too has yielded improvements in recognition accuracy for many tasks (e.g., Valtchev et al., 1996; Woodland and Povey, 2002). The main conceptual difference between MMI and MCE is, as their names suggest, that MMI is focused on maximizing mutual information, while MCE is focused on minimizing classification error. The former does not, in general, imply the latter (Gopalakrishnan et al., 1988). In many practical situations, however, the difference between these two approaches may be slight. Several studies provide detailed discussion of the similarities between MMI and MCE (McDermott, 1997; Katagiri et al., 1998).
The specific issue addressed in this article is the meaning of the smoothness of the MCE loss function in relation to the overall goal of minimizing classification error. So far, the motivation for using a smooth loss function has been to (1) enable the use of gradient-based optimization techniques, and (2) enhance generalization to unseen data. The convergence of the MCE criterion to the theoretical classification risk as the number of training tokens increases and the loss is made steeper has been discussed in previous work (Juang and Katagiri, 1992a). Nonetheless, some have viewed with skepticism the fact that the MCE loss function is an approximation of the true, binary 0–1 classification error, and not the true error itself.
The aim of this article is to present a new theoretical derivation of the MCE criterion that clarifies the nature of the smoothness of the MCE loss function, as well as the relationship between minimization of an overall MCE loss summed over a finite set of training data and minimization of the theoretical classification risk measured over the continuous probability densities underlying the classification problem. We will show that the continuous, 0–1 MCE loss function can be derived from an estimate of the theoretical classification risk, using Parzen estimation of the density of a suitably defined variable, the misclassification measure. In this analysis, the specific kernel type used for Parzen estimation leads to a specific type of MCE loss function, and vice versa; the width of the Parzen kernel directly corresponds to the steepness of the MCE loss function, and vice versa. Minimization of the MCE loss function is seen to be equivalent to the minimization of a Parzen window based estimate of the theoretical classification risk. The well-known convergence properties of Parzen estimates to the true densities, as the training set increases and the kernel width is narrowed, can now be applied directly to the MCE framework. Importantly, for the context of speech recognition, this analysis applies both to single pattern vectors as well as to variable-length patterns where each token consists of a sequence of pattern vectors, e.g., speech-derived feature vectors.
Though the derivation presented here does not have direct practical consequences in terms of actual MCE use (since it arrives at expressions of risk that are identical to previous MCE expressions of overall loss), it significantly broadens the theoretical understanding of MCE and strengthens the foundations of the MCE framework. The Parzen derivation presented here shows that the smooth MCE loss function is properly seen not as an ad hoc approximation of the true loss, but rather as the direct consequence of using a well-understood type of smoothing, Parzen estimation, to estimate the theoretical classification risk. The analysis establishes more clearly than before the link between the MCE empirical cost, measured on finite training data, and the theoretical classification risk. This explicitly formalizes the essential point that the goal is not to minimize the classification error on the training data, but on all data.
Section snippets
The minimum classification error framework
This section gives an overview of the MCE framework, to help ground the new theoretical derivation presented in Section 3. The reader already familiar with MCE may want to skip directly to that section.
The MCE framework has been described in several publications (Juang and Katagiri, 1992a; Katagiri et al., 1998; McDermott, 1997). For each training token, MCE uses a three-step definition, mapping a training pattern token x and the system parameters to a smoothed 0–1 loss, as sketched below.
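A minimal sketch of the three steps, in the notation most common in the MCE literature (the specific functional forms below, such as the softmax-style competitor term, are the usual choices and are given here only for illustration):

1. Class discriminant functions g_j(x; Λ), j = 1, ..., M, e.g. the log likelihood of the class-j model.

2. A scalar misclassification measure for a token of class k,
\[
d_k(x;\Lambda) \;=\; -g_k(x;\Lambda) \;+\; \log\left[\frac{1}{M-1}\sum_{j \neq k} e^{\eta\, g_j(x;\Lambda)}\right]^{1/\eta},
\]
which is positive for a misrecognition and tends to -g_k + max_{j≠k} g_j as η → ∞.

3. A smooth 0–1 loss applied to the misclassification measure, typically the sigmoid
\[
\ell_k(x;\Lambda) \;=\; \frac{1}{1 + e^{-\alpha\, d_k(x;\Lambda) + \beta}}.
\]

The overall empirical cost is the average of ℓ_k over the training tokens, and its gradient with respect to Λ drives the GPD parameter updates.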
A novel analysis of the smoothness of the MCE loss function
The smoothness of the 0–1 MCE loss function has two important roles. First, it enables the use of gradient-based optimization techniques. Second, it has a strong impact on generalization, as discussed in previous work (Juang and Katagiri, 1992b; McDermott and Katagiri, 1994). Here we clarify the nature and meaning of the MCE loss function. In particular, we show that the smoothness of the MCE loss function can be viewed as the direct consequence of using a well-understood type of smoothing, Parzen estimation, to estimate the theoretical classification risk.
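In outline, and again using standard MCE notation rather than the article's own equation numbering, the argument runs as follows. The theoretical classification risk can be written class by class as

\[
L(\Lambda) \;=\; \sum_{k=1}^{M} P(C_k) \int \mathbb{1}\bigl(d_k(x;\Lambda) > 0\bigr)\, p(x \mid C_k)\, dx
\;=\; \sum_{k=1}^{M} P(C_k) \int_{0}^{\infty} p_k(D;\Lambda)\, dD,
\]

where p_k(D; Λ) is the density of the scalar misclassification measure D = d_k(x; Λ) induced by p(x | C_k). Replacing p_k with a Parzen estimate built from the N_k training tokens of class k, using kernel K and width h,

\[
\hat{p}_k(D;\Lambda) \;=\; \frac{1}{N_k h} \sum_{n:\, x_n \in C_k} K\!\left(\frac{D - d_k(x_n;\Lambda)}{h}\right),
\]

and integrating over the error region D > 0 gives, for a symmetric kernel with cumulative distribution function F_K,

\[
\hat{L}(\Lambda) \;=\; \sum_{k=1}^{M} \frac{P(C_k)}{N_k} \sum_{n:\, x_n \in C_k} F_K\!\left(\frac{d_k(x_n;\Lambda)}{h}\right).
\]

Each term F_K(d/h) is a smooth, monotonically increasing function of the misclassification measure rising from 0 to 1, i.e., an MCE-style loss; the logistic-density kernel in particular yields the familiar sigmoid loss with steepness α = 1/h, so narrowing the kernel is the same operation as steepening the loss.

This identity is easy to check numerically. The short Python sketch below (the toy data and all variable names are illustrative assumptions, not code from the article) integrates a logistic-kernel Parzen density estimate of the misclassification measure over the error region and compares the result with the average sigmoid loss at steepness α = 1/h:

import numpy as np

rng = np.random.default_rng(0)
d = rng.normal(loc=-1.0, scale=2.0, size=200)   # toy misclassification measures d_k(x_n; Lambda)
h = 0.5                                          # Parzen kernel width

def logistic_kernel(u):
    # logistic probability density, used here as the Parzen kernel
    return np.exp(-u) / (1.0 + np.exp(-u)) ** 2

# Estimate 1: integrate the Parzen density estimate of D over the error region D > 0.
grid = np.linspace(0.0, 40.0, 20001)
density = np.mean(logistic_kernel((grid[None, :] - d[:, None]) / h) / h, axis=0)
dx = grid[1] - grid[0]
risk_parzen = np.sum(0.5 * (density[:-1] + density[1:])) * dx   # trapezoidal rule

# Estimate 2: average the sigmoid MCE loss with steepness alpha = 1/h over the tokens.
risk_mce = np.mean(1.0 / (1.0 + np.exp(-d / h)))

print(risk_parzen, risk_mce)   # the two values agree to several decimal places

The two numbers coincide up to the numerical integration error, illustrating that minimizing the summed sigmoid MCE loss is the same as minimizing this Parzen-based estimate of the risk.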
Summary
This article presented a new theoretical analysis showing that the MCE loss function can be derived from Parzen window-based estimation of the theoretical classification risk. The width of the Parzen kernel used is inversely related to the steepness of the MCE loss function. The analysis is not restricted to a particular Parzen kernel, but can be used to derive an MCE loss function for any of the wide variety of legitimate Parzen kernels. The well-known convergence properties of Parzen estimates to the true underlying densities, as the amount of training data increases and the kernel width is narrowed, therefore carry over directly to the MCE framework.
Acknowledgements
The ideas presented here have their seeds in several stimulating discussions with B.-H. Juang that took place in the early 1990s. We are grateful to him for his helpful advice and thought-provoking comments.
References (29)
- McDermott, E., Katagiri, S., 1994. Prototype based discriminative training for various speech units. Computer Speech and Language.
- Woodland, P.C., Povey, D., 2002. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language.
- Applebaum, T.H., Hanson, B.A., 1989. Enhancing the discrimination of speaker independent hidden Markov models with corrective training. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Bahl, L.R., et al., 1986. Maximum mutual information estimation of hidden Markov parameters for speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Biem, A., et al., 2001. An application of discriminative feature extraction to filter-bank-based speech recognition. IEEE Transactions on Speech and Audio Processing.
- Brown, P.F., 1987. The acoustic-modeling problem in automatic speech recognition. PhD Thesis, Department of Computer...
- Chou, W., et al., 1992. Segmental GPD training of HMM based speech recognizer. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Chou, W., et al., 1993. Minimum error rate training based on N-best string models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis.
- Fahlman, S.E., 1988. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162,...
- Franco, H., Serralheiro, A., 1990. Training HMMs using a minimum error approach. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Gish, H., 1992. A minimum classification error, maximum likelihood, neural network. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Gopalakrishnan, P.S., et al., 1988. Decoder selection based on cross-entropies. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Gopalakrishnan, P.S., et al., 1991. An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory.