A derivation of minimum classification error from the theoretical classification risk using Parzen estimation
Introduction
Minimum classification error (MCE) is a member of a broad family of approaches to pattern classifier design known as generalized probabilistic descent (GPD) (Katagiri et al., 1990; Katagiri et al., 1991a; Katagiri et al., 1998). The GPD family uses general discriminant functions to formulate different approaches to pattern classifier design, including MCE, discriminative feature extraction (DFE) (Biem et al., 2001) and minimum spotting error (MSPE) (Komori and Katagiri, 1993). GPD was proposed around the same time as a number of similar, though on the whole less general, approaches to discriminative training (Applebaum and Hanson, 1989; Franco and Serralheiro, 1990; Gish, 1992; Hampshire, 1993; Ljolje et al., 1990).
In essence, the MCE loss function is a smooth approximation of the recognition error rate, suitable for use in gradient-based optimization (Juang and Katagiri, 1992a; McDermott, 1997). Use of the MCE criterion function in the design of classification systems is directly aimed at minimizing classification error, rather than at learning the true data probability distributions, the target of maximum likelihood estimation (MLE) via Baum-Welch or Viterbi training.
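As a concrete illustration (using notation that is standard in the MCE literature rather than symbols reproduced from this article): for a training token x of class k, let d_k(x; Λ) denote a misclassification measure that is positive when the parameters Λ misrecognize x and negative otherwise. The empirical 0–1 error rate over N training tokens,

\[
\bar{L}_{0/1}(\Lambda) \;=\; \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}\bigl(d_{k_n}(x_n;\Lambda) > 0\bigr),
\]

is piecewise constant in Λ and has zero gradient almost everywhere. MCE replaces the indicator with a smooth, monotonically increasing surrogate, typically the sigmoid

\[
\ell(d) \;=\; \frac{1}{1 + e^{-\alpha d}}, \qquad \alpha > 0,
\]

which approaches the 0–1 step as α grows and yields an overall cost that gradient-based (GPD) updates can optimize.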
Many studies have confirmed the effectiveness of MCE for speech recognition (e.g., Chou et al., 1993; McDermott et al., 2000). MCE is particularly effective compared to MLE when the number of parameters is small (McDermott and Katagiri, 1994). On the other hand, MCE training is significantly more time-consuming than MLE (whether via Viterbi or Baum-Welch training), since it involves a recognition pass for each training utterance. From a practical point of view, however, smaller systems are clearly desirable, even if they require a longer design time.
It should be noted that another popular approach to discriminative training is maximum mutual information (MMI) (Bahl et al., 1986; Brown, 1987; Nadas et al., 1988; Gopalakrishnan et al., 1988; Gopalakrishnan et al., 1991; Normandin, 1991). MMI too has yielded improvements in recognition accuracy for many tasks (e.g., Valtchev et al., 1996; Woodland and Povey, 2002). The main conceptual difference between MMI and MCE is, as their names suggest, that MMI is focused on maximizing mutual information, while MCE is focused on minimizing classification error. The former does not, in general, imply the latter (Gopalakrishnan et al., 1988). In many practical situations, however, the difference between these two approaches may be slight. Several studies provide detailed discussion of the similarities between MMI and MCE (McDermott, 1997; Katagiri et al., 1998).
The specific issue addressed in this article is the meaning of the smoothness of the MCE loss function in relation to the overall goal of minimizing classification error. So far, the motivation for using a smooth loss function has been to (1) enable the use of gradient-based optimization techniques, and (2) enhance generalization to unseen data. The convergence of the MCE criterion to the theoretical classification risk as the number of training tokens increases and the loss is made steeper has been discussed in previous work (Juang and Katagiri, 1992a). Nonetheless, some have viewed with skepticism the fact that the MCE loss function is an approximation of the true, binary 0–1 classification error, and not the true error itself.
The aim of this article is to present a new theoretical derivation of the MCE criterion that clarifies the nature of the smoothness of the MCE loss function, as well as the relationship between minimization of an overall MCE loss summed over a finite set of training data and minimization of the theoretical classification risk measured over the continuous probability densities underlying the classification problem. We will show that the continuous, 0–1 MCE loss function can be derived from an estimate of the theoretical classification risk, using Parzen estimation of the density of a suitably defined variable, the misclassification measure. In this analysis, the specific kernel type used for Parzen estimation leads to a specific type of MCE loss function, and vice versa; the width of the Parzen kernel directly corresponds to the steepness of the MCE loss function, and vice versa. Minimization of the MCE loss function is seen to be equivalent to the minimization of a Parzen window based estimate of the theoretical classification risk. The well-known convergence properties of Parzen estimates to the true densities, as the training set increases and the kernel width is narrowed, can now be applied directly to the MCE framework. Importantly, for the context of speech recognition, this analysis applies both to single pattern vectors as well as to variable-length patterns where each token consists of a sequence of pattern vectors, e.g., speech-derived feature vectors.
Though the derivation presented here does not have direct practical consequences in terms of actual MCE use (since it arrives at expressions of risk that are identical to previous MCE expressions of overall loss), it significantly broadens the theoretical understanding of MCE and strengthens the foundations of the MCE framework. The Parzen derivation presented here shows that the smooth MCE loss function is properly seen not as an ad hoc approximation of the true loss, but rather as the direct consequence of using a well-understood type of smoothing, Parzen estimation, to estimate the theoretical classification risk. The analysis establishes more clearly than before the link between the MCE empirical cost, measured on finite training data, and the theoretical classification risk. This explicitly formalizes the essential point that the goal is not to minimize the classification error on the training data, but on all data.
Section snippets
The minimum classification error framework
This section gives an overview of the MCE framework, to help ground the new theoretical derivation presented in Section 3. The reader already familiar with MCE may want to skip directly to that section.
The MCE framework has been described in several publications (Juang and Katagiri, 1992a; Katagiri et al., 1998; McDermott, 1997). For each training token, MCE uses a three-step definition, mapping a training pattern token x and the system parameters to a smoothed 0–1 loss, as sketched below.
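A minimal sketch of the three steps, in the notation most common in the MCE literature (the specific functional forms below, such as the softmax-style competitor term, are the usual choices and are given here only for illustration):

1. Class discriminant functions g_j(x; Λ), j = 1, ..., M, e.g. the log likelihood of the class-j model.

2. A scalar misclassification measure for a token of class k,
\[
d_k(x;\Lambda) \;=\; -g_k(x;\Lambda) \;+\; \log\left[\frac{1}{M-1}\sum_{j \neq k} e^{\eta\, g_j(x;\Lambda)}\right]^{1/\eta},
\]
which is positive for a misrecognition and tends to -g_k + max_{j≠k} g_j as η → ∞.

3. A smooth 0–1 loss applied to the misclassification measure, typically the sigmoid
\[
\ell_k(x;\Lambda) \;=\; \frac{1}{1 + e^{-\alpha\, d_k(x;\Lambda) + \beta}}.
\]

The overall empirical cost is the average of ℓ_k over the training tokens, and its gradient with respect to Λ drives the GPD parameter updates.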
A novel analysis of the smoothness of the MCE loss function
The smoothness of the 0–1 MCE loss function has two important roles. First, it enables the use of gradient-based optimization techniques. Second, it has a strong impact on generalization, as discussed in previous work (Juang and Katagiri, 1992b; McDermott and Katagiri, 1994). Here we clarify the nature and meaning of the MCE loss function. In particular, we show that the smoothness of the MCE loss function can be viewed as the direct consequence of using a well-understood type of smoothing, Parzen estimation, to estimate the theoretical classification risk.
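In outline, and again using standard MCE notation rather than the article's own equation numbering, the argument runs as follows. The theoretical classification risk can be written class by class as

\[
L(\Lambda) \;=\; \sum_{k=1}^{M} P(C_k) \int \mathbb{1}\bigl(d_k(x;\Lambda) > 0\bigr)\, p(x \mid C_k)\, dx
\;=\; \sum_{k=1}^{M} P(C_k) \int_{0}^{\infty} p_k(D;\Lambda)\, dD,
\]

where p_k(D; Λ) is the density of the scalar misclassification measure D = d_k(x; Λ) induced by p(x | C_k). Replacing p_k with a Parzen estimate built from the N_k training tokens of class k, using kernel K and width h,

\[
\hat{p}_k(D;\Lambda) \;=\; \frac{1}{N_k h} \sum_{n:\, x_n \in C_k} K\!\left(\frac{D - d_k(x_n;\Lambda)}{h}\right),
\]

and integrating over the error region D > 0 gives, for a symmetric kernel with cumulative distribution function F_K,

\[
\hat{L}(\Lambda) \;=\; \sum_{k=1}^{M} \frac{P(C_k)}{N_k} \sum_{n:\, x_n \in C_k} F_K\!\left(\frac{d_k(x_n;\Lambda)}{h}\right).
\]

Each term F_K(d/h) is a smooth, monotonically increasing function of the misclassification measure rising from 0 to 1, i.e., an MCE-style loss; the logistic-density kernel in particular yields the familiar sigmoid loss with steepness α = 1/h, so narrowing the kernel is the same operation as steepening the loss.

This identity is easy to check numerically. The short Python sketch below (the toy data and all variable names are illustrative assumptions, not code from the article) integrates a logistic-kernel Parzen density estimate of the misclassification measure over the error region and compares the result with the average sigmoid loss at steepness α = 1/h:

import numpy as np

rng = np.random.default_rng(0)
d = rng.normal(loc=-1.0, scale=2.0, size=200)   # toy misclassification measures d_k(x_n; Lambda)
h = 0.5                                          # Parzen kernel width

def logistic_kernel(u):
    # logistic probability density, used here as the Parzen kernel
    return np.exp(-u) / (1.0 + np.exp(-u)) ** 2

# Estimate 1: integrate the Parzen density estimate of D over the error region D > 0.
grid = np.linspace(0.0, 40.0, 20001)
density = np.mean(logistic_kernel((grid[None, :] - d[:, None]) / h) / h, axis=0)
dx = grid[1] - grid[0]
risk_parzen = np.sum(0.5 * (density[:-1] + density[1:])) * dx   # trapezoidal rule

# Estimate 2: average the sigmoid MCE loss with steepness alpha = 1/h over the tokens.
risk_mce = np.mean(1.0 / (1.0 + np.exp(-d / h)))

print(risk_parzen, risk_mce)   # the two values agree to several decimal places

The two numbers coincide up to the numerical integration error, illustrating that minimizing the summed sigmoid MCE loss is the same as minimizing this Parzen-based estimate of the risk.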
Summary
This article presented a new theoretical analysis showing that the MCE loss function can be derived from Parzen window-based estimation of the theoretical classification risk. The width of the Parzen kernel used is inversely related to the steepness of the MCE loss function. The analysis is not restricted to a particular Parzen kernel, but can be used to derive an MCE loss function for any of the wide variety of legitimate Parzen kernels. The well-known convergence properties of Parzen estimates to the true underlying densities, as the amount of training data increases and the kernel width is narrowed, therefore carry over directly to the MCE framework.
Acknowledgements
The ideas presented here have their seeds in several stimulating discussions with B.-H. Juang that took place in the early 1990s. We are grateful to him for his helpful advice and thought-provoking comments.
References (29)
- McDermott, E., Katagiri, S., 1994. Prototype based discriminative training for various speech units. Computer Speech and Language.
- Woodland, P.C., Povey, D., 2002. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language.
- Applebaum, T.H., Hanson, B.A., 1989. Enhancing the discrimination of speaker independent hidden Markov models with corrective training. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Bahl, L.R., et al., 1986. Maximum mutual information estimation of hidden Markov parameters for speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Biem, A., et al., 2001. An application of discriminative feature extraction to filter-bank-based speech recognition. IEEE Transactions on Speech and Audio Processing.
- Brown, P.F., 1987. The acoustic-modeling problem in automatic speech recognition. PhD Thesis, Department of Computer...
- Chou, W., et al., 1992. Segmental GPD training of HMM based speech recognizer. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Chou, W., et al., 1993. Minimum error rate training based on N-best string models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis.
- Fahlman, S.E., 1988. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162,...
- Franco, H., Serralheiro, A., 1990. Training HMMs using a minimum error approach. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Gish, H., 1992. A minimum classification error, maximum likelihood, neural network. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Gopalakrishnan, P.S., et al., 1988. Decoder selection based on cross-entropies. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Gopalakrishnan, P.S., et al., 1991. An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory.