Abstract
Phoneme classification is a sub-task of automatic speech recognition (ASR) that is essential for achieving good recognition accuracy. Unlike most classification tasks, however, it requires not only finding the correct class but also providing reliable posterior scores. Partly for this reason, Gaussian Mixture Models were traditionally used for this task and Artificial Neural Networks (ANNs) are used today, while other common machine learning methods such as Support Vector Machines and AdaBoost.MH are applied only rarely. In a previous study, we showed that AdaBoost.MH can match ANNs in classification accuracy, but lags behind them when its output is used in the speech recognition process. This is partly due to the imprecise posterior scores that AdaBoost.MH produces, a well-known weakness of the method. To improve the quality of the posterior scores, it is common to perform some kind of posterior calibration. In this study, we test several posterior calibration techniques with the aim of improving the overall performance of AdaBoost.MH. We find that posterior calibration is an effective way to improve ASR accuracy, especially when the speech recognition process is integrated into the calibration workflow.
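As a hedged illustration of what such a calibration step might look like in practice, the sketch below fits the sigmoid of Platt (2000) to raw classifier scores on a held-out set; the function name, the optimizer choice, and the one-vs-rest usage suggested afterwards are our own assumptions, not the exact setup evaluated in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt_scaling(scores, labels):
    """Fit p(y=1 | s) = 1 / (1 + exp(a*s + b)) by maximum likelihood
    (Platt 2000). `scores` are raw classifier outputs, `labels` in {0, 1}."""
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * scores + b))
        eps = 1e-12  # guard against log(0)
        return -np.sum(labels * np.log(p + eps)
                       + (1 - labels) * np.log(1.0 - p + eps))
    a, b = minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead").x
    return lambda s: 1.0 / (1.0 + np.exp(a * s + b))

# Usage sketch: fit on a held-out development set, then calibrate test scores.
# calibrate = fit_platt_scaling(dev_scores, dev_labels)
# posteriors = calibrate(test_scores)
```

For a K-class task such as phoneme classification, one common recipe is to fit one such sigmoid per class in a one-vs-rest fashion and renormalize the K calibrated scores to sum to one; isotonic regression (Zadrozny and Elkan 2001) is a standard non-parametric alternative.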
Notes
The indicator function \({{\mathbb {I}}\left\{ {A}\right\} }\) is 1 if its argument A is true and 0 otherwise.
References
Ayer M, Brunk H, Ewing G, Reid W, Silverman E (1955) An empirical distribution function for sampling with incomplete information. Ann Math Stat 26(4):641–647
Bartlett PL, Traskin M (2007) AdaBoost is consistent. J Mach Learn Res 8:2347–2368
Benbouzid D, Busa-Fekete R, Casagrande N, Collin FD, Kégl B (2012) MultiBoost: a multi-purpose boosting package. J Mach Learn Res 13:549–553
Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford
Bodnár P, Nyúl LG (2015) Improved QR code localization using boosted cascade of weak classifiers. Acta Cybern 22(1):21–33
Busa-Fekete R, Kégl B (2009) Accelerating AdaBoost using UCB. In: KDDCup 2009 (JMLR W&CP), vol 7, pp 111–122, Paris, France
Busa-Fekete R, Kégl B, Éltetö T, Szarvas G (2013) Tune and mix: learning to rank using ensembles of calibrated multi-class classifiers. Mach Learn 93(2–3):261–292
Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2:265–292
Drish J (2001) Obtaining calibrated probability estimates from support vector machines. Technical report, University of California, San Diego, CA, USA
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Ensor KB, Glynn PW (1997) Stochastic optimization via grid search. In: Lectures in Applied Mathematics, vol 33. American Mathematical Society, pp 89–100
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28:337–374
Gosztolya G (2014) Is AdaBoost competitive for phoneme classification? In: Proceedings of CINTI (IEEE), pp 61–66, Budapest, Hungary
Gosztolya G (2015) On evaluation metrics for social signal detection. In: Proceedings of InterSpeech, pp 2504–2508, Dresden, Germany
Gosztolya G, Busa-Fekete R, Tóth L (2013) Detecting autism, emotions and social signals using AdaBoost. In: Proceedings of InterSpeech, pp 220–224, Lyon, France
Gosztolya G, Beke A, Neuberger T, Tóth L (2016) Laughter classification using deep rectifier neural networks with a minimal feature subset. Arch Acoust 41(4):669–682
Gupta R, Audhkhasi K, Lee S, Narayanan SS (2013) Speech paralinguistic event detection using probabilistic time-series smoothing and masking. In: Proceedings of InterSpeech, pp 173–177
Imseng D, Bourlard H, Magimai-Doss M, Dines J (2011) Language dependent universal phoneme posterior estimation for mixed language speech recognition. In: Proceedings of ICASSP, pp 5012–5015, Prague, Czech Republic
Jelinek F (1997) Statistical methods for speech recognition. MIT Press, Cambridge
Kaya H, Karpov AA, Salah AA (2015) Fisher Vectors with cascaded normalization for paralinguistic analysis. In: Proceedings of InterSpeech, pp 909–913
Lamel L, Kassel R, Seneff S (1986) Speech database development: design and analysis of the acoustic-phonetic corpus. In: Proceedings of DARPA speech recognition workshop, pp 121–124
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710
Manning C, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Mease D, Wyner A, Buja A (2007) Boosted classification trees and class probability/quantile estimation. J Mach Learn Res 8:409–439
Morgan N, Bourlard H (1995) An introduction to hybrid HMM/connectionist continuous speech recognition. IEEE Signal Process Mag, pp 25–42, May 1995
Neuberger T, Beke A (2013) Automatic laughter detection in spontaneous speech using GMM–SVM method. In: Proceedings of TSD, pp 113–120
Niculescu-Mizil A, Caruana R (2005) Obtaining calibrated probabilities from boosting. In: Proceedings of the 21st conference on uncertainty in artificial intelligence (UAI'05), pp 413–420
Platt J (2000) Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In: Smola A, Bartlett P, Schoelkopf B, Schuurmans D (eds) Advances in large margin classifiers. MIT Press, Cambridge, pp 61–74
Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice Hall, Englewood Cliffs
Robertson T, Wright F, Dykstra R (1988) Order restricted statistical inference. Wiley, New York
Schapire RE, Freund Y (2012) Boosting: foundations and algorithms. MIT Press, Cambridge
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336
Schölkopf B, Platt JC, Shawe-Taylor JC, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471
Tóth L, Kocsor A, Csirik J (2005) On naive Bayes in speech recognition. Int J Appl Math Comput Sci 15(2):287–294
Tóth S, Sztahó D, Vicsi K (2012) Speech emotion perception by human and machine. In: Proceedings of COST action, pp 213–224, Patras, Greece
van Leeuwen DA, Martin AF, Przybocki MA, Bouten JS (2006) NIST and NFI-TNO evaluations of automatic speaker recognition. Comput Speech Lang 20(2–3):128–158
Waegeman W, Dembczynski K, Jachnik A, Cheng W, Hüllermeier E (2014) On the Bayes-optimality of F-measure maximizers. J Mach Learn Res 15(1):3333–3388
Wu T, Lin C, Weng R (2004) Probability estimates for multi-class classification by pairwise coupling. J Mach Learn Res 5:975–1005
Young S, Evermann G, Gales MJF, Hain T, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2006) The HTK book. Cambridge University Engineering Department, Cambridge
Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of ICML, pp 609–616
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Communicated by V. Loia.
A Proof of Proposition 1
Proof
Let us assume a strong classifier \(\mathbf{f}^{(T)}(\mathbf{x})\). Then, we seek a weak classifier \(\mathbf{h}(\cdot )\) such that \(J( \mathbf{f}^{(T)} + c \mathbf{h})\) is minimized. Using the linearity of expectation, one may consider the second-order expansion of \(J ( \mathbf{f}^{(T)} + c \mathbf{h})\) for fixed \(c \in \mathbb {R}_{+}\) around \(\mathbf{h}( \mathbf{x} ) = (0, \dots , 0)\) as

$$ J\big ( \mathbf{f}^{(T)} + c \mathbf{h}\big ) = \mathbb {E}\left[ \sum _{\ell =1}^{K} e^{-y_{\ell } f^{(T)}_{\ell }(\mathbf{x})}\, e^{-c\, y_{\ell } h_{\ell }(\mathbf{x})} \right] \approx \mathbb {E}\left[ \sum _{\ell =1}^{K} e^{-y_{\ell } f^{(T)}_{\ell }(\mathbf{x})} \left( 1 - c\, y_{\ell } h_{\ell }(\mathbf{x}) + \frac{c^{2} \big ( y_{\ell } h_{\ell }(\mathbf{x}) \big )^{2}}{2} \right) \right] = \mathbb {E}\left[ \sum _{\ell =1}^{K} e^{-y_{\ell } f^{(T)}_{\ell }(\mathbf{x})} \left( 1 - c\, y_{\ell } h_{\ell }(\mathbf{x}) + \frac{c^{2}}{2} \right) \right] , $$

where the last equation holds because \(h_{\ell } (\mathbf{x})\in \{-1, 1 \}\) and \(y_{\ell }\in \{-1, 1 \}\). This means that minimizing the multi-class exponential loss with respect to \(\mathbf{h}\) is equivalent to maximizing the weighted expectation

$$ \frac{\mathbb {E}\left[ \sum _{\ell =1}^{K} w_{\ell }(\mathbf{x}, \mathbf{y})\, y_{\ell } h_{\ell }(\mathbf{x}) \right] }{\mathbb {E}\left[ \sum _{\ell =1}^{K} w_{\ell }(\mathbf{x}, \mathbf{y}) \right] }, \qquad (31) $$

where

$$ w_{\ell }(\mathbf{x}, \mathbf{y}) = e^{-y_{\ell } f^{(T)}_{\ell }(\mathbf{x})}. $$
Note that the weak learner maximizes the edge given in (14), computed on a finite dataset, which is an estimate of the weighted expectation (31).
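As a concrete illustration of this finite-sample estimate, here is a minimal sketch of computing the empirical multi-class edge of a candidate weak classifier; the array layout and variable names are our own assumptions.

```python
import numpy as np

def empirical_edge(H, Y, W):
    """Empirical multi-class edge: sum_i sum_l w_{i,l} y_{i,l} h_l(x_i).
    H: (n, K) weak-classifier outputs in {-1, +1},
    Y: (n, K) one-vs-all labels in {-1, +1},
    W: (n, K) boosting weights, normalized to sum to 1."""
    return float(np.sum(W * Y * H))

# Toy check: 3 instances, 2 classes, uniform weights.
Y = np.array([[1, -1], [-1, 1], [1, -1]])
H = np.array([[1, -1], [1, -1], [1, -1]])  # always votes for class 0
W = np.full((3, 2), 1.0 / 6.0)
print(empirical_edge(H, Y, W))  # 1/3: 4 of the 6 (i, l) votes are correct
```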
For a fixed \(\mathbf{x}\), the label distribution in the multi-class case is a multinomial distribution with parameters \((p_1, \dots ,p_K ) \in \varDelta _{K}\), where \(\varDelta _{K}\) is the K-dimensional probability simplex. Note that \((p_1, \dots ,p_K )\) depends on \(\mathbf{x}\); with a slight abuse of notation, we shall not indicate this dependence when \(\mathbf{x}\) is fixed and no confusion can arise, and we shall write \((p_1 (\mathbf{x}), \dots ,p_K (\mathbf{x}))\) when the dependence matters.
Recall that the goal is to maximize the weighted expectation given in (31). As a next step, we compute the form of the optimal weak classifier. Let us define the vector \(\mathbf {e}_{\ell }\) as the \(\ell \)-th unit vector,

$$ \mathbf {e}_{\ell } = (0, \dots , 0, \underbrace{1}_{\ell }, 0, \dots , 0). $$

Furthermore, assume that we are given a weak classifier \(\mathbf{h}(\cdot )\) such that \(h_{\widehat{\ell }}( \mathbf{x}) = 1\) for some \(\widehat{\ell }\) and \(h_{\ell }( \mathbf{x}) = 0\) otherwise. Then, the numerator of (31) can be written for a fixed \(\mathbf{x}\) with a label distribution \((p_1, \dots ,p_K )\) as

$$ \mathbb {E}\left[ \sum _{\ell =1}^{K} w_{\ell }(\mathbf{x}, \mathbf{y})\, y_{\ell } h_{\ell }(\mathbf{x}) \,\Big |\, \mathbf{x}\right] = \mathbb {E}\left[ w_{\widehat{\ell }}(\mathbf{x}, \mathbf{y})\, y_{\widehat{\ell }} \,\Big |\, \mathbf{x}\right] \propto p_{\widehat{\ell }} \cdot 1 + (1 - p_{\widehat{\ell }}) \cdot (-1) = -1 + 2 p_{\widehat{\ell }}. $$

Since only the last term depends on \(\widehat{\ell }\), the optimal classifier can be written in the form

$$ \widehat{\ell }(\mathbf{x}) = \mathop {\hbox {arg max}}_{\ell }\; p_{\ell }(\mathbf{x}), \qquad (32) $$

and hence, the optimal classifier \(\mathbf{h}( \mathbf{x})\) can be written in the form of

$$ \mathbf{h}( \mathbf{x}) = \mathbf {e}_{\widehat{\ell }(\mathbf{x})}. \qquad (33) $$
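A quick numeric illustration of (32) and (33), with toy values of our own choosing: for a fixed \(\mathbf{x}\), the per-class quantity \(-1 + 2p_{\ell }\) is maximized by exactly the class that maximizes \(p_{\ell }\) itself.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])       # toy label distribution for a fixed x
per_class = -1.0 + 2.0 * p           # the quantity maximized in (32)
ell_hat = int(np.argmax(per_class))  # same argmax as p itself
print(per_class, ell_hat)            # [-0.6  0.  -0.4] 1
```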
Considering (32) over the joint distribution of the labels and features, we get that the classifier of the form (33) maximizes the population-level edge defined as

$$ \gamma ^{*} = \frac{\mathbb {E}\left[ \sum _{\ell =1}^{K} w_{\ell }(\mathbf{x}, \mathbf{y})\, y_{\ell } h_{\ell }(\mathbf{x}) \right] }{\mathbb {E}\left[ \sum _{\ell =1}^{K} w_{\ell }(\mathbf{x}, \mathbf{y}) \right] }, $$

where the expectation is taken over the joint distribution.
As a next step, one can determine the optimal coefficient c of \(\mathbf{h}(\cdot )\) given in (33) by optimizing \(J ( \mathbf{f}^{(T)} + c \mathbf{h})\):

$$ J\big ( \mathbf{f}^{(T)} + c \mathbf{h}\big ) = \mathbb {E}\left[ \sum _{\ell =1}^{K} e^{-y_{\ell } f^{(T)}_{\ell }(\mathbf{x})}\, e^{-c\, y_{\ell } h_{\ell }(\mathbf{x})} \right] . $$

Then, one can compute analytically that the right-hand side is minimized by

$$ c = \frac{1}{2} \ln \frac{1 + \gamma ^{*}}{1 - \gamma ^{*}}, $$

where \(\gamma ^*\) is the population-level edge. Note that AdaBoost.MH computes the coefficient of the weak classifier in the same way, using the edge measured on the finite training data in place of the population-level edge. The rest of the proof regarding the weight update is analogous to that of Theorem 1 in Friedman et al. (2000). \(\square \)
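The closed form of the coefficient above can be verified numerically. The following sketch (our own check, not part of the original proof) minimizes the weighted exponential loss \(e^{-c}(1+\gamma )/2 + e^{c}(1-\gamma )/2\) obtained by splitting the expectation according to whether the vote \(y_{\ell } h_{\ell }(\mathbf{x})\) is correct, and compares the minimizer with \(\frac{1}{2} \ln \frac{1+\gamma }{1-\gamma }\); the edge value \(\gamma = 0.3\) is an arbitrary assumption.

```python
import numpy as np
from scipy.optimize import minimize_scalar

gamma = 0.3  # assumed population-level edge; any value in (0, 1) works

def loss(c):
    # weighted exponential loss after splitting by correct/incorrect votes
    return 0.5 * (1 + gamma) * np.exp(-c) + 0.5 * (1 - gamma) * np.exp(c)

numeric = minimize_scalar(loss, bounds=(0.0, 5.0), method="bounded").x
closed_form = 0.5 * np.log((1 + gamma) / (1 - gamma))
print(numeric, closed_form)  # both approximately 0.3095
```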