
Calibrating AdaBoost for phoneme classification


Abstract

Phoneme classification is a sub-task of automatic speech recognition (ASR) that is essential for achieving good recognition accuracy. Unlike most classification tasks, however, it requires not only finding the correct class but also providing good posterior scores. Partly for this reason, Gaussian Mixture Models were traditionally used for this task and Artificial Neural Networks (ANNs) are used more recently, while other common machine learning methods such as Support Vector Machines and AdaBoost.MH are applied only rarely. In a previous study, we showed that AdaBoost.MH can match ANNs in terms of classification accuracy, but lags behind them when its output is used in the speech recognition process. This is partly due to the imprecise posterior scores that AdaBoost.MH produces, a well-known weakness of the method. To improve the quality of the posterior scores, it is common to perform some kind of posterior calibration. In this study, we test several posterior calibration techniques in order to improve the overall performance of AdaBoost.MH. We found that posterior calibration is an effective way to improve ASR accuracy, especially when the speech recognition process is integrated into the calibration workflow.
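
To make the notion of posterior calibration concrete, the sketch below fits a simple sigmoid (Platt-style) mapping from raw classifier scores to probability-like values. This is only an illustration of the general idea, not the authors' exact procedure; the names fit_sigmoid, margins, and targets are ours, and the numbers are made up.

    import numpy as np

    def fit_sigmoid(scores, labels, lr=0.1, n_iter=5000):
        # Fit p(y=1 | s) = sigmoid(a*s + b) by gradient descent on the
        # negative log-likelihood (i.e., logistic regression on the raw score).
        a, b = 1.0, 0.0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
            a -= lr * np.mean((p - labels) * scores)
            b -= lr * np.mean(p - labels)
        return a, b

    # Hypothetical raw margins for one class (one-vs-rest) and 0/1 targets.
    margins = np.array([-2.0, -0.5, 0.1, 1.5, 3.0])
    targets = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
    a, b = fit_sigmoid(margins, targets)
    calibrated = 1.0 / (1.0 + np.exp(-(a * margins + b)))  # scores mapped into (0, 1)

In a multi-class setting such as phoneme classification, such a mapping would typically be fitted per class on held-out data and the resulting scores renormalized.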


Notes

  1. The indicator function \({{\mathbb {I}}\left\{ {A}\right\} }\) is 1 if its argument A is true and 0 otherwise.

References

  • Ayer M, Brunk H, Ewing G, Reid W, Silverman E (1955) An empirical distribution function for sampling with incomplete information. Ann Math Stat 5(26):641–647


  • Bartlett PL, Traskin M (2007) AdaBoost is consistent. J Mach Learn Res 8:2347–2368


  • Benbouzid D, Busa-Fekete R, Casagrande N, Collin FD, Kégl B (2012) MultiBoost: a multi-purpose boosting package. J Mach Learn Res 13:549–553


  • Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford


  • Bodnár P, Nyúl LG (2015) Improved QR code localization using boosted cascade of weak classifiers. Acta Cybern 22(1):21–33


  • Busa-Fekete R, Kégl B (2009) Accelerating AdaBoost using UCB. In: KDDCup 2009 (JMLR W&CP), vol 7, pp 111–122, Paris, France

  • Busa-Fekete R, Kégl B, Éltetö T, Szarvas G (2013) Tune and mix: learning to rank using ensembles of calibrated multi-class classifiers. Mach Learn 93(2–3):261–292


  • Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2:265–292


  • Drish J (2001) Obtaining calibrated probability estimates from support vector machines. Technical report, University of California, San Diego, CA, USA

  • Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York


  • Ensor KB, Glynn PW (1997) Stochastic optimization via grid search. In: Lectures in Applied Mathematics, vol 33. American Mathematical Society, pp 89–100

  • Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28:337–374


  • Gosztolya G (2014) Is AdaBoost competitive for phoneme classification? In: Proceedings of CINTI (IEEE), pp 61–66, Budapest, Hungary

  • Gosztolya G (2015) On evaluation metrics for social signal detection. In: Proceedings of InterSpeech, pp 2504–2508, Dresden, Germany

  • Gosztolya G, Busa-Fekete R, Tóth L (2013) Detecting autism, emotions and social signals using AdaBoost. In: Proceedings of InterSpeech, pp 220–224, Lyon, France

  • Gosztolya G, Beke A, Neuberger T, Tóth L (2016) Laughter classification using deep rectifier neural networks with a minimal feature subset. Arch Acoust 41(4):669–682


  • Gupta R, Audhkhasi K, Lee S, Narayanan SS (2013) Speech paralinguistic event detection using probabilistic time-series smoothing and masking. In: Proceedings of InterSpeech, pp 173–177

  • Imseng D, Bourlard H, Magimai-Doss M, Dines J (2011) Language dependent universal phoneme posterior estimation for mixed language speech recognition. In: Proceedings of ICASSP, pp 5012–5015, Prague, Czech Republic

  • Jelinek F (1997) Statistical methods for speech recognition. MIT Press, Cambridge


  • Kaya H, Karpov AA, Salah AA (2015) Fisher Vectors with cascaded normalization for paralinguistic analysis. In: Proceedings of InterSpeech, pp 909–913

  • Lamel L, Kassel R, Seneff S (1986) Speech database development: design and analysis of the acoustic-phonetic corpus. In: Proceedings of DARPA speech recognition workshop, pp 121–124

  • Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710


  • Manning C, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge


  • Mease D, Wyner A, Buja A (2007) Boosted classification trees and class probability/quantile estimation. J Mach Learn Res 8:409–439


  • Morgan N, Bourlard H (1995) An introduction to hybrid HMM/connectionist continuous speech recognition. Signal Process Mag 1025–1028, May 1995

  • Neuberger T, Beke A (2013) Automatic laughter detection in spontaneous speech using GMM–SVM method. In: Proceedings of TSD, pp 113–120

  • Niculescu-Mizil A, Caruana R (2005) Obtaining calibrated probabilities from boosting. In: Proceedings of 21st conference on uncertainty in artificial intelligence (UAI’05), pp 413–420

  • Platt J (2000) Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In: Smola A, Bartlett P, Schoelkopf B, Schuurmans D (eds) Advances in large margin classifiers. MIT Press, Cambridge, pp 61–74


  • Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice Hall, Englewood Cliffs


  • Robertson T, Wright F, Dykstra R (1988) Order restricted statistical inference. Wiley, New York


  • Schapire RE, Freund Y (2012) Boosting: foundations and algorithms. MIT Press, Cambridge


  • Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336


  • Schölkopf B, Platt JC, Shawe-Taylor JC, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471


  • Tóth L, Kocsor A, Csirik J (2005) On naive Bayes in speech recognition. Int J Appl Math Comput Sci 15(2):287–294


  • Tóth S, Sztahó D, Vicsi K (2012) Speech emotion perception by human and machine. In: Proceedings of COST action, pp 213–224, Patras, Greece

  • van Leeuwen DA, Martin AF, Przybocki MA, Bouten JS (2006) NIST and NFI-TNO evaluations of automatic speaker recognition. Comput Speech Lang 20(2–3):128–158


  • Waegeman W, Dembczynski K, Jachnik A, Cheng W, Hüllermeier E (2014) On the Bayes-optimality of f-measure maximizers. J Mach Learn Res 15(1):3333–3388


  • Wu T, Lin C, Weng R (2004) Probability estimates for multi-class classification by pairwise coupling. J Mach Learn Res 5:975–1005


  • Young S, Evermann G, Gales MJF, Hain T, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2006) The HTK book. Cambridge University, Cambridge


  • Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of ICML, pp 609–616


Author information


Corresponding author

Correspondence to Gábor Gosztolya.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by V. Loia.


Appendix A: Proof of Proposition 1

Proof

Assume a strong classifier \(\mathbf{f}^{(T)} (\mathbf{x})\). We seek a weak classifier \(\mathbf{h}(\cdot )\) such that \(J( \mathbf{f}^{(T)} + c \mathbf{h})\) is minimized. Using the linearity of expectation, for a fixed \(c \in \mathbb {R}_{+}\) one may consider the second-order expansion of \(J ( \mathbf{f}^{(T)} + c \mathbf{h})\) around \(\mathbf{h}( \mathbf{x} ) = (0, \dots , 0)\) as

$$\begin{aligned}&J ( \mathbf{f}^{(T)} + c \mathbf{h}) \\&\quad = \mathbb {E}\left[ \sum _{ \ell = 1 }^K I ( \mathbf{y}, \ell ) \exp (-\,y_{\ell } ( f_{\ell }^{(T)} (\mathbf{x}) + c h_{\ell } (\mathbf{x})))\right] \\&\quad = \sum _{ \ell = 1 }^K \mathbb {E}\left[ I ( \mathbf{y}, \ell ) \exp (-\,y_{\ell } ( f_{\ell }^{(T)} (\mathbf{x}) + c h_{\ell } (\mathbf{x})))\right] \\&\quad \approx \sum _{ \ell = 1 }^K \mathbb {E}\left[ I ( \mathbf{y}, \ell ) \exp (-\,y_{\ell } f_{\ell }^{(T)} (\mathbf{x}) ) \right. \\&\qquad \quad \left. ( 1 - c y_{\ell } h_{\ell }(\mathbf{x}) + c^2 y_{\ell }^2 h_{\ell }(\mathbf{x})^2/2 ) \right] \\&\quad = \sum _{ \ell = 1 }^K \mathbb {E}\left[ I ( \mathbf{y}, \ell ) \exp (-\,y_{\ell } f_{\ell }^{(T)} (\mathbf{x}) ) \right. \\&\quad \left. ( 1 - c y_{\ell } h_{\ell }(\mathbf{x}) + c^2/2 ) \right] \end{aligned}$$

where the last equality holds because \(h_{\ell } (\mathbf{x})\in \{-\,1, 1 \}\) and \(y_{\ell }\in \{-\,1, 1 \}\). This means that minimizing the multi-class exponential loss with respect to \(\mathbf{h}\) is equivalent to maximizing the weighted expectation

$$\begin{aligned} \mathbb {E}\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell )y_{\ell } h_{\ell } (\mathbf{x}) \right] \end{aligned}$$
(31)

where

$$\begin{aligned} w( \mathbf{x}, \mathbf{y}, \ell ) = I( \mathbf{y}, \ell ) \exp \left( -\,y_{\ell } f_{\ell }^{(T)} (\mathbf{x})\right) . \end{aligned}$$

Note that the weak learner maximizes the edge given in (14) computed on a finite dataset, which is an estimate of the weighted expectation.
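
As an illustration of these quantities on a finite sample, the following sketch computes the weights and the empirical edge for a candidate weak classifier. It follows our own conventions rather than the paper's code: \(I(\mathbf{y}, \ell)\) is treated as a generic per-(example, class) base weight defaulting to 1, and the weights are normalized to sum to one.

    import numpy as np

    def empirical_edge(F, Y, H, base_weight=None):
        # F: (n, K) strong-classifier scores f_ell(x_i)
        # Y: (n, K) multi-class labels coded in {-1, +1}
        # H: (n, K) votes of a candidate weak classifier, in {-1, +1}
        # base_weight: optional (n, K) array playing the role of I(y, ell);
        #              defaults to all ones here (an assumption of this sketch).
        if base_weight is None:
            base_weight = np.ones_like(F)
        W = base_weight * np.exp(-Y * F)   # w(x, y, ell)
        W = W / W.sum()                    # normalization (assumed convention)
        return float(np.sum(W * Y * H))    # empirical counterpart of (31)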

In the multi-class case, the label distribution for a fixed \(\mathbf{x}\) is a multinomial distribution with parameters \((p_1, \dots ,p_K ) \in \varDelta _{K}\), where \(\varDelta _{K}\) is the K-dimensional probability simplex. Note that \((p_1, \dots ,p_K )\) depends on \(\mathbf{x}\); with a slight abuse of notation, we do not indicate this dependence when \(\mathbf{x}\) is fixed and no confusion can arise, and write \((p_1 (\mathbf{x}), \dots ,p_K (\mathbf{x}))\) when the dependence matters.

Recall that the goal is to maximize the weighted expectation given in (31). As a next step we are going to compute the form of the optimal weak classifier. Let us define the vector \(\mathbf {e}_{\ell }\) as

$$\begin{aligned} e_{\ell ,\ell ^{\prime } } = {\left\{ \begin{array}{ll} 1, &{} \ell = \ell ^{\prime } \\ -\,1, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

Furthermore, assume that we are given a weak classifier \(\mathbf{h}(\cdot )\) such that \(h_{\widehat{\ell }}( \mathbf{x}) = 1\) for some \(\widehat{\ell }\) and \(h_{\ell }( \mathbf{x}) = -\,1\) otherwise. Then, for a fixed \(\mathbf{x}\) with label distribution \((p_1, \dots ,p_K )\), the expectation in (31), conditioned on \(\mathbf{x}\), can be written as

$$\begin{aligned} \mathbb {E}&\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell ) y_{\ell } h_{\ell } (\mathbf{x}) \vert \mathbf{x}\right] \nonumber \\&= \sum _{\ell = 1}^K p_{\ell } \left[ w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) h_{\ell } (\mathbf{x}) - \sum _{\ell ^{\prime } \ne \ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) h_{\ell ^{\prime }} (\mathbf{x}) \right] \nonumber \\&= p_{\widehat{\ell }} \left[ w(\mathbf{x}, \mathbf {e}_{\widehat{\ell }}, \widehat{\ell } ) + \sum _{\ell ^{\prime } \ne \widehat{\ell } } w(\mathbf{x}, \mathbf {e}_{\widehat{\ell }}, \ell ^{\prime } ) \right] \nonumber \\&\quad + \sum _{\ell \ne \widehat{\ell } } p_{\ell } \left[ -\, w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) - w(\mathbf{x}, \mathbf {e}_{\ell }, \widehat{\ell } ) + \sum _{\ell ^{\prime } \ne \ell , \widehat{\ell } } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) \right] \nonumber \\&= p_{\widehat{\ell }} \sum _{\ell =1 }^K w(\mathbf{x}, \mathbf {e}_{\widehat{\ell }}, \ell ) \nonumber \\&\quad + \sum _{\ell \ne \widehat{\ell } } p_{\ell } \left[ -\, 2w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) - 2w(\mathbf{x}, \mathbf {e}_{\ell }, \widehat{\ell } ) + \sum _{\ell ^{\prime } = 1 }^K w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) \right] \nonumber \\&= \sum _{\ell = 1 }^K \sum _{\ell ^{\prime } = 1 }^K p_{\ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) -2 \sum _{\ell \ne \widehat{\ell } } p_{\ell } \left[ w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) + w(\mathbf{x}, \mathbf {e}_{\ell }, \widehat{\ell } ) \right] \nonumber \\&= \sum _{\ell = 1 }^K \sum _{\ell ^{\prime } = 1 }^K p_{\ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) - 2 \sum _{\ell =1 }^K p_{\ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) \nonumber \\&\quad + 2 \left[ p_{\widehat{\ell }} w(\mathbf{x}, \mathbf {e}_{\widehat{\ell }}, \widehat{\ell } ) - \sum _{\ell \ne \widehat{\ell } } p_{\ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \widehat{\ell } ) \right] \end{aligned}$$
(32)

Since only the last term depends on \(\widehat{\ell }\), the optimal classifier can be written in the form

$$\begin{aligned}&\widehat{\ell }(\mathbf{x}) = \mathrm{arg\, max}_{1\le \ell \le K } \\&\quad \left[ p_{\ell }(\mathbf{x}) w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) - \sum _{\ell ^{\prime } \ne \ell } p_{\ell ^{\prime }}(\mathbf{x}) w(\mathbf{x}, \mathbf {e}_{\ell ^{\prime }}, \ell )\right] \\&= \mathrm{arg\, max}_{1\le \ell \le K } \left[ p_{\ell }(\mathbf{x}) w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) \right] , \end{aligned}$$

and hence, the optimal classifier \(\mathbf{h}( \mathbf{x})\) can be written in the form of

$$\begin{aligned} h_{\ell } (\mathbf{x}) = {\left\{ \begin{array}{ll} 1 ,&{} \text {if }~ \ell = \widehat{\ell }(\mathbf{x})\\ -\,1, &{} \text {otherwise} \end{array}\right. } . \end{aligned}$$
(33)
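
The selection rule behind (33) can be written down directly. The following sketch (with made-up numbers and our own naming) picks the class maximizing \(p_{\ell }(\mathbf{x}) w(\mathbf{x}, \mathbf {e}_{\ell }, \ell )\) and emits +1 on that class and -1 elsewhere:

    import numpy as np

    def optimal_vote(p, w_diag):
        # p:      (K,) class probabilities p_ell(x) for a fixed x
        # w_diag: (K,) weights w(x, e_ell, ell)
        ell_hat = int(np.argmax(p * w_diag))
        h = -np.ones_like(p)
        h[ell_hat] = 1.0
        return h                          # the vote vector of (33)

    p = np.array([0.2, 0.5, 0.3])
    w_diag = np.array([1.0, 0.4, 1.5])
    print(optimal_vote(p, w_diag))        # [-1. -1.  1.], since 0.3 * 1.5 is largest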

Considering (32) over the joint distribution of the labels and features, we get that a classifier of the form (33) maximizes the population-level edge defined as

$$\begin{aligned} \gamma ^* = \mathbb {E}\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell ) y_{\ell } h_{\ell } (\mathbf{x}) \right] , \end{aligned}$$

where the expectation is taken over the joint distribution.

As a next step, one can determine the optimal coefficient c of \(\mathbf{h}(\cdot )\) given in (33) by minimizing \(J ( \mathbf{f}^{(T)} + c \mathbf{h})\):

$$\begin{aligned} c&= \mathop {\text {argmin}}\limits _{0< c} \mathbb {E}\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell ) \exp (-\,c y_{\ell } h_{\ell } (\mathbf{x}) )\right] \\&= \mathop {\text {argmin}}\limits _{0 < c} \mathbb {E}\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell ) \left( \frac{1+ y_{\ell } h_{\ell } (\mathbf{x})}{2} \exp (-\,c) \right. \right. \\&\quad \left. \left. + \frac{1- y_{\ell } h_{\ell } (\mathbf{x})}{2} \exp (c) \right) \right] \end{aligned}$$

where the second equality holds because \(y_{\ell } h_{\ell } (\mathbf{x}) \in \{-\,1, 1 \}\). One can then compute analytically that the right-hand side is minimized by

$$\begin{aligned} c = \frac{1}{2}\log \frac{1+\gamma ^*}{1-\gamma ^*}, \end{aligned}$$

where \(\gamma ^*\) is the population-level edge. Note that AdaBoost.MH computes the coefficient of the weak classifier in a similar way, using the edge computed on a finite training set instead of the population-level edge. The rest of the proof, regarding the weight update, is analogous to that of Theorem 1 in Friedman et al. (2000). \(\square \)
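
As a quick numerical sanity check of the closing formula (a sketch with illustrative numbers, not tied to any experiment in the paper):

    import numpy as np

    def coefficient_from_edge(gamma):
        # c = 0.5 * log((1 + gamma) / (1 - gamma)), valid for gamma in (-1, 1)
        return 0.5 * np.log((1.0 + gamma) / (1.0 - gamma))

    print(coefficient_from_edge(0.2))   # ~0.2027: a weak classifier with a small
                                        # positive edge receives a small positive coefficient
    print(coefficient_from_edge(0.9))   # ~1.4722: a larger edge yields a larger coefficient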


Cite this article

Gosztolya, G., Busa-Fekete, R. Calibrating AdaBoost for phoneme classification. Soft Comput 23, 115–128 (2019). https://doi.org/10.1007/s00500-018-3577-z
