
Calibrating AdaBoost for phoneme classification


Abstract

Phoneme classification is a sub-task of automatic speech recognition (ASR) that is essential for achieving good recognition accuracy. Unlike most classification tasks, however, it requires not only finding the correct class but also providing good posterior scores. Partly for this reason, Gaussian Mixture Models were traditionally used for this task and Artificial Neural Networks (ANNs) are used more recently, while other common machine learning methods such as Support Vector Machines and AdaBoost.MH are applied only rarely. In a previous study, we showed that AdaBoost.MH can match ANNs in terms of classification accuracy, but lags behind them when its output is used in the speech recognition process. This is partly due to the imprecise posterior scores that AdaBoost.MH produces, a well-known weakness of the method. To improve the quality of the posterior scores, it is common to perform some kind of posterior calibration. In this study, we test several posterior calibration techniques in order to improve the overall performance of AdaBoost.MH. We found that posterior calibration is an effective way to improve ASR accuracy, especially when the speech recognition process is integrated into the calibration workflow.
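
To make the notion of posterior calibration concrete, the sketch below fits a simple sigmoid (Platt-style) mapping from raw classifier scores to probability-like values. This is only an illustration of the general idea, not the authors' exact procedure; the names fit_sigmoid, margins, and targets are ours, and the numbers are made up.

    import numpy as np

    def fit_sigmoid(scores, labels, lr=0.1, n_iter=5000):
        # Fit p(y=1 | s) = sigmoid(a*s + b) by gradient descent on the
        # negative log-likelihood (i.e., logistic regression on the raw score).
        a, b = 1.0, 0.0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
            a -= lr * np.mean((p - labels) * scores)
            b -= lr * np.mean(p - labels)
        return a, b

    # Hypothetical raw margins for one class (one-vs-rest) and 0/1 targets.
    margins = np.array([-2.0, -0.5, 0.1, 1.5, 3.0])
    targets = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
    a, b = fit_sigmoid(margins, targets)
    calibrated = 1.0 / (1.0 + np.exp(-(a * margins + b)))  # scores mapped into (0, 1)

In a multi-class setting such as phoneme classification, such a mapping would typically be fitted per class on held-out data and the resulting scores renormalized.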


Notes

  1. The indicator function \({{\mathbb {I}}\left\{ {A}\right\} }\) is 1 if its argument A is true and 0 otherwise.

References

  • Ayer M, Brunk H, Ewing G, Reid W, Silverman E (1955) An empirical distribution function for sampling with incomplete information. Ann Math Stat 5(26):641–647


  • Bartlett PL, Traskin M (2007) AdaBoost is consistent. J Mach Learn Res 8:2347–2368


  • Benbouzid D, Busa-Fekete R, Casagrande N, Collin FD, Kégl B (2012) MultiBoost: a multi-purpose boosting package. J Mach Learn Res 13:549–553


  • Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford


  • Bodnár P, Nyúl LG (2015) Improved QR code localization using boosted cascade of weak classifiers. Acta Cybern 22(1):21–33


  • Busa-Fekete R, Kégl B (2009) Accelerating AdaBoost using UCB. In: KDDCup 2009 (JMLR W&CP), vol 7, pp 111–122, Paris, France

  • Busa-Fekete R, Kégl B, Éltetö T, Szarvas G (2013) Tune and mix: learning to rank using ensembles of calibrated multi-class classifiers. Mach Learn 93(2–3):261–292


  • Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2:265–292


  • Drish J (2001) Obtaining calibrated probability estimates from support vector machines. Technical report, University of California, San Diego, CA, USA

  • Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York


  • Ensor KB, Glynn PW (1997) Stochastic optimization via grid search. In: Lectures in Applied Mathematics, vol 33. American Mathematical Society, pp 89–100

  • Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28:337–374


  • Gosztolya G (2014) Is AdaBoost competitive for phoneme classification? In: Proceedings of CINTI (IEEE), pp 61–66, Budapest, Hungary

  • Gosztolya G (2015) On evaluation metrics for social signal detection. In: Proceedings of InterSpeech, pp 2504–2508, Dresden, Germany

  • Gosztolya G, Busa-Fekete R, Tóth L (2013) Detecting autism, emotions and social signals using AdaBoost. In: Proceedings of InterSpeech, pp 220–224, Lyon, France

  • Gosztolya G, Beke A, Neuberger T, Tóth L (2016) Laughter classification using deep rectifier neural networks with a minimal feature subset. Arch Acoust 41(4):669–682


  • Gupta R, Audhkhasi K, Lee S, Narayanan SS (2013) Speech paralinguistic event detection using probabilistic time-series smoothing and masking. In: Proceedings of InterSpeech, pp 173–177

  • Imseng D, Bourlard H, Magimai-Doss M, Dines J (2011) Language dependent universal phoneme posterior estimation for mixed language speech recognition. In: Proceedings of ICASSP, pp 5012–5015, Prague, Czech Republic

  • Jelinek F (1997) Statistical methods for speech recognition. MIT Press, Cambridge


  • Kaya H, Karpov AA, Salah AA (2015) Fisher Vectors with cascaded normalization for paralinguistic analysis. In: Proceedings of InterSpeech, pp 909–913

  • Lamel L, Kassel R, Seneff S (1986) Speech database development: design and analysis of the acoustic-phonetic corpus. In: Proceedings of DARPA speech recognition workshop, pp 121–124

  • Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710


  • Manning C, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge


  • Mease D, Wyner A, Buja A (2007) Boosted classification trees and class probability/quantile estimation. J Mach Learn Res 8:409–439


  • Morgan N, Bourlard H (1995) An introduction to hybrid HMM/connectionist continuous speech recognition. Signal Process Mag 1025–1028, May 1995

  • Neuberger T, Beke A (2013) Automatic laughter detection in spontaneous speech using GMM–SVM method. In: Proceedings of TSD, pp 113–120

  • Niculescu-Mizil A, Caruana R (2005) Obtaining calibrated probabilities from boosting. In: Proceedings of 21st conference on uncertainty in artificial intelligence (UAI’05), pp 413–420

  • Platt J (2000) Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In: Smola A, Bartlett P, Schoelkopf B, Schuurmans D (eds) Advances in large margin classifiers. MIT Press, Cambridge, pp 61–74


  • Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice Hall, Englewood Cliffs


  • Robertson T, Wright F, Dykstra R (1988) Order restricted statistical inference. Wiley, New York


  • Schapire RE, Freund Y (2012) Boosting: foundations and algorithms. MIT Press, Cambridge


  • Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336


  • Schölkopf B, Platt JC, Shawe-Taylor JC, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471


  • Tóth L, Kocsor A, Csirik J (2005) On naive Bayes in speech recognition. Int J Appl Math Comput Sci 15(2):287–294


  • Tóth S, Sztahó D, Vicsi K (2012) Speech emotion perception by human and machine. In: Proceedings of COST action, pp 213–224, Patras, Greece

  • van Leeuwen DA, Martin AF, Przybocki MA, Bouten JS (2006) NIST and NFI-TNO evaluations of automatic speaker recognition. Comput Speech Lang 20(2–3):128–158


  • Waegeman W, Dembczynski K, Jachnik A, Cheng W, Hüllermeier E (2014) On the Bayes-optimality of f-measure maximizers. J Mach Learn Res 15(1):3333–3388


  • Wu T, Lin C, Weng R (2004) Probability estimates for multi-class classification by pairwise coupling. J Mach Learn Res 5:975–1005


  • Young S, Evermann G, Gales MJF, Hain T, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2006) The HTK book. Cambridge University, Cambridge


  • Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of ICML, pp 609–616


Author information


Corresponding author

Correspondence to Gábor Gosztolya.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by V. Loia.


Appendix A: Proof of Proposition 1

Proof

Assume a strong classifier \(\mathbf{f}^{(T)} (\mathbf{x})\). We seek a weak classifier \(\mathbf{h}(\cdot )\) such that \(J( \mathbf{f}^{(T)} + c \mathbf{h})\) is minimized. Using the linearity of expectation, for a fixed \(c \in \mathbb {R}_{+}\) one may consider the second-order expansion of \(J ( \mathbf{f}^{(T)} + c \mathbf{h})\) around \(\mathbf{h}( \mathbf{x} ) = (0, \dots , 0)\) as

$$\begin{aligned}&J ( \mathbf{f}^{(T)} + c \mathbf{h}) \\&\quad = \mathbb {E}\left[ \sum _{ \ell = 1 }^K I ( \mathbf{y}, \ell ) \exp (-\,y_{\ell } ( f_{\ell }^{(T)} (\mathbf{x}) + c h_{\ell } (\mathbf{x})))\right] \\&\quad = \sum _{ \ell = 1 }^K \mathbb {E}\left[ I ( \mathbf{y}, \ell ) \exp (-\,y_{\ell } ( f_{\ell }^{(T)} (\mathbf{x}) + c h_{\ell } (\mathbf{x})))\right] \\&\quad \approx \sum _{ \ell = 1 }^K \mathbb {E}\left[ I ( \mathbf{y}, \ell ) \exp (-\,y_{\ell } f_{\ell }^{(T)} (\mathbf{x}) ) \right. \\&\qquad \quad \left. ( 1 - c y_{\ell } h_{\ell }(\mathbf{x}) + c^2 y_{\ell }^2 h_{\ell }(\mathbf{x})^2/2 ) \right] \\&\quad = \sum _{ \ell = 1 }^K \mathbb {E}\left[ I ( \mathbf{y}, \ell ) \exp (-\,y_{\ell } f_{\ell }^{(T)} (\mathbf{x}) ) \right. \\&\quad \left. ( 1 - c y_{\ell } h_{\ell }(\mathbf{x}) + c^2/2 ) \right] \end{aligned}$$

where the last equality holds because \(h_{\ell } (\mathbf{x})\in \{-\,1, 1 \}\) and \(y_{\ell }\in \{-\,1, 1 \}\). This means that minimizing the multi-class exponential loss with respect to \(\mathbf{h}\) is equivalent to maximizing the weighted expectation

$$\begin{aligned} \mathbb {E}\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell )y_{\ell } h_{\ell } (\mathbf{x}) \right] \end{aligned}$$
(31)

where

$$\begin{aligned} w( \mathbf{x}, \mathbf{y}, \ell ) = I( \mathbf{y}, \ell ) \exp \left( -\,y_{\ell } f_{\ell }^{(T)} (\mathbf{x})\right) . \end{aligned}$$

Note that the weak learner maximizes the edge given in (14) computed on a finite dataset, which is an estimate of the weighted expectation.
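
As an illustration of these quantities on a finite sample, the following sketch computes the weights and the empirical edge for a candidate weak classifier. It follows our own conventions rather than the paper's code: \(I(\mathbf{y}, \ell)\) is treated as a generic per-(example, class) base weight defaulting to 1, and the weights are normalized to sum to one.

    import numpy as np

    def empirical_edge(F, Y, H, base_weight=None):
        # F: (n, K) strong-classifier scores f_ell(x_i)
        # Y: (n, K) multi-class labels coded in {-1, +1}
        # H: (n, K) votes of a candidate weak classifier, in {-1, +1}
        # base_weight: optional (n, K) array playing the role of I(y, ell);
        #              defaults to all ones here (an assumption of this sketch).
        if base_weight is None:
            base_weight = np.ones_like(F)
        W = base_weight * np.exp(-Y * F)   # w(x, y, ell)
        W = W / W.sum()                    # normalization (assumed convention)
        return float(np.sum(W * Y * H))    # empirical counterpart of (31)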

In the multi-class case, the label distribution for a fixed \(\mathbf{x}\) is a multinomial distribution with parameters \((p_1, \dots ,p_K ) \in \varDelta _{K}\), where \(\varDelta _{K}\) is the K-dimensional probability simplex. Note that \((p_1, \dots ,p_K )\) depends on \(\mathbf{x}\); with a slight abuse of notation, we do not indicate this dependence when \(\mathbf{x}\) is fixed and no confusion can arise, and write \((p_1 (\mathbf{x}), \dots ,p_K (\mathbf{x}))\) when the dependence matters.

Recall that the goal is to maximize the weighted expectation given in (31). As a next step we are going to compute the form of the optimal weak classifier. Let us define the vector \(\mathbf {e}_{\ell }\) as

$$\begin{aligned} e_{\ell ,\ell ^{\prime } } = {\left\{ \begin{array}{ll} 1, &{} \ell = \ell ^{\prime } \\ -\,1, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

Furthermore, assume that we are given a weak classifier \(\mathbf{h}(\cdot )\) such that \(h_{\widehat{\ell }}( \mathbf{x}) = 1\) for some \(\widehat{\ell }\) and \(h_{\ell }( \mathbf{x}) = -\,1\) otherwise. Then, for a fixed \(\mathbf{x}\) with label distribution \((p_1, \dots ,p_K )\), the expectation in (31), conditioned on \(\mathbf{x}\), can be written as

$$\begin{aligned} \mathbb {E}&\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell ) y_{\ell } h_{\ell } (\mathbf{x}) \vert \mathbf{x}\right] \nonumber \\&= \sum _{\ell = 1}^K p_{\ell } \left[ w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) h_{\ell } (\mathbf{x}) - \sum _{\ell ^{\prime } \ne \ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) h_{\ell ^{\prime }} (\mathbf{x}) \right] \nonumber \\&= p_{\widehat{\ell }} \left[ w(\mathbf{x}, \mathbf {e}_{\widehat{\ell }}, \widehat{\ell } ) + \sum _{\ell ^{\prime } \ne \widehat{\ell } } w(\mathbf{x}, \mathbf {e}_{\widehat{\ell }}, \ell ^{\prime } ) \right] \nonumber \\&\quad + \sum _{\ell \ne \widehat{\ell } } p_{\ell } \left[ -\, w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) - w(\mathbf{x}, \mathbf {e}_{\ell }, \widehat{\ell } ) + \sum _{\ell ^{\prime } \ne \ell , \widehat{\ell } } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) \right] \nonumber \\&= p_{\widehat{\ell }} \sum _{\ell =1 }^K w(\mathbf{x}, \mathbf {e}_{\widehat{\ell }}, \ell ) \nonumber \\&\quad + \sum _{\ell \ne \widehat{\ell } } p_{\ell } \left[ -\, 2w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) - 2w(\mathbf{x}, \mathbf {e}_{\ell }, \widehat{\ell } ) + \sum _{\ell ^{\prime } = 1 }^K w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) \right] \nonumber \\&= \sum _{\ell = 1 }^K \sum _{\ell ^{\prime } = 1 }^K p_{\ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) -2 \sum _{\ell \ne \widehat{\ell } } p_{\ell } \left[ w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) + w(\mathbf{x}, \mathbf {e}_{\ell }, \widehat{\ell } ) \right] \nonumber \\&= \sum _{\ell = 1 }^K \sum _{\ell ^{\prime } = 1 }^K p_{\ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ^{\prime } ) - 2 \sum _{\ell =1 }^K p_{\ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) \nonumber \\&\quad + 2 \left[ p_{\widehat{\ell }} w(\mathbf{x}, \mathbf {e}_{\widehat{\ell }}, \widehat{\ell } ) - \sum _{\ell \ne \widehat{\ell } } p_{\ell } w(\mathbf{x}, \mathbf {e}_{\ell }, \widehat{\ell } ) \right] \end{aligned}$$
(32)

Since only the last term depends on \(\widehat{\ell }\), the optimal classifier can be written in the form

$$\begin{aligned}&\widehat{\ell }(\mathbf{x}) = \mathrm{arg\, max}_{1\le \ell \le K } \\&\quad \left[ p_{\ell }(\mathbf{x}) w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) - \sum _{\ell ^{\prime } \ne \ell } p_{\ell ^{\prime }}(\mathbf{x}) w(\mathbf{x}, \mathbf {e}_{\ell ^{\prime }}, \ell )\right] \\&= \mathrm{arg\, max}_{1\le \ell \le K } \left[ p_{\ell }(\mathbf{x}) w(\mathbf{x}, \mathbf {e}_{\ell }, \ell ) \right] , \end{aligned}$$

and hence, the optimal classifier \(\mathbf{h}( \mathbf{x})\) can be written in the form of

$$\begin{aligned} h_{\ell } (\mathbf{x}) = {\left\{ \begin{array}{ll} 1 ,&{} \text {if }~ \ell = \widehat{\ell }(\mathbf{x})\\ -\,1, &{} \text {otherwise} \end{array}\right. } . \end{aligned}$$
(33)
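
The selection rule behind (33) can be written down directly. The following sketch (with made-up numbers and our own naming) picks the class maximizing \(p_{\ell }(\mathbf{x}) w(\mathbf{x}, \mathbf {e}_{\ell }, \ell )\) and emits +1 on that class and -1 elsewhere:

    import numpy as np

    def optimal_vote(p, w_diag):
        # p:      (K,) class probabilities p_ell(x) for a fixed x
        # w_diag: (K,) weights w(x, e_ell, ell)
        ell_hat = int(np.argmax(p * w_diag))
        h = -np.ones_like(p)
        h[ell_hat] = 1.0
        return h                          # the vote vector of (33)

    p = np.array([0.2, 0.5, 0.3])
    w_diag = np.array([1.0, 0.4, 1.5])
    print(optimal_vote(p, w_diag))        # [-1. -1.  1.], since 0.3 * 1.5 is largest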

Considering (32) over the joint distribution of the labels and features, we get that a classifier of the form (33) maximizes the population-level edge defined as

$$\begin{aligned} \gamma ^* = \mathbb {E}\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell ) y_{\ell } h_{\ell } (\mathbf{x}) \right] , \end{aligned}$$

where the expectation is taken over the joint distribution.

As a next step, one can determine the optimal coefficient c of \(\mathbf{h}(\cdot )\) given in (33) by minimizing \(J ( \mathbf{f}^{(T)} + c \mathbf{h})\):

$$\begin{aligned} c&= \mathop {\text {argmin}}\limits _{0< c} \mathbb {E}\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell ) \exp (-\,c y_{\ell } h_{\ell } (\mathbf{x}) )\right] \\&= \mathop {\text {argmin}}\limits _{0 < c} \mathbb {E}\left[ \sum _{\ell =1}^K w(\mathbf{x}, \mathbf{y}, \ell ) \left( \frac{1+ y_{\ell } h_{\ell } (\mathbf{x})}{2} \exp (-\,c) \right. \right. \\&\quad \left. \left. + \frac{1- y_{\ell } h_{\ell } (\mathbf{x})}{2} \exp (c) \right) \right] \end{aligned}$$

where the second equality holds because \(y_{\ell } h_{\ell } (\mathbf{x}) \in \{-\,1, 1 \}\). One can then compute analytically that the right-hand side is minimized by

$$\begin{aligned} c = \frac{1}{2}\log \frac{1+\gamma ^*}{1-\gamma ^*}, \end{aligned}$$

where \(\gamma ^*\) is the population-level edge. Note that AdaBoost.MH computes the coefficient of the weak classifier in a similar way, using the edge computed on a finite training set instead of the population-level edge. The rest of the proof, regarding the weight update, is analogous to that of Theorem 1 in Friedman et al. (2000). \(\square \)
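
As a quick numerical sanity check of the closing formula (a sketch with illustrative numbers, not tied to any experiment in the paper):

    import numpy as np

    def coefficient_from_edge(gamma):
        # c = 0.5 * log((1 + gamma) / (1 - gamma)), valid for gamma in (-1, 1)
        return 0.5 * np.log((1.0 + gamma) / (1.0 - gamma))

    print(coefficient_from_edge(0.2))   # ~0.2027: a weak classifier with a small
                                        # positive edge receives a small positive coefficient
    print(coefficient_from_edge(0.9))   # ~1.4722: a larger edge yields a larger coefficient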


Cite this article

Gosztolya, G., Busa-Fekete, R. Calibrating AdaBoost for phoneme classification. Soft Comput 23, 115–128 (2019). https://doi.org/10.1007/s00500-018-3577-z
