Ensemble Acoustic Modeling for CD-DNN-HMM Using Random Forests of Phonetic Decision Trees

Zhao, Tuo; Zhao, Yunxin; Chen, Xin

doi:10.1007/s11265-015-1001-9

Published: 21 April 2015

Volume 82, pages 187–196, (2016)

Journal of Signal Processing Systems

Abstract

We propose a novel approach to generate an ensemble of context-dependent deep neural networks (CD-DNNs) by using random forests of phonetic decision trees (RF-PDTs) and construct an ensemble acoustic model (EAM) accordingly for speech recognition. We present evaluation results on the TIMIT dataset and a telemedicine automatic captioning dataset and demonstrate the superior performance of the proposed RF-PDT+CD-DNN based EAM over the conventional CD-DNN based single acoustic model (SAM) in phone and word recognition accuracies.
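
As a rough illustration of the idea, the sketch below scores one triphone state at one frame with an ensemble in which each member network has its own state tying, as would result from a different phonetic decision tree. The abstract does not state how member scores are combined, so the weighted average of log scaled likelihoods used here is an assumption, and the function name, the tying maps, and the toy numbers are all hypothetical.

```python
import math


def ensemble_log_score(triphone_state, tyings, log_posteriors, log_priors, weights=None):
    """Combine member scores for one triphone state at one frame.

    tyings[k] maps a triphone-state name to the tied-state index used by
    member k (each member is assumed to come from a different decision tree).
    log_posteriors[k] and log_priors[k] are indexed by that tied-state index.
    The combination rule (weighted mean of log scaled likelihoods) is an
    assumption made for illustration, not taken from the paper.
    """
    k = len(tyings)
    if weights is None:
        weights = [1.0 / k] * k  # uniform weights by default
    total = 0.0
    for tying, log_post, log_prior, w in zip(tyings, log_posteriors, log_priors, weights):
        tied = tying[triphone_state]
        # Usual hybrid DNN-HMM convention: scaled likelihood = posterior / prior.
        total += w * (log_post[tied] - log_prior[tied])
    return total


# Toy example: two members tie the same triphone states differently.
tyings = [
    {"a-b+c[2]": 0, "a-b+d[2]": 0},  # member 0 ties both states together
    {"a-b+c[2]": 1, "a-b+d[2]": 0},  # member 1 keeps them apart
]
log_posteriors = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.75)],
]
log_priors = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.5), math.log(0.5)],
]

score = ensemble_log_score("a-b+c[2]", tyings, log_posteriors, log_priors)
print(score)
```

Because every member is queried through its own tying map, the ensemble can separate triphone states that a single tied-state inventory would merge, which is one plausible source of the diversity an ensemble needs.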

Author information

Authors and Affiliations

Department of Computer Science, University of Missouri, Columbia, MO, 65211, USA
Tuo Zhao & Yunxin Zhao
Pearson Knowledge Technologies, Menlo Park, CA, 94025, USA
Xin Chen

Corresponding author

Correspondence to Yunxin Zhao.

About this article

Cite this article

Zhao, T., Zhao, Y. & Chen, X. Ensemble Acoustic Modeling for CD-DNN-HMM Using Random Forests of Phonetic Decision Trees. J Sign Process Syst 82, 187–196 (2016). https://doi.org/10.1007/s11265-015-1001-9

Received: 15 November 2014
Revised: 17 February 2015
Accepted: 27 March 2015
Published: 21 April 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11265-015-1001-9
