Abstract
We propose a novel approach that uses random forests of phonetic decision trees (RF-PDTs) to generate an ensemble of context-dependent deep neural networks (CD-DNNs), from which we construct an ensemble acoustic model (EAM) for speech recognition. We present evaluation results on the TIMIT dataset and a telemedicine automatic captioning dataset, and demonstrate that the proposed RF-PDT+CD-DNN based EAM outperforms the conventional CD-DNN based single acoustic model (SAM) in both phone and word recognition accuracy.
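As a rough illustration of how an ensemble acoustic model could score a frame, the sketch below assumes that each ensemble member has its own state tying (one phonetic decision tree per member) and that member scores are combined by a weighted average of log-likelihoods at the triphone-state level. The abstract does not specify the combination rule, so this rule, the function name `combine_scores`, and the data layout are all assumptions made for illustration only.

```python
def combine_scores(member_scores, member_tying, triphone_state, weights=None):
    """Combine per-member frame scores for one triphone state.

    member_scores: list of dicts, tied-state id -> log-likelihood for the
                   current frame (one dict per ensemble member).
    member_tying:  list of dicts, triphone state -> tied-state id
                   (one dict per member, i.e. per decision tree).
    weights:       optional combination weights; uniform if omitted.

    Assumption: scores are combined as a weighted average of
    log-likelihoods. This is a hypothetical rule, not one taken
    from the source.
    """
    k = len(member_scores)
    if weights is None:
        weights = [1.0 / k] * k
    total = 0.0
    for w, scores, tying in zip(weights, member_scores, member_tying):
        tied_id = tying[triphone_state]  # each tree ties the state differently
        total += w * scores[tied_id]
    return total


# Two hypothetical members that map the same triphone state to different tied states.
tying = [{"a-b+c[2]": 0}, {"a-b+c[2]": 5}]
scores = [{0: -2.0}, {5: -4.0}]
combined = combine_scores(scores, tying, "a-b+c[2]")
```

Because each tree ties triphone states differently, the members generally disagree on which output unit models a given state, and that disagreement is the source of diversity the ensemble relies on.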
Cite this article
Zhao, T., Zhao, Y. & Chen, X. Ensemble Acoustic Modeling for CD-DNN-HMM Using Random Forests of Phonetic Decision Trees. J Sign Process Syst 82, 187–196 (2016). https://doi.org/10.1007/s11265-015-1001-9
DOI: https://doi.org/10.1007/s11265-015-1001-9