Abstract
This paper compares the performance of different acoustic modeling units in deep neural network (DNN) based large vocabulary continuous speech recognition (LVCSR) systems for Chinese. Recently, DNN-based acoustic modeling has achieved very competitive performance on many speech recognition tasks and has become a focus of current LVCSR research. Previous work has studied both context-independent and context-dependent DNN-based acoustic models. For Chinese, a syllabic language, the choice of basic modeling units in DNN-based LVCSR systems is an important issue. In this work, three basic modeling units are discussed and compared: syllables, Initial/Finals, and phones. Experimental results show that, in the DNN-based systems, context-dependent phones obtain the best performance, while context-independent syllables perform similarly to context-dependent Initial/Finals. In addition, how the number of clustered states affects the performance of DNN-based systems is also discussed; the results reveal some properties that differ from GMM-based systems.
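The hybrid setup the abstract describes, a feedforward network mapping acoustic feature frames to posteriors over the (possibly clustered) states of a chosen unit inventory, can be sketched as below. This is a minimal illustration, not the paper's implementation: the inventory sizes, network shapes, and feature dimension are all assumed values for the sake of the example.

```python
import numpy as np

# Illustrative (assumed) context-independent inventory sizes for Mandarin;
# context-dependent modeling multiplies these via state clustering.
UNITS = {"syllable": 408, "initial_final": 61, "phone": 35}

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class TinyDNN:
    """Toy feedforward acoustic model: feature frames -> per-frame state posteriors."""

    def __init__(self, n_in, n_hidden, n_states, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_states))
        self.b2 = np.zeros(n_states)

    def posteriors(self, frames):
        h = np.maximum(0.0, frames @ self.W1 + self.b1)  # ReLU hidden layer
        return softmax(h @ self.W2 + self.b2)            # one distribution per frame

# Example: score 4 frames of 13-dim features against a phone-unit output layer.
model = TinyDNN(n_in=13, n_hidden=32, n_states=UNITS["phone"])
post = model.posteriors(np.zeros((4, 13)))  # shape (4, 35), each row sums to 1
```

Changing `n_states` is all that distinguishes the three unit choices at the network's output layer; the comparison in the paper is about how that choice (and the number of clustered states behind it) trades off trainability against modeling resolution.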
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Li, X., Yang, Y., Wu, X. (2013). A Comparative Study on Selecting Acoustic Modeling Units in Deep Neural Networks Based Large Vocabulary Chinese Speech Recognition. In: Sun, C., Fang, F., Zhou, ZH., Yang, W., Liu, ZY. (eds) Intelligence Science and Big Data Engineering. IScIDE 2013. Lecture Notes in Computer Science, vol 8261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-42057-3_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-42056-6
Online ISBN: 978-3-642-42057-3