Abstract
This article discusses and adjusts the training aspects relevant to building robust and accurate HMM models for large-vocabulary recognition systems, namely speech features, training steps, and tying options for context-dependent (CD) phonemes. The well-known MASPER training scheme serves as the basis for building the HMM models. First, the incorporation of voicing information and its effect on classical extraction methods such as MFCC and PLP is shown, together with the derivative features; the relative error reductions reach up to 50%. Next, an enhancement of the standard training procedure that introduces garbled speech models is presented and tested on real data; it yields a drop of more than 5% in the error rate. Finally, the options for tying the states of CD phonemes using decision trees and phoneme classification are adjusted, tested, and explained.
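As a rough illustration of what appending voicing information to cepstral features and then computing derivative features can look like, the sketch below uses the commonly used regression formula for delta coefficients. The abstract does not specify the voicing measure, the delta window, or the edge handling, so the function names `delta` and `append_voicing`, the window of 2 frames, and the per-frame voicing score are all assumptions made for illustration only.

```python
from typing import List


def delta(features: List[List[float]], window: int = 2) -> List[List[float]]:
    """Regression-based derivative (delta) features.

    d_t = sum_{k=1..W} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..W} k^2)
    Frames beyond the edges are replaced by the first or last frame
    (an assumption; other edge policies exist).
    """
    n = len(features)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(features[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = features[min(t + k, n - 1)][d]
                behind = features[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def append_voicing(cepstra: List[List[float]],
                   voicing: List[float]) -> List[List[float]]:
    """Append a hypothetical per-frame voicing score to each static
    vector (e.g. MFCC or PLP), then append the deltas of the result."""
    if len(cepstra) != len(voicing):
        raise ValueError("one voicing value is required per frame")
    static = [c + [v] for c, v in zip(cepstra, voicing)]
    deltas = delta(static)
    return [s + d for s, d in zip(static, deltas)]


if __name__ == "__main__":
    # Toy one-dimensional "cepstrum" and a made-up voicing track.
    cep = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    voi = [0.0, 1.0, 1.0, 1.0, 0.0]
    for frame in append_voicing(cep, voi):
        print(frame)
```

Each output frame holds the static coefficients, the voicing score, and the deltas of both, so the voicing feature gets its own derivative stream in the same way as the cepstral coefficients.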
Cite this article
Kačur, J., Rozinaj, G. Building accurate and robust HMM models for practical ASR systems. Telecommun Syst 52, 1683–1696 (2013). https://doi.org/10.1007/s11235-011-9660-8
DOI: https://doi.org/10.1007/s11235-011-9660-8