Abstract
Speaker variability is a well-known problem for state-of-the-art Automatic Speech Recognition (ASR) systems. In particular, handling children's speech is challenging because of substantial differences in how adult and child speakers pronounce the speech units. To build accurate ASR systems for all types of speakers, Hidden Markov Models with Gaussian mixture densities have been used extensively in combination with model adaptation techniques.
This paper compares different ways of improving the recognition of children's speech and describes a novel approach relying on a Class-Structured Gaussian Mixture Model (GMM).
A common solution for reducing speaker variability relies on gender and age adaptation. Here, it is first proposed to replace the gender and age labels with unsupervised clustering; the resulting speaker classes are used to adapt the conventional HMM. Second, the speaker classes are used to initialize a structured GMM, in which the Gaussian components of the mixture densities are organized with respect to the speaker classes. In a first variant, the mixture weights of the structured GMM are made dependent on the speaker class. In a second variant, the mixture weights are replaced by explicit dependencies between the Gaussian components of the mixture densities at consecutive frames (as in stranded GMMs, but here the GMMs are class-structured).
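To make the two structured-GMM variants concrete, the following is a minimal mathematical sketch; the notation (HMM state $j$, speaker class $c$, Gaussian component index $k$, observation $\mathbf{o}_t$) is introduced here only for illustration and does not claim to reproduce the paper's exact formulation.

% Class-structured GMM with class-dependent mixture weights:
% the K Gaussian components of state j are shared across speaker classes,
% while the weights w_{j,c,k} depend on the speaker class c.
p(\mathbf{o}_t \mid q_t = j,\, c) \;=\; \sum_{k=1}^{K} w_{j,c,k}\, \mathcal{N}\!\left(\mathbf{o}_t;\, \boldsymbol{\mu}_{j,k}, \boldsymbol{\Sigma}_{j,k}\right),
\qquad \sum_{k=1}^{K} w_{j,c,k} = 1.

% Stranded-style variant: the class-dependent weights are replaced by explicit
% dependencies between the Gaussian components used at consecutive frames
% (m_t denotes the component active at time t; the dependency term is written
% as conditioned only on the current state j for simplicity).
p(\mathbf{o}_t,\, m_t = k \mid q_t = j,\, m_{t-1} = l) \;=\; c_{j}(k \mid l)\, \mathcal{N}\!\left(\mathbf{o}_t;\, \boldsymbol{\mu}_{j,k}, \boldsymbol{\Sigma}_{j,k}\right),
\qquad \sum_{k=1}^{K} c_{j}(k \mid l) = 1.

Under this reading, the second variant needs no explicit speaker-class decision at decoding time: the frame-to-frame component dependencies select among the class-structured components as the utterance is processed.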
The different approaches are evaluated and compared on the TIDIGITS task. The best improvement is achieved when the structured GMM is combined with feature adaptation.