Abstract
Speaker variability is a well-known problem for state-of-the-art Automatic Speech Recognition (ASR) systems. In particular, handling children's speech is challenging because of substantial differences in how adult and child speakers pronounce the speech units. To build accurate ASR systems for all types of speakers, Hidden Markov Models with Gaussian mixture densities have been used extensively in combination with model adaptation techniques.
This paper compares different ways of improving the recognition of children's speech and describes a novel approach relying on a Class-Structured Gaussian Mixture Model (GMM).
A common solution for reducing speaker variability relies on gender and age adaptation. Here, it is first proposed to replace the gender and age labels with unsupervised clustering; the resulting speaker classes are used to adapt the conventional HMM. Second, the speaker classes are used to initialize a structured GMM, in which the Gaussian components of the mixture densities are organized with respect to the speaker classes. In a first variant, the mixture weights of the structured GMM are made dependent on the speaker class. In a second variant, the mixture weights are replaced by explicit dependencies between the Gaussian components of the mixture densities at consecutive frames (as in stranded GMMs, but here the GMMs are class-structured).
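To make the two structured-GMM variants concrete, the following is a minimal mathematical sketch; the notation (HMM state $j$, speaker class $c$, Gaussian component index $k$, observation $\mathbf{o}_t$) is introduced here only for illustration and does not claim to reproduce the paper's exact formulation.

% Class-structured GMM with class-dependent mixture weights:
% the K Gaussian components of state j are shared across speaker classes,
% while the weights w_{j,c,k} depend on the speaker class c.
p(\mathbf{o}_t \mid q_t = j,\, c) \;=\; \sum_{k=1}^{K} w_{j,c,k}\, \mathcal{N}\!\left(\mathbf{o}_t;\, \boldsymbol{\mu}_{j,k}, \boldsymbol{\Sigma}_{j,k}\right),
\qquad \sum_{k=1}^{K} w_{j,c,k} = 1.

% Stranded-style variant: the class-dependent weights are replaced by explicit
% dependencies between the Gaussian components used at consecutive frames
% (m_t denotes the component active at time t; the dependency term is written
% as conditioned only on the current state j for simplicity).
p(\mathbf{o}_t,\, m_t = k \mid q_t = j,\, m_{t-1} = l) \;=\; c_{j}(k \mid l)\, \mathcal{N}\!\left(\mathbf{o}_t;\, \boldsymbol{\mu}_{j,k}, \boldsymbol{\Sigma}_{j,k}\right),
\qquad \sum_{k=1}^{K} c_{j}(k \mid l) = 1.

Under this reading, the second variant needs no explicit speaker-class decision at decoding time: the frame-to-frame component dependencies select among the class-structured components as the utterance is processed.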
The different approaches are evaluated and compared on the TIDIGITS task. The best improvement is achieved when the structured GMM is combined with feature adaptation.