Abstract
In this article, we present a new approach to modeling speaker-dependent systems, inspired by the eigenface technique used in face recognition. We build a low-dimensional linear vector space, called the eigenspace, in which speakers are located. The basis vectors of this space are called eigenvoices, and each eigenvoice models a direction of inter-speaker variability. The eigenspace is built during the training phase; any speaker model can then be expressed as a linear combination of eigenvoices. The benefit of this technique, as set out in this article, lies in reducing the number of parameters that describe a model. This in turn reduces the number of parameters to estimate, as well as computation and/or storage costs. We apply the approach to speaker adaptation and speaker recognition, and report some experimental results.
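The idea of expressing a speaker model as a linear combination of eigenvoices can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the abstract: each speaker model is flattened into a fixed-length "supervector", the eigenspace is obtained by principal component analysis of the training supervectors, and a new speaker is located by a plain least-squares projection. The function names and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np

def build_eigenspace(supervectors, k):
    """Return (mean, eigenvoices) from a (T, D) array of training supervectors.

    Assumption: the eigenspace is found by PCA, computed here with an SVD.
    """
    mean = supervectors.mean(axis=0)
    centered = supervectors - mean
    # Rows of vt are orthonormal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(supervector, mean, eigenvoices):
    """Locate a speaker in the eigenspace: K weights instead of D parameters."""
    return eigenvoices @ (supervector - mean)

def reconstruct(weights, mean, eigenvoices):
    """Rebuild a full model as the mean plus a linear combination of eigenvoices."""
    return mean + weights @ eigenvoices

# Synthetic illustration: 20 training speakers, 50 parameters each,
# generated from 3 hidden directions of variability (hypothetical data).
rng = np.random.default_rng(0)
hidden_directions = rng.normal(size=(3, 50))
offset = rng.normal(size=50)
train = offset + rng.normal(size=(20, 3)) @ hidden_directions

mean, eigenvoices = build_eigenspace(train, k=3)

# A new speaker drawn from the same hidden directions.
new_speaker = offset + rng.normal(size=3) @ hidden_directions
weights = project(new_speaker, mean, eigenvoices)
approximation = reconstruct(weights, mean, eigenvoices)

print("weights describing the speaker:", weights.shape[0])
print("parameters in the full model:", new_speaker.shape[0])
print("max reconstruction error:", np.max(np.abs(approximation - new_speaker)))
```

In this toy setting the new speaker is described by 3 weights rather than 50 parameters, which is the kind of reduction the abstract refers to. With real speech data the reconstruction would only be approximate, and the weights would typically be estimated from adaptation data rather than from a known supervector.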
Summary
This article presents a new approach inspired by image recognition, adapted and applied to speech. A vector space of reduced dimension, called the eigenspace, is built, and speakers are confined to it. The basis vectors of this space are called eigenvoices. Each eigenvoice models one component of inter-speaker variability. The eigenspace is built during the usual training phase of speech systems. A speaker model is then expressed as a linear combination of the vectors of this reduced speaker space. The advantage of the method, emphasized in the article, is the reduction in the number of parameters that characterize a model. As a result, the number of parameters to estimate is reduced, as are the computation time and/or storage cost. The technique is applied here to speaker adaptation and to automatic speaker recognition. Some experimental results are presented.
Cite this article
Nguyen, P., Kuhn, R., Junqua, JC. et al. Eigenvoices: A compact representation of speakers in model space. Ann. Télécommun. 55, 163–171 (2000). https://doi.org/10.1007/BF03001909