Abstract
In this paper, we describe a parametric mixture model for the resonant characteristics of the vocal tract, in which Gaussian distributions model spectral frequency regions. A mixture of Gaussians (MoG) parametrisation scheme is used to model a smoothed representation of the spectra. This smoothing procedure removes all signal periodicity from the spectra, allowing highly natural analysis, manipulation and synthesis of speech. The goal of this parametrisation scheme is to ease the correspondence between the resonant characteristics of the vocal tract and the parametric distributions, and to model the spectrum with an appropriate number of parameters. Previously, a maximum likelihood (ML) approach to this parametrisation scheme was introduced. However, this approach has inherent local optima problems. Noting that a relatively small class of Gaussian densities can approximate a large class of distributions, we propose a new scheme whereby, starting with a large number of distributions in the mixture, we systematically reduce their number and re-approximate the densities in the mixture based on a distance criterion. The Kullback-Leibler (KL) distance was found to allow optimal MoG solutions to the spectra. Furthermore, a fitness measure based on KL information is used to provide a figure for estimating the model order in representing formant-like features. The proposed model is subjectively evaluated and is shown to reduce the number of Gaussians with an appreciable loss in the quality of the re-synthesised speech.
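The reduction idea summarised in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything here is an assumption made for illustration: the function names (`kl_gauss`, `merge`, `reduce_mixture`) are hypothetical, the pair to merge is chosen by the smallest symmetric KL divergence between univariate Gaussians, and the re-approximation step is a moment-matched merge. The paper's actual distance criterion and re-approximation may differ.

```python
import math
from itertools import combinations


def kl_gauss(m0, v0, m1, v1):
    """Closed-form KL(N(m0, v0) || N(m1, v1)) for univariate Gaussians (v = variance)."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)


def merge(c0, c1):
    """Moment-matched merge of two components given as (weight, mean, variance)."""
    w0, m0, v0 = c0
    w1, m1, v1 = c1
    w = w0 + w1
    m = (w0 * m0 + w1 * m1) / w
    # The merged variance keeps the second moment of the two-component sub-mixture.
    v = (w0 * (v0 + (m0 - m) ** 2) + w1 * (v1 + (m1 - m) ** 2)) / w
    return (w, m, v)


def reduce_mixture(components, target):
    """Greedily merge the closest pair (assumed criterion: symmetric KL) until `target` remain."""
    comps = list(components)

    def sym_kl(pair):
        (_, m0, v0), (_, m1, v1) = comps[pair[0]], comps[pair[1]]
        return kl_gauss(m0, v0, m1, v1) + kl_gauss(m1, v1, m0, v0)

    while len(comps) > max(target, 1):
        i, j = min(combinations(range(len(comps)), 2), key=sym_kl)
        merged = merge(comps[i], comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]
    return sorted(comps, key=lambda c: c[1])  # order by mean (frequency)


# Illustrative values only: (weight, mean in Hz, variance in Hz^2).
mixture = [
    (0.3, 500.0, 100.0 ** 2),
    (0.2, 520.0, 100.0 ** 2),
    (0.3, 1500.0, 150.0 ** 2),
    (0.2, 2500.0, 200.0 ** 2),
]
reduced = reduce_mixture(mixture, target=3)
```

In this toy input the two components near 500 Hz are the closest pair under the assumed criterion, so they are merged into a single component while the total mixture weight is preserved.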
Cite this article
Zolfaghari, P., Kato, H., Minami, Y. et al. Dynamic Assignment of Gaussian Components in Modelling Speech Spectra. J VLSI Sign Process Syst Sign Image Video Technol 45, 7–19 (2006). https://doi.org/10.1007/s11265-006-9768-3
DOI: https://doi.org/10.1007/s11265-006-9768-3