Abstract

In this paper, we describe a parametric mixture model for the resonant characteristics of the vocal tract, in which Gaussian distributions model spectral frequency regions. A mixture of Gaussians (MoG) parametrisation scheme is used to model a smoothed representation of the spectra. The smoothing procedure removes all signal periodicity from the spectra, allowing highly natural analysis, manipulation and synthesis of speech. The goal of this parametrisation scheme is to ease the correspondence between the resonant characteristics of the vocal tract and the parametric distributions, and to model the spectrum with an appropriate number of parameters. Previously, a maximum likelihood (ML) approach to this parametrisation scheme was introduced. However, this approach has inherent local optima problems. Noting that a relatively small class of Gaussian densities can approximate a large class of distributions, we propose a new scheme in which, starting with a large number of distributions in the mixture, we systematically reduce their number and re-approximate the densities in the mixture based on a distance criterion. The Kullback-Leibler (KL) distance was found to allow optimal MoG solutions to the spectra. Furthermore, a fitness measure based on KL information is used to provide a figure for estimating the model order in representing formant-like features. The proposed model is subjectively evaluated and is shown to reduce the number of Gaussians without an appreciable loss in the quality of the re-synthesised speech.
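
The reduction idea summarised in the abstract (start with many Gaussians, then repeatedly reduce their number and re-approximate using a KL-based distance) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the components are univariate and stored as (weight, mean, variance) tuples, the pair to merge is chosen by symmetric KL divergence, and the merged component is obtained by moment matching. The function names `kl_gauss`, `sym_kl`, `merge` and `reduce_mixture` are hypothetical.

```python
import math
from itertools import combinations


def kl_gauss(m1, v1, m2, v2):
    """Closed-form KL(N(m1, v1) || N(m2, v2)) for univariate Gaussians (v = variance)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def sym_kl(a, b):
    """Symmetric KL distance between two (weight, mean, variance) components."""
    (_, m1, v1), (_, m2, v2) = a, b
    return kl_gauss(m1, v1, m2, v2) + kl_gauss(m2, v2, m1, v1)


def merge(a, b):
    """Moment-matched merge: keeps total weight, mean and variance of the pair."""
    (w1, m1, v1), (w2, m2, v2) = a, b
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return (w, m, v)


def reduce_mixture(components, target):
    """Greedily merge the closest pair (by symmetric KL) until `target` components remain."""
    comps = list(components)
    target = max(target, 1)  # at least one component must survive
    while len(comps) > target:
        i, j = min(
            combinations(range(len(comps)), 2),
            key=lambda ij: sym_kl(comps[ij[0]], comps[ij[1]]),
        )
        merged = merge(comps[i], comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]
    return sorted(comps, key=lambda c: c[1])  # order by mean (frequency)


# Illustrative values only: means and standard deviations in Hz, not data from the paper.
mixture = [
    (0.3, 500.0, 100.0 ** 2),
    (0.2, 550.0, 120.0 ** 2),
    (0.3, 1500.0, 150.0 ** 2),
    (0.2, 2500.0, 200.0 ** 2),
]
reduced = reduce_mixture(mixture, target=3)
```

In this sketch the two overlapping low-frequency components are the closest pair and are merged first, while the well-separated components are left untouched. The scheme evaluated in the paper may differ in its distance criterion, its re-approximation step and its model-order selection.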

Author information

Corresponding author

Correspondence to Parham Zolfaghari.

About this article

Cite this article

Zolfaghari, P., Kato, H., Minami, Y. et al. Dynamic Assignment of Gaussian Components in Modelling Speech Spectra. J VLSI Sign Process Syst Sign Image Video Technol 45, 7–19 (2006). https://doi.org/10.1007/s11265-006-9768-3

  • DOI: https://doi.org/10.1007/s11265-006-9768-3
