Abstract
This paper presents a method for estimating and mapping parametric models of speech resonances at formants for voice conversion. The spectral features at formants that contribute to voice characteristics are the trajectories of the frequencies, bandwidths and intensities of the formant resonances. The formant features are extracted from the poles of a linear prediction (LP) model of speech. The statistical distributions of formants are modelled by a two-dimensional hidden Markov model (HMM) spanning the time and frequency dimensions. Experimental results show a close match between the HMM-based formant models and the histograms of formants. For voice conversion, two alternative methods are explored for mapping the formants of a source speaker to those of a target speaker. The first method is based on an adaptive, formant-tracking warping of the frequency response of the LP model; the second is based on rotation of the poles of the LP model of speech. Both methods transform all spectral parameters of the formant resonances of the source speaker towards those of the target speaker. In addition, the issues affecting the selection of the warping ratios for the mapping functions are investigated. Experimental results of formant estimation and perceptual evaluation of voice morphing based on parametric formant models are presented.
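As an illustration of the two pole-based ideas named in the abstract, the following is a minimal sketch rather than the authors' implementation. It uses the standard textbook relations between an LP pole z = r·exp(jθ) and a resonance (frequency = θ·fs/2π, bandwidth = −ln(r)·fs/π), and it assumes a single global warping ratio, whereas the paper investigates how such ratios should be selected. The function names and the parameters `freq_ratio` and `bw_ratio` are hypothetical.

```python
import numpy as np

def lp_from_formants(freqs_hz, bws_hz, fs):
    """Build synthetic LP coefficients [1, a1, ..., ap] from resonances (test helper)."""
    radii = np.exp(-np.pi * np.asarray(bws_hz) / fs)
    angles = 2 * np.pi * np.asarray(freqs_hz) / fs
    upper = radii * np.exp(1j * angles)
    return np.real(np.poly(np.concatenate([upper, np.conj(upper)])))

def poles_to_formants(lp_coeffs, fs):
    """Return (frequencies_hz, bandwidths_hz) of the complex LP poles, sorted by frequency."""
    poles = np.roots(lp_coeffs)
    poles = poles[np.imag(poles) > 0]           # keep one pole per conjugate pair
    freqs = np.angle(poles) * fs / (2 * np.pi)  # pole angle -> resonance frequency
    bws = -np.log(np.abs(poles)) * fs / np.pi   # pole radius -> 3 dB bandwidth
    order = np.argsort(freqs)
    return freqs[order], bws[order]

def rotate_poles(lp_coeffs, freq_ratio, bw_ratio=1.0):
    """Rotate complex poles by scaling their angles; optionally scale bandwidths.

    Assumption: one global ratio for all formants. Real poles are left unchanged.
    """
    poles = np.roots(lp_coeffs)
    upper = poles[np.imag(poles) > 0]
    real = poles[np.imag(poles) == 0]
    angles = np.clip(np.angle(upper) * freq_ratio, 0.0, np.pi)  # stay below Nyquist
    radii = np.abs(upper) ** bw_ratio           # r**k multiplies the bandwidth by k
    moved = radii * np.exp(1j * angles)
    return np.real(np.poly(np.concatenate([moved, np.conj(moved), real])))

# Usage: two synthetic resonances at an 8 kHz sampling rate.
fs = 8000
source = lp_from_formants([500.0, 1500.0], [80.0, 120.0], fs)
converted = rotate_poles(source, freq_ratio=1.1, bw_ratio=1.5)
print(poles_to_formants(source, fs))
print(poles_to_formants(converted, fs))
```

Scaling the pole angle shifts the formant frequency, while raising the pole radius to a power scales the bandwidth, so a single pole operation can move both parameters of a resonance. Frame-by-frame LP analysis, formant tracking and the HMM-based models described above are outside the scope of this sketch.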
Cite this article
Rentzos, D., Vaseghi, S., Yan, Q. et al. Parametric Formant Modelling and Transformation in Voice Conversion. Int J Speech Technol 8, 227–245 (2005). https://doi.org/10.1007/s10772-006-5692-y
DOI: https://doi.org/10.1007/s10772-006-5692-y