A pitch synchronous approach to design voice conversion system using source-filter correlation

Hussain Laskar, Rabul; Banerjee, Kalyan; Talukdar, Fazal Ahmed; Rao, K. Sreenivasa

doi:10.1007/s10772-012-9164-2

Published: 26 June 2012

Volume 15, pages 419–431, (2012)

International Journal of Speech Technology

Rabul Hussain Laskar¹,
Kalyan Banerjee¹,
Fazal Ahmed Talukdar¹ &
K. Sreenivasa Rao²

Abstract

We propose a pitch synchronous approach to designing a voice conversion system that takes into account the correlation between the excitation signal and the vocal tract system characteristics of the speech production mechanism. The glottal closure instants (GCIs), also known as epochs, are used as anchor points for analysis and synthesis of the speech signal. The Gaussian mixture model (GMM) is considered the state-of-the-art method for vocal tract modification in a voice conversion framework. However, GMM based models generate overly smooth utterances and need to be tuned according to the amount of available training data. In this paper, we propose a support vector machine multi-regressor (M-SVR) based model that requires fewer tuning parameters to capture a mapping function between the vocal tract characteristics of the source and the target speaker. The prosodic features are modified using an epoch based method, which is compared with the baseline pitch synchronous overlap and add (PSOLA) based method for pitch and time scale modification. The linear prediction residual (LP residual) signal corresponding to each frame of the converted vocal tract transfer function is selected from the target residual codebook using a modified cost function. The cost function is calculated from the mapped vocal tract transfer function and its dynamics, together with the residual phase, pitch period and energy differences with respect to the codebook entries, and the entry that minimizes it is selected. The LP residual signal corresponding to the target speaker is generated by concatenating the selected frame and its previous frame so as to retain the maximum information around the GCIs. The proposed system is also tested using a GMM based model for vocal tract modification. The average mean opinion score (MOS) and ABX test results are 3.95 and 85 for the GMM based system and 3.98 and 86 for the M-SVR based system, respectively.
The subjective and objective evaluation results suggest that the proposed M-SVR based model for vocal tract modification, combined with modified residual selection and the epoch based model for prosody modification, can provide a good quality synthesized target output. The results also suggest that the proposed integrated system performs slightly better than the GMM based baseline system designed using either the epoch based or the PSOLA based model for prosody modification.
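
The residual selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the cost function, so everything below is an assumption made for illustration only: the cost is taken to be a weighted sum of Euclidean distances over five terms (vocal tract features, their dynamics, residual phase, pitch period and energy), the weights default to 1, and the names `select_residual_index`, `residual_with_context` and the dictionary keys are hypothetical.

```python
import numpy as np

# Hypothetical feature names; the paper's actual feature set is not given here.
TERMS = ("vt", "delta_vt", "phase", "pitch", "energy")


def select_residual_index(query, codebook, weights=None):
    """Return (best_index, costs) for one converted frame.

    query    : dict mapping each name in TERMS to a scalar or 1-D vector.
    codebook : dict mapping each name in TERMS to an array whose first
               axis indexes the target-speaker codebook entries.
    weights  : optional dict of per-term weights (assumed equal if omitted).
    """
    if weights is None:
        weights = {name: 1.0 for name in TERMS}  # assumption: equal weights
    n_entries = len(codebook["pitch"])
    costs = np.zeros(n_entries)
    for name in TERMS:
        entries = np.asarray(codebook[name], dtype=float).reshape(n_entries, -1)
        target = np.asarray(query[name], dtype=float).reshape(1, -1)
        # Euclidean distance between the query and every codebook entry.
        costs += weights[name] * np.linalg.norm(entries - target, axis=1)
    return int(np.argmin(costs)), costs


def residual_with_context(residual_frames, index):
    """Concatenate the previous frame and the selected frame.

    The ordering (previous frame first) and the handling of index 0
    (return the selected frame alone) are assumptions of this sketch.
    """
    if index == 0:
        return np.asarray(residual_frames[0], dtype=float)
    return np.concatenate([residual_frames[index - 1], residual_frames[index]])
```

In use, `select_residual_index` would be called once per converted frame, and the returned index passed to `residual_with_context` to obtain the excitation segment for synthesis. In practice the terms have very different scales (pitch period in samples versus normalized energy, for instance), so the weights would need to be chosen accordingly.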

Author information

Authors and Affiliations

Department of Electronics & Communication Engineering, National Institute of Technology, Silchar, Assam, India
Rabul Hussain Laskar, Kalyan Banerjee & Fazal Ahmed Talukdar
School of Information Technology, IIT Kharagpur, Kharagpur, 721302, West Bengal, India
K. Sreenivasa Rao

Corresponding author

Correspondence to Rabul Hussain Laskar.

About this article

Cite this article

Hussain Laskar, R., Banerjee, K., Talukdar, F.A. et al. A pitch synchronous approach to design voice conversion system using source-filter correlation. Int J Speech Technol 15, 419–431 (2012). https://doi.org/10.1007/s10772-012-9164-2

Received: 09 March 2012
Accepted: 12 June 2012
Published: 26 June 2012
Issue Date: September 2012
DOI: https://doi.org/10.1007/s10772-012-9164-2
