A novel voice conversion approach using cascaded powerful cepstrum predictors with excitation and phase extracted from the target training space encoded as a KD-tree

Published in International Journal of Speech Technology

Abstract

Voice conversion is an important problem in audio signal processing: the goal is to transform the speech signal of a source speaker so that it sounds as if it had been uttered by a target speaker. Our contribution in this paper is a new methodology for modeling the relationship between two sets of spectral envelopes. The proposed systems work by (1) cascading deep neural networks (DNNs) and Gaussian mixture models (GMMs) to construct DNN–GMM and GMM–DNN–GMM models that learn a global mapping between the cepstral vectors of the two speakers, and (2) using a new spectral synthesis process with cascaded cepstrum predictors, where the excitation and phase are extracted from the target training space encoded as a KD-tree. Experimental results show that the proposed methods markedly improve the intelligibility, quality, and naturalness of the converted speech compared with stimuli obtained by baseline conversion methods. Extracting the excitation and phase from the target training space preserves the target speaker's identity.
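The abstract describes the DNN–GMM cascades only at a high level. For orientation, the following is a minimal sketch of the classical joint-density GMM regression that such cascaded cepstral mappings build on; it is an illustration under assumed names and tooling (NumPy, SciPy, scikit-learn), not the authors' implementation, and it presumes the source and target cepstral frames have already been time-aligned.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Minimal sketch of classical joint-density GMM regression, shown only as
# background for the DNN-GMM cascades; NOT the authors' code. Frame
# alignment (e.g. by dynamic time warping) is assumed to be done already.

def fit_joint_gmm(X_src, Y_tgt, n_components=8, seed=0):
    """Fit a full-covariance GMM on stacked frames z = [x; y]."""
    Z = np.hstack([X_src, Y_tgt])                     # (n_frames, 2 * dim)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(Z)

def convert_frame(gmm, x, dim):
    """MMSE mapping E[y | x]: posterior-weighted conditional means."""
    mu_x, mu_y = gmm.means_[:, :dim], gmm.means_[:, dim:]
    S_xx = gmm.covariances_[:, :dim, :dim]
    S_yx = gmm.covariances_[:, dim:, :dim]
    # Posterior responsibilities P(m | x) under the marginal model of x.
    lik = np.array([multivariate_normal.pdf(x, mu_x[m], S_xx[m])
                    for m in range(gmm.n_components)])
    post = gmm.weights_ * lik
    post /= post.sum()
    # E[y | x, m] = mu_y^m + S_yx^m (S_xx^m)^{-1} (x - mu_x^m)
    return sum(post[m] * (mu_y[m]
                          + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m]))
               for m in range(gmm.n_components))
```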

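The second component, retrieving excitation and phase from the target training space via a KD-tree, can be sketched in the same spirit. The class below is a hypothetical illustration using SciPy's cKDTree: the target training cepstra are indexed once, and at conversion time each converted cepstral vector looks up the nearest target frame, whose stored excitation and phase are reused for synthesis. All class and attribute names are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical sketch of the KD-tree idea: index the target training cepstra,
# then reuse the excitation and phase of the nearest target frame at
# synthesis time. Names here are illustrative, not the paper's code.

class TargetFrameIndex:
    def __init__(self, target_cepstra, target_excitations, target_phases):
        # target_cepstra:     (n_frames, dim) cepstral vectors of the target
        # target_excitations: per-frame excitation data, aligned with cepstra
        # target_phases:      per-frame phase spectra, aligned with cepstra
        self.tree = cKDTree(np.asarray(target_cepstra))
        self.excitations = target_excitations
        self.phases = target_phases

    def lookup(self, converted_cepstrum):
        """Return excitation/phase of the closest target training frame."""
        _, idx = self.tree.query(converted_cepstrum)  # ~O(log n) per query
        return self.excitations[idx], self.phases[idx]
```

Combining each predicted cepstral envelope with natural target excitation and phase retrieved this way, rather than with synthetic ones, is what the abstract credits for preserving the target speaker's identity.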

Author information

Corresponding author

Correspondence to Imen Ben Othmane.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ben Othmane, I., Di Martino, J. & Ouni, K. A novel voice conversion approach using cascaded powerful cepstrum predictors with excitation and phase extracted from the target training space encoded as a KD-tree. Int J Speech Technol 22, 1007–1019 (2019). https://doi.org/10.1007/s10772-019-09643-4

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10772-019-09643-4

Keywords
