
Voice Conversion from Arbitrary Speakers Based on Deep Neural Networks with Adversarial Learning

  • Conference paper
Advances in Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2017)

Abstract

In this study, we propose a voice conversion technique for arbitrary speakers based on deep neural networks with adversarial learning, realized by introducing adversarial learning into conventional voice conversion. Adversarial learning is expected to enable more natural voice conversion by adding, to the generative model, a discriminative model that classifies input speech as natural speech or converted speech. Experiments showed that the proposed method was effective in enhancing the global variance (GV) of the mel-cepstrum, although the naturalness of the converted speech was slightly lower than that of speech produced with the conventional variance compensation technique.
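The training objective described above can be sketched as a generative (conversion) loss augmented with an adversarial term from a discriminator that classifies frames as natural or converted. The following is a minimal illustrative sketch, not the authors' implementation; the function names, the binary cross-entropy form, and the weight `w_adv` are assumptions.

```python
import numpy as np

def generator_loss(converted, target, d_score_converted, w_adv=0.1):
    """Spectral MSE plus adversarial term. `d_score_converted` is the
    discriminator's estimated probability that converted frames are natural.
    (Illustrative; the loss form and weight are assumptions.)"""
    mse = np.mean((converted - target) ** 2)
    # The generator is rewarded when the discriminator outputs 1 ("natural").
    adv = -np.mean(np.log(d_score_converted + 1e-8))
    return mse + w_adv * adv

def discriminator_loss(d_score_natural, d_score_converted):
    """Binary cross-entropy: natural frames toward 1, converted toward 0."""
    return (-np.mean(np.log(d_score_natural + 1e-8))
            - np.mean(np.log(1.0 - d_score_converted + 1e-8)))
```

Under this sketch, minimizing `generator_loss` pushes converted spectra toward the target while also pushing them toward the "natural" region of the discriminator, which is the mechanism the abstract credits with restoring GV.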



Acknowledgment

Part of this work was supported by JSPS KAKENHI Grant Number JP26280055 and JP15H02720.

Author information

Correspondence to Sou Miyamoto.


Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Miyamoto, S. et al. (2018). Voice Conversion from Arbitrary Speakers Based on Deep Neural Networks with Adversarial Learning. In: Pan, JS., Tsai, PW., Watada, J., Jain, L. (eds) Advances in Intelligent Information Hiding and Multimedia Signal Processing. IIH-MSP 2017. Smart Innovation, Systems and Technologies, vol 82. Springer, Cham. https://doi.org/10.1007/978-3-319-63859-1_13


  • DOI: https://doi.org/10.1007/978-3-319-63859-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63858-4

  • Online ISBN: 978-3-319-63859-1

  • eBook Packages: Engineering (R0)
