Voice Conversion with Denoising Diffusion Probabilistic GAN Models

  • Conference paper
Advanced Data Mining and Applications (ADMA 2023)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14179))

Abstract

Voice conversion transforms speaking style while preserving linguistic content. Many researchers use deep generative models for this task. Generative Adversarial Networks (GANs) generate high-quality samples quickly, but the samples lack diversity. Denoising Diffusion Probabilistic Models (DDPMs) outperform GANs in mode coverage and sample diversity, but they are computationally expensive and their inference is slower than that of GANs. To make GANs and DDPMs more practical, we propose DiffGAN-VC, a variant of GANs and DDPMs, for non-parallel many-to-many voice conversion (VC). We denoise with large steps and model each denoising step with a multimodal conditional GAN, yielding a denoising diffusion generative adversarial network. In both objective and subjective evaluation experiments, DiffGAN-VC achieves high voice quality on non-parallel data sets. Compared with the CycleGAN-VC method, DiffGAN-VC achieves higher speaker similarity, naturalness, and sound quality.
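To make the "large steps" idea concrete, the following minimal sketch shows only the standard DDPM forward (noising) process run with a handful of steps instead of hundreds. It is not the DiffGAN-VC implementation: the function names, the linear schedule, the beta range, and the three-value stand-in for a mel-spectrogram frame are all assumptions chosen for illustration. With so few steps, each step has to add a large amount of noise, which is the regime the abstract describes.

```python
import math
import random


def linear_betas(num_steps, beta_start=0.1, beta_end=0.9):
    """Hypothetical linear noise schedule (assumed values, not from the paper).

    The betas are large because only a few diffusion steps are used.
    """
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t) from the DDPM forward process."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def diffuse(x0, t, abar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
betas = linear_betas(4)          # four large steps (assumed step count)
abar = alpha_bars(betas)
frame = [0.5, -1.0, 0.25]        # stand-in for one mel-spectrogram frame
noisy = diffuse(frame, 3, abar, rng)
```

After the last of the four steps, `abar[-1]` is close to zero, so `noisy` is almost pure Gaussian noise. A denoiser that reverses such large jumps faces a multimodal target, which is presumably why the abstract pairs large steps with a conditional GAN rather than a simple Gaussian model.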

Acknowledgement

This paper is supported by the Key Research and Development Program of Guangdong Province under grant No. 2021B0101400003.

Author information

Corresponding author

Correspondence to Jianzong Wang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, X., Wang, J., Cheng, N., Xiao, J. (2023). Voice Conversion with Denoising Diffusion Probabilistic GAN Models. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14179. Springer, Cham. https://doi.org/10.1007/978-3-031-46674-8_11

  • DOI: https://doi.org/10.1007/978-3-031-46674-8_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46673-1

  • Online ISBN: 978-3-031-46674-8

  • eBook Packages: Computer Science, Computer Science (R0)
