Abstract
Voice conversion (VC) transforms speaking style while preserving linguistic content, and deep generative models are now widely used for the task. Generative Adversarial Networks (GANs) generate high-quality samples quickly, but the generated samples lack diversity. Denoising Diffusion Probabilistic Models (DDPMs) surpass GANs in mode coverage and sample diversity, but they incur high computational cost and slower inference. To make both families more practical, we propose DiffGAN-VC, a hybrid of GANs and DDPMs, for non-parallel many-to-many voice conversion. We denoise with a small number of large steps and introduce a multimodal conditional GAN to model each denoising distribution. Both objective and subjective evaluations show that DiffGAN-VC achieves high voice quality on non-parallel datasets. Compared with the CycleGAN-VC method, DiffGAN-VC attains higher speaker similarity, naturalness, and sound quality.
Acknowledgement
This work was supported by the Key Research and Development Program of Guangdong Province under Grant No. 2021B0101400003.