Abstract
Voice conversion (VC) transforms speaking style while preserving linguistic content, and deep generative models are now widely used for the task. Generative Adversarial Networks (GANs) generate high-quality samples quickly, but the generated samples lack diversity. Denoising Diffusion Probabilistic Models (DDPMs) surpass GANs in mode coverage and sample diversity, but they incur high computational cost and slower inference. To make both families more practical, we propose DiffGAN-VC, a hybrid of GANs and DDPMs, for non-parallel many-to-many voice conversion. We denoise with a small number of large steps and introduce a multimodal conditional GAN to model each denoising distribution. Both objective and subjective evaluations show that DiffGAN-VC achieves high voice quality on non-parallel datasets. Compared with the CycleGAN-VC method, DiffGAN-VC attains higher speaker similarity, naturalness, and sound quality.
Acknowledgement
This work was supported by the Key Research and Development Program of Guangdong Province under Grant No. 2021B0101400003.