
Non-parallel Voice Conversion Based on Perceptual Star Generative Adversarial Network


Abstract

This paper proposes a novel non-parallel many-to-many voice conversion method based on a perceptual Star Generative Adversarial Network (StarGAN). First, we adopt a perceptual loss function to optimize the generator, which helps it learn high-level spectral features. Second, Switchable Normalization replaces Batch Normalization during training, so that each normalization layer learns its own combination of normalization operations. Finally, residual connections establish mappings between corresponding layers of the generator's encoder and decoder, retaining more semantic information and reducing training difficulty. Objective and subjective evaluations demonstrate that the proposed approach consistently outperforms a competitive baseline system in speech quality, naturalness, and speaker similarity.
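
To make the abstract's ingredients concrete, below is a minimal PyTorch sketch of the two less-common ones: a Switchable Normalization layer that learns how to mix Instance-, Layer-, and Batch-Norm statistics, and a perceptual loss that compares activations of a frozen feature network rather than raw spectral frames. The framework choice, tensor shapes, and the helper `feat_net` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableNorm2d(nn.Module):
    """Sketch of Switchable Normalization (Luo et al.): the layer learns
    softmax weights that mix Instance-, Layer-, and Batch-Norm statistics.
    Running statistics for inference are omitted for brevity."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        self.mean_w = nn.Parameter(torch.ones(3))  # weights over [IN, LN, BN] means
        self.var_w = nn.Parameter(torch.ones(3))   # weights over [IN, LN, BN] variances

    def forward(self, x):  # x: (N, C, H, W), e.g. batched spectral feature maps
        mu_in = x.mean((2, 3), keepdim=True)       # IN stats: per sample, per channel
        var_in = x.var((2, 3), keepdim=True, unbiased=False)
        mu_ln = x.mean((1, 2, 3), keepdim=True)    # LN stats: per sample, all channels
        var_ln = x.var((1, 2, 3), keepdim=True, unbiased=False)
        mu_bn = x.mean((0, 2, 3), keepdim=True)    # BN stats: per channel, whole batch
        var_bn = x.var((0, 2, 3), keepdim=True, unbiased=False)
        mw = F.softmax(self.mean_w, 0)
        vw = F.softmax(self.var_w, 0)
        mu = mw[0] * mu_in + mw[1] * mu_ln + mw[2] * mu_bn
        var = vw[0] * var_in + vw[1] * var_ln + vw[2] * var_bn
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta

def perceptual_loss(feat_net, x_fake, x_real):
    """Perceptual loss in the spirit of Johnson et al.: compare activations
    of a frozen, pretrained feature extractor (`feat_net`, a hypothetical
    stand-in for the paper's high-level spectral feature network)."""
    with torch.no_grad():
        target = feat_net(x_real)  # fixed target features, no gradient
    return F.mse_loss(feat_net(x_fake), target)
```

The third ingredient, residual mappings between encoder and decoder, amounts to adding each encoder activation to the matching decoder activation (`dec = dec + enc_feat`) in the ResNet style of He et al., which shortens gradient paths and passes low-level spectral detail straight to the decoder.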

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61401227, 61872199, and 61872424, and by the Special Project in Jinling Institute of Technology for Building Innovative Team on Intelligent Human Computer Interaction (218/010119200113).

Author information

Corresponding author: Yanping Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Li, Y., Qiu, X., Cao, P. et al. Non-parallel Voice Conversion Based on Perceptual Star Generative Adversarial Network. Circuits Syst Signal Process 41, 4632–4648 (2022). https://doi.org/10.1007/s00034-022-01998-5
