Abstract
This paper proposes a novel non-parallel many-to-many voice conversion method based on a perceptual Star Generative Adversarial Network (StarGAN). First, we adopt a perceptual loss function to optimize the generator, which helps it learn high-level spectral features. Then, Switchable Normalization replaces Batch Normalization during training, so that the model learns the appropriate normalization operation at each layer. Furthermore, we add residual connections between corresponding layers of the generator's encoder and decoder, aiming to retain more semantic information and reduce the difficulty of training. Objective and subjective evaluations demonstrate that the proposed approach consistently outperforms a competitive baseline system in terms of speech quality, naturalness, and speaker similarity.
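To illustrate the Switchable Normalization idea described above, the sketch below shows a minimal NumPy forward pass over an NCHW tensor: it computes Instance-, Layer-, and Batch-Norm statistics and blends them with softmax-weighted importance logits. This is a hedged illustration of the general technique (Luo et al.), not the authors' implementation; the function name and parameter names (`w_mean`, `w_var`) are assumptions for exposition, and learned affine parameters are omitted.

```python
import numpy as np

def switchable_norm(x, w_mean, w_var, eps=1e-5):
    """Illustrative Switchable Normalization forward pass on an NCHW array.

    w_mean and w_var are 3-element logits weighting the
    (Instance, Layer, Batch) statistics for the mean and variance.
    """
    # Instance Norm statistics: per sample, per channel (over H, W)
    mu_in = x.mean(axis=(2, 3), keepdims=True)
    var_in = x.var(axis=(2, 3), keepdims=True)
    # Layer Norm statistics: per sample (over C, H, W), derived from IN stats
    mu_ln = mu_in.mean(axis=1, keepdims=True)
    var_ln = (var_in + mu_in ** 2).mean(axis=1, keepdims=True) - mu_ln ** 2
    # Batch Norm statistics: per channel (over N, H, W)
    mu_bn = mu_in.mean(axis=0, keepdims=True)
    var_bn = (var_in + mu_in ** 2).mean(axis=0, keepdims=True) - mu_bn ** 2

    def softmax(w):
        e = np.exp(w - w.max())
        return e / e.sum()

    # Blend the three statistics with learned (here: given) importance weights
    p, q = softmax(np.asarray(w_mean)), softmax(np.asarray(w_var))
    mu = p[0] * mu_in + p[1] * mu_ln + p[2] * mu_bn
    var = q[0] * var_in + q[1] * var_ln + q[2] * var_bn
    return (x - mu) / np.sqrt(var + eps)

# Equal logits give an even blend of the three normalizers
x = np.random.default_rng(0).normal(size=(4, 8, 16, 16))
y = switchable_norm(x, np.zeros(3), np.zeros(3))
```

In training, the logits would be learned jointly with the network weights, letting each layer choose its own mixture of normalizers rather than being fixed to Batch Normalization.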
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant Nos. 61401227, 61872199, and 61872424, and by the Special Project of Jinling Institute of Technology for Building an Innovative Team on Intelligent Human–Computer Interaction (218/010119200113).
Cite this article
Li, Y., Qiu, X., Cao, P. et al. Non-parallel Voice Conversion Based on Perceptual Star Generative Adversarial Network. Circuits Syst Signal Process 41, 4632–4648 (2022). https://doi.org/10.1007/s00034-022-01998-5