Abstract
This paper proposes a novel non-parallel many-to-many voice conversion method based on a perceptual Star Generative Adversarial Network (StarGAN). First, we adopt a perceptual loss function to optimize the generator, which helps it learn high-level spectral features. Then, Switchable Normalization replaces Batch Normalization during training, so that the model learns the appropriate normalization operation at each layer. Furthermore, we add residual connections between corresponding layers of the generator's encoder and decoder, aiming to retain more semantic information and reduce the difficulty of training. Objective and subjective evaluations demonstrate that the proposed approach consistently outperforms a competitive baseline system in terms of speech quality, naturalness, and speaker similarity.
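To illustrate the Switchable Normalization idea described above, the sketch below shows a minimal NumPy forward pass over an NCHW tensor: it computes Instance-, Layer-, and Batch-Norm statistics and blends them with softmax-weighted importance logits. This is a hedged illustration of the general technique (Luo et al.), not the authors' implementation; the function name and parameter names (`w_mean`, `w_var`) are assumptions for exposition, and learned affine parameters are omitted.

```python
import numpy as np

def switchable_norm(x, w_mean, w_var, eps=1e-5):
    """Illustrative Switchable Normalization forward pass on an NCHW array.

    w_mean and w_var are 3-element logits weighting the
    (Instance, Layer, Batch) statistics for the mean and variance.
    """
    # Instance Norm statistics: per sample, per channel (over H, W)
    mu_in = x.mean(axis=(2, 3), keepdims=True)
    var_in = x.var(axis=(2, 3), keepdims=True)
    # Layer Norm statistics: per sample (over C, H, W), derived from IN stats
    mu_ln = mu_in.mean(axis=1, keepdims=True)
    var_ln = (var_in + mu_in ** 2).mean(axis=1, keepdims=True) - mu_ln ** 2
    # Batch Norm statistics: per channel (over N, H, W)
    mu_bn = mu_in.mean(axis=0, keepdims=True)
    var_bn = (var_in + mu_in ** 2).mean(axis=0, keepdims=True) - mu_bn ** 2

    def softmax(w):
        e = np.exp(w - w.max())
        return e / e.sum()

    # Blend the three statistics with learned (here: given) importance weights
    p, q = softmax(np.asarray(w_mean)), softmax(np.asarray(w_var))
    mu = p[0] * mu_in + p[1] * mu_ln + p[2] * mu_bn
    var = q[0] * var_in + q[1] * var_ln + q[2] * var_bn
    return (x - mu) / np.sqrt(var + eps)

# Equal logits give an even blend of the three normalizers
x = np.random.default_rng(0).normal(size=(4, 8, 16, 16))
y = switchable_norm(x, np.zeros(3), np.zeros(3))
```

In training, the logits would be learned jointly with the network weights, letting each layer choose its own mixture of normalizers rather than being fixed to Batch Normalization.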
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant Nos. 61401227, 61872199, and 61872424, and by the Special Project of Jinling Institute of Technology for Building an Innovative Team on Intelligent Human–Computer Interaction (218/010119200113).
Cite this article
Li, Y., Qiu, X., Cao, P. et al. Non-parallel Voice Conversion Based on Perceptual Star Generative Adversarial Network. Circuits Syst Signal Process 41, 4632–4648 (2022). https://doi.org/10.1007/s00034-022-01998-5