Region Normalized Capsule Network Based Generative Adversarial Network for Non-parallel Voice Conversion

Akhter, Md. Tousin; Banerjee, Padmanabha; Dhar, Sandipan; Ghosh, Subhayu; Jana, Nanda Dulal

doi:10.1007/978-3-031-48309-7_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14338)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Voice conversion (VC) alters the vocal characteristics of a source speaker to resemble those of a target speaker while preserving the linguistic content. Recently, researchers have turned to deep generative models, particularly generative adversarial network (GAN) models, for VC because they outperform statistical models. However, a noticeable gap in naturalness remains between real speech samples and those generated by state-of-the-art (SOTA) VC models. This study introduces an enhanced GAN model for non-parallel VC that uses mel-spectrograms as the speech feature. To improve the quality of the generated speech samples, the model incorporates a region normalization technique in the generator and a discriminator based on capsule networks (Caps-Net). The proposed model is evaluated on the VCC 2018 and CMU Arctic datasets. The experimental results show that the region-normalization-based Caps-Net GAN (RNCapsGAN-VC) model outperforms the SOTA MaskCycleGAN-VC model in both objective and subjective evaluations while requiring less training time.
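The abstract does not spell out how region normalization is applied in the generator. As a rough illustration only, the sketch below shows the general idea of region normalization as it is usually described for image inpainting: features in the masked and unmasked regions are normalized separately, each with its own mean and standard deviation. The function name `region_normalize`, the single-channel list-of-lists input, the binary mask over time frames, and the omission of learnable scale and shift parameters are all assumptions made for this example, not details taken from the paper.

```python
import math


def region_normalize(spec, mask, eps=1e-5):
    """Normalize masked and unmasked regions of a 2-D feature map separately.

    spec: list of rows (frequency bins), each a list of floats (time frames).
    mask: same shape as spec; 1 marks the unmasked region, 0 the masked one.
    Hypothetical helper for illustration; learnable affine terms are omitted.
    """
    rows, cols = len(spec), len(spec[0])
    out = [[0.0] * cols for _ in range(rows)]
    for flag in (1, 0):
        # Collect the coordinates that belong to the current region.
        cells = [(i, j) for i in range(rows) for j in range(cols)
                 if mask[i][j] == flag]
        if not cells:
            continue  # An empty region contributes nothing.
        mean = sum(spec[i][j] for i, j in cells) / len(cells)
        var = sum((spec[i][j] - mean) ** 2 for i, j in cells) / len(cells)
        std = math.sqrt(var + eps)
        # Each region is standardized with its own statistics.
        for i, j in cells:
            out[i][j] = (spec[i][j] - mean) / std
    return out


# Toy "mel-spectrogram": 2 frequency bins x 4 frames, last two frames masked.
toy_spec = [[1.0, 2.0, 10.0, 20.0],
            [3.0, 4.0, 30.0, 40.0]]
toy_mask = [[1, 1, 0, 0],
            [1, 1, 0, 0]]
normalized = region_normalize(toy_spec, toy_mask)
```

Normalizing the two regions separately keeps the statistics of the masked frames from shifting the mean and variance of the unmasked ones, which is the motivation usually given for this technique.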

Notes

1. The generated speech samples and code implementation can be found at https://github.com/BlueBlaze6335/RNCapsGAN-VC.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, West Bengal, India
Md. Tousin Akhter, Sandipan Dhar, Subhayu Ghosh & Nanda Dulal Jana
Department of Electronics and Communication Engineering, Jalpaiguri Government Engineering College, Jalpaiguri, West Bengal, India
Padmanabha Banerjee

Corresponding authors

Correspondence to Sandipan Dhar or Nanda Dulal Jana.

Editor information

Editors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
K. Samudravijaya
Indian Institute of Information Technology Dharwad, Dharwad, India
K. T. Deepak
Indian Institute of Technology Dharwad, Dharwad, India
Rajesh M. Hegde
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal
Indian Institute of Technology Dharwad, Dharwad, India
S. R. Mahadeva Prasanna

About this paper

Cite this paper

Akhter, M.T., Banerjee, P., Dhar, S., Ghosh, S., Jana, N.D. (2023). Region Normalized Capsule Network Based Generative Adversarial Network for Non-parallel Voice Conversion. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_20

DOI: https://doi.org/10.1007/978-3-031-48309-7_20
Published: 22 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48308-0
Online ISBN: 978-3-031-48309-7
eBook Packages: Computer Science, Computer Science (R0)
