
Multi-Voice Singing Synthesis From Lyrics

Published in: Circuits, Systems, and Signal Processing

Abstract

In this paper, a multi-voice singing synthesis framework is proposed to convert lyrics to their sung version in a target speaker’s voice. It consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. In the front end, a TTS converter generates synthesized speech from the lyrics in the target speaker’s voice. A sung version is then synthesized in the target melody by an encoder–decoder model in the STS module. Finally, phonetic intelligibility is enhanced by an intelligibility enhancement module based on an audio style transfer scheme. The proposed system is systematically evaluated on the LibriSpeech and NUS-48E corpora using both subjective and objective measures. We compare our model with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). Our study shows that the proposed model performs on par with the baseline model without requiring any phoneme annotations.
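The three-block pipeline described above can be sketched as a simple function composition. This is a minimal illustrative sketch only: the function names, spectrogram shapes, and blending step are assumptions for demonstration, not the authors' actual models or API.

```python
# Hypothetical sketch of the three-block lyrics-to-singing pipeline:
# TTS front end -> encoder-decoder STS -> style-transfer intelligibility step.
# All components are stand-ins; real systems would use trained neural models.
import numpy as np

def tts_synthesize(lyrics: str, speaker_id: int) -> np.ndarray:
    """Stand-in TTS front end: lyrics -> spoken-voice mel-like spectrogram."""
    # Placeholder: one 80-bin frame per character, seeded per speaker.
    rng = np.random.default_rng(speaker_id)
    return rng.standard_normal((len(lyrics), 80))

def sts_convert(speech_spec: np.ndarray, melody_f0: np.ndarray) -> np.ndarray:
    """Stand-in encoder-decoder STS: align speech frames to the melody length."""
    idx = np.linspace(0, len(speech_spec) - 1, num=len(melody_f0)).astype(int)
    return speech_spec[idx]

def enhance_intelligibility(sung_spec: np.ndarray,
                            speech_spec: np.ndarray,
                            alpha: float = 0.3) -> np.ndarray:
    """Stand-in style-transfer step: blend phonetic content back into the
    sung spectrogram to sharpen intelligibility."""
    idx = np.linspace(0, len(speech_spec) - 1, num=len(sung_spec)).astype(int)
    return (1 - alpha) * sung_spec + alpha * speech_spec[idx]

def sing_from_lyrics(lyrics: str, speaker_id: int,
                     melody_f0: np.ndarray) -> np.ndarray:
    speech = tts_synthesize(lyrics, speaker_id)      # block 1: TTS
    sung = sts_convert(speech, melody_f0)            # block 2: STS
    return enhance_intelligibility(sung, speech)     # block 3: enhancement
```

The key design point carried over from the paper is that each block consumes the previous block's output, so no phoneme annotations are needed anywhere in the chain.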


Data Availability

The datasets analysed in this manuscript are publicly available.


Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Rajeev Rajan.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Resna, S., Rajan, R. Multi-Voice Singing Synthesis From Lyrics. Circuits Syst Signal Process 42, 307–321 (2023). https://doi.org/10.1007/s00034-022-02122-3

