Abstract
In this paper, a multi-voice singing synthesis framework is proposed to convert lyrics into their sung version in a target speaker's voice. The framework consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. At the front end, a TTS converter generates synthesized speech from the lyrics in the target speaker's voice. An encoder–decoder model in the STS module then synthesizes a sung version in the target melody. Finally, phonetic intelligibility is enhanced by an intelligibility enhancement module based on an audio style transfer scheme. The proposed system is systematically evaluated on the LibriSpeech and NUS-48E corpora using both subjective and objective measures. We compare our model with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). Our study shows that the proposed model performs on par with the baseline without requiring any phoneme annotations.
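The three-block cascade described above can be sketched as a simple function pipeline. This is a minimal illustrative sketch only: the function names and placeholder bodies are hypothetical stand-ins, not the paper's implementation, which uses a neural TTS front end, an encoder–decoder STS model, and a style-transfer-based enhancer.

```python
# Hypothetical sketch of the three-block lyrics-to-singing pipeline.
# Each stage is a placeholder; real blocks operate on audio/spectrograms.

def text_to_speech(lyrics: str, speaker_id: int) -> dict:
    """TTS front end: lyrics -> synthesized speech in the target voice."""
    return {"speech": f"speech({lyrics})", "speaker": speaker_id}

def speech_to_singing(speech: dict, melody: str) -> dict:
    """STS block (encoder-decoder): spoken audio + melody -> sung version."""
    return {"singing": f"sung({speech['speech']})", "melody": melody}

def enhance_intelligibility(singing: dict, reference: dict) -> dict:
    """Style-transfer block: sharpen phonetic content of the sung output,
    using the synthesized speech as the phonetic style reference."""
    return {"output": f"enhanced({singing['singing']})",
            "style_ref": reference["speech"]}

def synthesize_singing(lyrics: str, speaker_id: int, melody: str) -> dict:
    speech = text_to_speech(lyrics, speaker_id)
    singing = speech_to_singing(speech, melody)
    return enhance_intelligibility(singing, speech)

result = synthesize_singing("twinkle twinkle", speaker_id=7,
                            melody="C4 C4 G4 G4")
print(result["output"])
```

Note how the intelligibility stage takes the intermediate synthesized speech as its style reference, mirroring the paper's design in which the TTS output supplies the phonetic content that the enhancement module transfers onto the sung signal.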
Data Availability
The datasets analysed in this manuscript are publicly available.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Resna, S., Rajan, R. Multi-Voice Singing Synthesis From Lyrics. Circuits Syst Signal Process 42, 307–321 (2023). https://doi.org/10.1007/s00034-022-02122-3