Abstract
In this paper, a multi-voice singing synthesis framework is proposed to convert lyrics into their sung version in a target speaker's voice. The framework consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. At the front end, a TTS converter generates synthesized speech from the lyrics in the target speaker's voice. An encoder–decoder model in the STS module then synthesizes a sung version in the target melody. Finally, phonetic intelligibility is enhanced by an intelligibility enhancement module based on an audio style transfer scheme. The proposed system is systematically evaluated on the LibriSpeech and NUS-48E corpora using both subjective and objective measures. We compare our model with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). Our study shows that the proposed model performs on par with the baseline without requiring any phoneme annotations.
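The three-block cascade described above can be sketched as a simple function pipeline. This is a minimal illustrative sketch only: the function names and placeholder bodies are hypothetical stand-ins, not the paper's implementation, which uses a neural TTS front end, an encoder–decoder STS model, and a style-transfer-based enhancer.

```python
# Hypothetical sketch of the three-block lyrics-to-singing pipeline.
# Each stage is a placeholder; real blocks operate on audio/spectrograms.

def text_to_speech(lyrics: str, speaker_id: int) -> dict:
    """TTS front end: lyrics -> synthesized speech in the target voice."""
    return {"speech": f"speech({lyrics})", "speaker": speaker_id}

def speech_to_singing(speech: dict, melody: str) -> dict:
    """STS block (encoder-decoder): spoken audio + melody -> sung version."""
    return {"singing": f"sung({speech['speech']})", "melody": melody}

def enhance_intelligibility(singing: dict, reference: dict) -> dict:
    """Style-transfer block: sharpen phonetic content of the sung output,
    using the synthesized speech as the phonetic style reference."""
    return {"output": f"enhanced({singing['singing']})",
            "style_ref": reference["speech"]}

def synthesize_singing(lyrics: str, speaker_id: int, melody: str) -> dict:
    speech = text_to_speech(lyrics, speaker_id)
    singing = speech_to_singing(speech, melody)
    return enhance_intelligibility(singing, speech)

result = synthesize_singing("twinkle twinkle", speaker_id=7,
                            melody="C4 C4 G4 G4")
print(result["output"])
```

Note how the intelligibility stage takes the intermediate synthesized speech as its style reference, mirroring the paper's design in which the TTS output supplies the phonetic content that the enhancement module transfers onto the sung signal.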
Data Availability
The datasets analysed in this manuscript are publicly available.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Resna, S., Rajan, R. Multi-Voice Singing Synthesis From Lyrics. Circuits Syst Signal Process 42, 307–321 (2023). https://doi.org/10.1007/s00034-022-02122-3