Adding Singing Capabilities to Unit Selection TTS Through HNM-Based Conversion

Freixes, Marc; Socoró, Joan Claudi; Alías, Francesc

doi:10.1007/978-3-319-49169-1_4


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10077)

Included in the following conference series:

International Conference on Advances in Speech and Language Technologies for Iberian Languages


Abstract

Adding singing capabilities to a corpus-based concatenative text-to-speech (TTS) system can be addressed by explicitly collecting singing samples from the previously recorded speaker. However, this approach is only feasible if that speaker is also a singing talent. As an alternative, we consider appending a Harmonic plus Noise Model (HNM) speech-to-singing conversion module to a Unit Selection TTS (US-TTS) system. Two text-to-speech-to-singing synthesis approaches are studied: applying the speech-to-singing conversion to the US-TTS synthetic output, or implementing a hybrid US+HNM synthesis framework. The perceptual tests show that the speech-to-singing conversion yields a singing resemblance similar to that of the natural version, but with lower naturalness. Moreover, no statistically significant differences are found between the two strategies in terms of either naturalness or singing resemblance. Finally, the hybrid approach reduces the overall computational cost by a factor of more than two.
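To give a rough idea of what a harmonic-plus-noise representation makes possible, the following minimal sketch synthesizes one frame as a sum of harmonics of a fundamental frequency plus white noise, and then re-synthesizes it at the pitch of a target musical note. It is not the authors' implementation: the function names (`hnm_frame`, `midi_to_hz`, `sing_frame`), the parameters, and the simplification of keeping the harmonic amplitudes unchanged when the pitch moves are all assumptions made for illustration. A real HNM system would typically resample the spectral envelope at the new harmonic frequencies and also handle phases, durations, and vibrato.

```python
import math
import random


def midi_to_hz(note):
    """Convert a MIDI note number to frequency in Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


def hnm_frame(f0_hz, amplitudes, noise_gain, n_samples, sample_rate=16000, seed=0):
    """Synthesize one frame as harmonics of f0_hz plus Gaussian white noise.

    amplitudes[k] is the (hypothetical) amplitude of harmonic k + 1.
    Phases are set to zero to keep the sketch short.
    """
    rng = random.Random(seed)
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        harmonic_part = sum(
            a * math.cos(2 * math.pi * (k + 1) * f0_hz * t)
            for k, a in enumerate(amplitudes)
        )
        noise_part = noise_gain * rng.gauss(0.0, 1.0)
        frame.append(harmonic_part + noise_part)
    return frame


def sing_frame(amplitudes, noise_gain, midi_note, n_samples, sample_rate=16000):
    """Re-synthesize a frame at the pitch of a target note (illustrative only)."""
    return hnm_frame(midi_to_hz(midi_note), amplitudes, noise_gain, n_samples, sample_rate)


# Example: a frame with three harmonics, moved to A3 (MIDI note 57).
sung = sing_frame([1.0, 0.5, 0.25], noise_gain=0.0, midi_note=57, n_samples=160)
```

Because the model parameters are explicit, the pitch of a frame can be replaced by the pitch of a score note without re-recording the speaker, which is the general property that a speech-to-singing conversion module relies on.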



Acknowledgements

Marc Freixes thanks the European Social Fund (ESF) and the Catalan Government (SUR/DEC) for their support through the pre-doctoral FI grant No. 2016FI_B2 00094. This work has been partially funded by SUR/DEC (grant ref. 2014-SGR-0590). We also want to thank the people who took the perceptual test and Raúl Montaño for his help with the statistics.

Author information

Authors and Affiliations

GTM – Grup de Recerca en Tecnologies Mèdia, La Salle - Universitat Ramon Llull, Quatre Camins, 30, 08022, Barcelona, Spain
Marc Freixes, Joan Claudi Socoró & Francesc Alías


Corresponding author

Correspondence to Marc Freixes.

Editor information

Editors and Affiliations

INESC-ID/IST, Universidade de Lisboa, Lisbon, Portugal
Alberto Abad
I3A/University of Zaragoza, Zaragoza, Spain
Alfonso Ortega
DETI/IEETA, University of Aveiro, Aveiro, Portugal
António Teixeira
AtlantTIC Research Center, Universidad de Vigo, Vigo, Spain
Carmen García Mateo
Universitat Politècnica de València, Valencia, Spain
Carlos D. Martínez Hinarejos
University of Coimbra, Coimbra, Portugal
Fernando Perdigão
INESC-ID/ISCTE-IUL, Lisbon, Portugal
Fernando Batista
INESC-ID/IST, Universidade de Lisboa, Lisbon, Portugal
Nuno Mamede


About this paper

Cite this paper

Freixes, M., Socoró, J.C., Alías, F. (2016). Adding Singing Capabilities to Unit Selection TTS Through HNM-Based Conversion. In: Abad, A., et al. Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016. Lecture Notes in Computer Science, vol 10077. Springer, Cham. https://doi.org/10.1007/978-3-319-49169-1_4


DOI: https://doi.org/10.1007/978-3-319-49169-1_4
Published: 04 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49168-4
Online ISBN: 978-3-319-49169-1
eBook Packages: Computer Science, Computer Science (R0)
