Improvements to Prosodic Variation in Long Short-Term Memory Based Intonation Models Using Random Forest

Tóth, Bálint Pál; Szórádi, Balázs; Németh, Géza

doi:10.1007/978-3-319-43958-7_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9811)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Statistical parametric speech synthesis has surpassed unit selection methods in many respects, including flexibility and variability. However, the intonation of these systems is quite monotonous, especially in longer sentences, because statistical modeling reduces the variation of fundamental frequency (F0) trajectories. In this research, a random forest (RF) based classifier was trained on radio conversations, using the variation perceived by a human annotator as the target. This classifier was then used to extend the labels of a phonetically balanced, studio-quality speech corpus. With the extended labels, a Long Short-Term Memory (LSTM) network was trained to model F0. Objective and subjective evaluations were carried out. The results show that the variation of the generated F0 trajectories can be fine-tuned through an additional input to the LSTM network.
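The abstract does not specify how F0 variation was measured or how the extra input was encoded. The following minimal sketch is therefore illustrative only. It assumes one plausible objective measure (the standard deviation of voiced F0 values in semitones) and shows how a single utterance-level variation label could be appended to each frame's input vector. The function names, the 100 Hz reference, and the convention that unvoiced frames have F0 of 0 are all hypothetical and are not taken from the paper.

```python
import math
import statistics


def f0_variation_semitones(f0_hz, ref_hz=100.0):
    """Standard deviation of voiced F0 values, in semitones relative to ref_hz.

    Assumption: frames with F0 <= 0 are unvoiced and are skipped.
    """
    voiced = [12.0 * math.log2(f / ref_hz) for f in f0_hz if f > 0]
    if len(voiced) < 2:
        return 0.0
    return statistics.pstdev(voiced)


def extend_with_variation_label(frame_features, variation_label):
    """Append one utterance-level variation label to every frame's input vector.

    In the setting described above, the label would come from a trained
    classifier; here it is simply passed in as a number.
    """
    return [list(frame) + [float(variation_label)] for frame in frame_features]


# Two toy F0 tracks in Hz (0.0 marks an unvoiced frame).
flat = [120.0, 121.0, 0.0, 119.0, 120.0]
lively = [100.0, 200.0, 0.0, 100.0, 200.0]

flat_var = f0_variation_semitones(flat)
lively_var = f0_variation_semitones(lively)

# Hypothetical per-frame linguistic features, extended with a label of 1.
extended = extend_with_variation_label([[0.1, 0.2], [0.3, 0.4]], 1)
```

With a measure of this kind, trajectories generated under different values of the additional input could be compared: a higher label would be expected to yield a larger variation value if the input has the intended effect.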

Acknowledgments

We would like to thank Mátyás Bartalis for his help in creating the subjective listening test, and the listeners for participating in it. Bálint Pál Tóth gratefully acknowledges the support of NVIDIA Corporation with the donation of an NVIDIA Titan X GPU used for his research. This research is partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF n° IZ73Z0_152495-1).

Author information

Authors and Affiliations

Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Bálint Pál Tóth, Balázs Szórádi & Géza Németh

Corresponding author

Correspondence to Bálint Pál Tóth.

Editor information

Editors and Affiliations

SPIIRAS, Saint-Petersburg, Russia
Andrey Ronzhin
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
Budapest University of Technology and Economics, Budapest, Hungary
Géza Németh

Cite this paper

Tóth, B.P., Szórádi, B., Németh, G. (2016). Improvements to Prosodic Variation in Long Short-Term Memory Based Intonation Models Using Random Forest. In: Ronzhin, A., Potapova, R., Németh, G. (eds) Speech and Computer. SPECOM 2016. Lecture Notes in Computer Science, vol 9811. Springer, Cham. https://doi.org/10.1007/978-3-319-43958-7_46

DOI: https://doi.org/10.1007/978-3-319-43958-7_46
Published: 13 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43957-0
Online ISBN: 978-3-319-43958-7
eBook Packages: Computer Science; Computer Science (R0)
