
Ensemble Deep Neural Network Based Waveform-Driven Stress Model for Speech Synthesis

  • Conference paper
Speech and Computer (SPECOM 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9811)


Abstract

Stress annotations in the training corpora of speech synthesis systems are usually obtained by applying language rules to the transcripts. However, the actual stress patterns seen in the waveform are not guaranteed to be canonical: they can deviate from the locations defined by language rules, mostly due to speaker-dependent factors. Stress models trained on such corpora can therefore be far from perfect. This paper proposes a waveform-based stress annotation technique. Four feedforward deep neural networks (DNNs), one per stress class, were trained to model the fundamental frequency (F0) of speech. During synthesis, stress labels are generated from the textual input, and an ensemble of the four DNNs predicts the F0 trajectories. Objective and subjective evaluations were carried out; the results show that the proposed method surpasses the quality of vanilla DNN-based F0 models.
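As a rough illustration of the ensemble idea described in the abstract, the sketch below trains one small feedforward F0 regressor per stress class and, at synthesis time, routes each frame to the network of its text-derived stress label. The feature dimension, layer sizes, and mask-based routing are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a per-stress-class DNN ensemble for F0 prediction.
# All hyperparameters and the routing scheme are assumptions.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

N_FEATURES = 64  # assumed size of the per-frame linguistic feature vector
N_CLASSES = 4    # four stress classes, as in the paper

def build_f0_dnn():
    """One feedforward regressor predicting a scalar F0 value per frame."""
    model = Sequential([
        Dense(256, activation="relu", input_shape=(N_FEATURES,)),
        Dense(256, activation="relu"),
        Dense(1),  # predicted F0 (e.g., log-F0) for the frame
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# One DNN per waveform-derived stress class; each would be trained on the
# subset of frames carrying that class's annotation (training omitted here).
ensemble = [build_f0_dnn() for _ in range(N_CLASSES)]

def predict_f0(frames: np.ndarray, stress_labels: np.ndarray) -> np.ndarray:
    """Route each frame to the DNN of its stress class; the labels come
    from the textual input at synthesis time."""
    f0 = np.empty(len(frames))
    for c, model in enumerate(ensemble):
        mask = stress_labels == c
        if mask.any():
            f0[mask] = model.predict(frames[mask], verbose=0).ravel()
    return f0
```

The abstract does not specify how the four networks are combined, so hard routing by predicted stress class is only one plausible reading; averaging the outputs of all four networks would be another.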



Acknowledgments

We would like to thank Mátyás Bartalis for his help in creating the subjective listening test and the listeners for participating in it. Bálint Pál Tóth gratefully acknowledges the support of NVIDIA Corporation with the donation of an NVIDIA Titan X GPU used for his research. This research was partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF n° IZ73Z0_152495-1).

Author information


Corresponding author

Correspondence to Bálint Pál Tóth.



Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Tóth, B.P., Kis, K.I., Szaszák, G., Németh, G. (2016). Ensemble Deep Neural Network Based Waveform-Driven Stress Model for Speech Synthesis. In: Ronzhin, A., Potapova, R., Németh, G. (eds) Speech and Computer. SPECOM 2016. Lecture Notes in Computer Science, vol. 9811. Springer, Cham. https://doi.org/10.1007/978-3-319-43958-7_32

  • DOI: https://doi.org/10.1007/978-3-319-43958-7_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43957-0

  • Online ISBN: 978-3-319-43958-7

  • eBook Packages: Computer Science, Computer Science (R0)
