Text-To-Speech Synthesis

Hinterleitner, Florian; Norrenbrock, Christoph; Möller, Sebastian; Heute, Ulrich

doi:10.1007/978-3-319-02681-7_13

Part of the book series: T-Labs Series in Telecommunication Services (TLABS)

Abstract

In this chapter, we address the quality experienced when listening to speech generated from text by state-of-the-art synthesis systems. Such systems are used, e.g., in information and navigation systems, but also for generating audiobooks. We describe both auditory evaluation methods and instrumental models that predict perceived Quality of Experience (QoE). Besides overall perceived quality, we focus on perceptual quality features that can be used for diagnosis and system optimization.

Notes

1. An extensive collection of speech produced by German-speaking synthesizers can be found in [4].

References

1. ASA S3.2-2009 (2009) American national standard method for measuring the intelligibility of speech over communication systems. American National Standards of the Acoustical Society of America, Washington
2. Benoit C, Grice M, Hazan V (1996) The SUS test: a method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences. Speech Communication 18(4):381–392
3. Black AW, Taylor PA (1994) CHATR: a generic speech synthesis system. In: COLING 1994, vol 2. pp 983–986
4. Burkhardt F (2013) Comparison of German TTS-systems. Cited 20 Apr 2013. http://syntheticspeech.de/index.html
5. Cernak M, Rusko M (2005) An evaluation of synthetic speech using the PESQ measure. In: Proceedings of forum acusticum, Budapest, Hungary, pp 2725–2728
6. Chu M, Peng H (2001) An objective measure for estimating MOS of synthesized speech. In: Proceedings of the 7th international conference on speech communication and technology (Eurospeech 2001), Aalborg, Denmark, pp 2087–2090
7. Côté N (2011) Integral and diagnostic intrusive prediction of speech quality. Springer, Heidelberg
8. Falk TH, Möller S (2008) Towards signal-based instrumental quality diagnosis for text-to-speech systems. IEEE Signal Processing Letters 15:781–784
9. Fujisaki H (1981) Dynamic characteristics of voice fundamental frequency in speech and singing. Acoustical analysis and physiological interpretations. In: STL-QPSR, vol 22. pp 1–20
10. Gibbon D, Moore R, Winski R (1997) Handbook of standards and resources for spoken language systems. De Gruyter Mouton, Berlin, Boston
11. Hinterleitner F, Möller S, Norrenbrock C, Heute U (2011) Perceptual quality dimensions of text-to-speech systems. In: Proceedings of the 12th annual conference of the international speech communication association (Interspeech 2011), Florence, Italy, pp 2177–2180
12. Hinterleitner F, Neitzel G, Möller S, Norrenbrock C (2011) An evaluation protocol for the subjective assessment of text-to-speech in audiobook reading tasks. In: Proceedings of the Blizzard challenge workshop, Florence, Italy
13. Hinterleitner F, Zabel S, Möller S, Leutelt L, Norrenbrock C (2011) Predicting the quality of synthesized speech using reference-based prediction measures. In: Proceedings of the 22nd Konferenz Elektronische Sprachsignalverarbeitung (ESSV 2011), Aachen, Germany, pp 99–106
14. Hinterleitner F, Norrenbrock C, Möller S (2012) On the use of Fujisaki parameters for the quality prediction of synthetic speech. In: Proceedings of the 23rd Konferenz Elektronische Sprachsignalverarbeitung (ESSV 2012), Cottbus, Germany, pp 112–119
15. Hinterleitner F, Norrenbrock C, Möller S, Heute U (2012) What makes this voice sound so bad? A multidimensional analysis of state-of-the-art text-to-speech systems. In: Proceedings of the 2012 IEEE workshop on spoken language technology (SLT), Miami, USA, pp 240–245
16. Hinterleitner F, Norrenbrock C, Möller S (2013) Perceptual quality dimensions of text-to-speech in audiobook reading tasks. In: Proceedings of the 24th Konferenz Elektronische Sprachsignalverarbeitung (ESSV 2013), Bielefeld, Germany, pp 44–49
17. Hinterleitner F, Norrenbrock C, Möller S, Heute U (2013) Predicting the quality of text-to-speech systems from a large-scale feature set, Lyon, France, pp 383–387
18. ITU-T Recommendation P.85 (1994) A method for subjective performance assessment of the quality of speech voice output devices. International Telecommunication Union, Geneva
19. ITU-T Recommendation P.563 (2004) Single ended method for objective speech quality assessment in narrow-band telephony. International Telecommunication Union, Geneva
20. ITU-T Recommendation P.862 (2001) Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva
21. ITU-T Recommendation P.863 (2011) Perceptual objective listening quality assessment (POLQA). International Telecommunication Union, Geneva
22. Jekosch U (1993) Speech quality assessment and evaluation. In: Proceedings of Eurospeech, Berlin, Germany, pp 1387–1394
23. Klatt DH (1980) Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America 67(3):971–995
24. Kraft V, Portele T (1995) Quality evaluation of five German speech synthesis systems. Acta Acustica 3:351–365
25. Mariniak A (1993) A global framework for the assessment of synthetic speech without subjects. In: Proceedings of the 3rd European conference on speech processing and technology (Eurospeech), Berlin, Germany, pp 1683–1686
26. Mayo C, Clark RAJ, King S (2005) Listener’s weighting of acoustic cues to synthetic speech naturalness: a multidimensional scaling analysis. In: Proceedings of the 6th annual conference of the international speech communication association (Interspeech), Lisbon, Portugal, pp 1725–1728
27. Minker W, Lee GG, Mariani J, Nakamura S (2010) Salient features for anger recognition in German and English IVR portals. Spoken dialogue systems technology and design. Springer
28. Möller S, Hinterleitner F (2013) ITU-T Contribution COM 12–37: proposal for an appendix to Rec. P.85 of the evaluation of speech output for audiobook reading tasks. Deutsche Telekom AG, ITU-T SG12 meeting 19–28 Mar 2013, Geneva
29. Möller S, Hinterleitner F, Falk TH, Polzehl T (2010) Comparison of approaches for instrumentally predicting the quality of text-to-speech systems. In: Proceedings of the 11th annual conference of the international speech communication association (Interspeech 2010), Makuhari, Japan, pp 1325–1328
30. Moulines E, Charpentier F (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9(5/6):453–467
31. Norrenbrock C, Hinterleitner F, Heute U, Möller S (2012) Instrumental assessment of prosodic quality for text-to-speech signals. IEEE Signal Processing Letters 19:255–258
32. Norrenbrock C, Hinterleitner F, Heute U, Möller S (2012) Quality analysis of macroprosodic $F_{0}$ dynamics in text-to-speech signals. In: Proceedings of the 13th annual conference of the international speech communication association (Interspeech 2012), Portland, USA, pp 454–457
33. Norrenbrock C, Hinterleitner F, Heute U, Möller S (2012) Towards perceptual quality modeling of synthesized audiobooks. In: Proceedings of the Blizzard challenge workshop, Portland, USA
34. Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286
35. Sityaev D, Knill K, Burrows T (2006) Comparison of the ITU-T P.85 standard to other methods for the evaluation of text-to-speech systems. In: Proceedings of the 9th international conference on spoken language processing (Interspeech), Pittsburgh, USA, pp 1077–1080
36. Tokuda K, Zen H, Black AW (2002) An HMM-based speech synthesis system applied to English. In: Proceedings of 2002 IEEE speech synthesis workshop, Santa Monica, USA, pp 227–230
37. Tsogo L, Masson MH, Bardot A (2000) Multidimensional scaling methods for many-objects sets: a review. Multivariate Behavioral Research 35(3):307–319
38. Viswanathan M, Viswanathan M (2005) Measuring speech quality for text-to-speech systems: development and assessment of a modified mean opinion score (MOS) scale. Computer Speech and Language 19(1):55–83

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG), grants MO-1138/11-1, MO-1138/11-2, HE-4465/4-1 and HE-4465/4-2.

Author information

Authors and Affiliations

Quality and Usability Lab, Telekom Innovation Laboratories, TU Berlin, Berlin, Germany
Florian Hinterleitner & Sebastian Möller
Digital Signal Processing and System Theory, CAU Kiel, Kiel, Germany
Christoph Norrenbrock & Ulrich Heute

Corresponding author

Correspondence to Florian Hinterleitner.

Editor information

Editors and Affiliations

Quality and Usability Lab, Telekom Innovation Laboratories, TU Berlin, Berlin, Germany
Sebastian Möller
Assessment of IP-based Applications, Telekom Innovation Laboratories, TU Berlin, Berlin, Germany
Alexander Raake

About this chapter

Cite this chapter

Hinterleitner, F., Norrenbrock, C., Möller, S., Heute, U. (2014). Text-To-Speech Synthesis. In: Möller, S., Raake, A. (eds) Quality of Experience. T-Labs Series in Telecommunication Services. Springer, Cham. https://doi.org/10.1007/978-3-319-02681-7_13

DOI: https://doi.org/10.1007/978-3-319-02681-7_13
Published: 05 March 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02680-0
Online ISBN: 978-3-319-02681-7
eBook Packages: Engineering, Engineering (R0)
