Creating Expressive TTS Voices for Conversation Agent Applications

Breen, Andrew

doi:10.1007/978-3-319-11581-8_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8773)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Text-to-Speech (TTS) has traditionally been viewed as a “black box” component, where standard “portfolio” voices are typically offered with a professional but “neutral” speaking style. For commercially important languages, many different portfolio voices may be offered, all with similar speaking styles. A customer wishing to use TTS will typically choose one of these voices. The only alternative is to opt for a “custom voice” solution, in which a customer pays for a TTS voice to be created using their preferred voice talent. Such an approach allows for some “tuning” of the scripts used to create the voice. Limited script elements may be added to provide better coverage of the customer’s expected domain, and “gilded phrases” can be included to ensure that specific phrase fragments are spoken perfectly. However, even with such an approach, the recording style is strictly controlled and standard scripts are augmented rather than redesigned from scratch. The “black box” approach to TTS allows systems to be produced that satisfy the needs of a large number of customers, even if this means that solutions may be limited in the persona they present.

Recent advances in conversational agent applications have changed people’s expectations of how a computer voice should sound and interact. Suddenly, it’s much more important for the TTS system to present a persona which matches the goals of the application. Such systems demand a more flamboyant, upbeat and expressive voice. The “black box” approach is no longer sufficient; voices for high-end conversational agents are being explicitly “designed” to meet the needs of such applications. These voices are both expressive and light in tone, and a complete contrast to the more conservative voices available for traditional markets. This paper will describe how Nuance is addressing this new and challenging market.

Author information

Authors and Affiliations

Nuance Communications, Norwich, United Kingdom
Andrew Breen

Editor information

Editors and Affiliations

Speech and Multimodal Interfaces Laboratory, St. Petersburg Institute of Informatics and Automation of the Russian Academy of Sciences, 39, 14th line, 199178, St. Petersburg, Russia
Andrey Ronzhin
Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University, 38, Ostozhenka, 119034, Moscow, Russia
Rodmonga Potapova
Faculty of Technical Sciences, University of Novi Sad, 6, Trg Dositeja Obradovića, 21000, Novi Sad, Serbia
Vlado Delic

About this paper

Cite this paper

Breen, A. (2014). Creating Expressive TTS Voices for Conversation Agent Applications. In: Ronzhin, A., Potapova, R., Delic, V. (eds) Speech and Computer. SPECOM 2014. Lecture Notes in Computer Science, vol 8773. Springer, Cham. https://doi.org/10.1007/978-3-319-11581-8_1

DOI: https://doi.org/10.1007/978-3-319-11581-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11580-1
Online ISBN: 978-3-319-11581-8
eBook Packages: Computer Science, Computer Science (R0)