Efficient integrated response generation from multiple targets using weighted finite state transducers
Introduction
Improvements in automatic speech recognition (ASR) have led to many new deployments of speech-enabled computer interfaces, particularly in telephone-based applications. In such systems, the quality of speech output affects user acceptance, and many commercial human–computer dialog systems are constrained so that they can use pre-recorded voice prompts. While applications with very limited capabilities can rely on pre-recorded speech, many applications require more dynamic response generation, which in turn requires speech synthesis. The simple solution is to use a general text-to-speech synthesis system, but higher quality synthesized speech can be obtained with a system tailored to the specific domain. Further, the fact that the language is generated automatically provides an opportunity to pass more information to the waveform generation module than would be available in unrestricted text-to-speech synthesis. One frequently cited example is the annotation of automatically generated syntactic and semantic structure, as well as dialog context, for improved prosody prediction. While this is important, our focus is on a different opportunity that takes advantage of advances in concatenative speech synthesis: providing the synthesizer with flexibility in terms of possible responses.
The key idea is that there is more than one acceptable response at any point in a dialog. In particular, we take advantage of two main areas of flexibility: choice of wording and prosodic realization of an utterance. Instead of passing a single text string (or prosodically annotated text string) to a synthesizer, we pass an annotated network. Thus, the search for wording, prosody prediction and speech units is optimized jointly. In other words, instead of predicting a specific word sequence and prosodic realization first and then searching for units to match that target, our approach effectively makes a “soft” decision about the target words and prosody and evaluates alternative realizations of a given utterance.
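To make this concrete, the sketch below builds such an annotated network: two acceptable wordings of one response, tagged with a symbolic accent marker, are unioned into a single weighted acceptor whose paths all remain available to the synthesizer. The pynini toolkit, the '+' accent tag, and all weights are illustrative assumptions, not the paper's actual machinery.

```python
# A minimal sketch, assuming the pynini WFST toolkit (not the toolbox
# used in the paper). Weights are hypothetical target costs in the
# tropical semiring; lower is better.
import pynini

# Two acceptable wordings of the same response, annotated with a
# symbolic prosodic tag ('+' marks a pitch-accented word).
wordings = [
    ("there are two flights+ to boston+", 0.0),  # preferred phrasing
    ("i found two flights+ to boston+", 0.5),    # acceptable alternative
]

# Union the alternatives into one annotated word network. This is the
# "soft" decision: no single target is committed to at this stage.
network = pynini.union(
    *[pynini.accep(text, weight=w) for text, w in wordings]
).optimize()

# A hard decision could still be extracted here, but every path stays
# available for joint scoring against the unit database.
print(pynini.shortestpath(network).string())
```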
Of course, providing more flexibility enlarges the search space, and with it the computational cost and potentially the implementation complexity. Thus, a key to making this approach practical is an implementation based on weighted finite-state transducers (WFSTs). Each step of network expansion is then followed by minimization, and both operations can be implemented with a general-purpose toolbox.
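As a rough illustration of this expand-then-minimize cycle, the fragment below (continuing the hypothetical pynini sketch above) performs one expansion step by composition and then calls a single optimization routine; the accent-refinement rule is a toy stand-in for a real annotation step.

```python
# Continuing the sketch above. Each expansion is a composition with a
# transducer; optimize() then applies epsilon removal, determinization,
# and minimization so intermediate networks stay compact.
import pynini

# Alphabet for the rewrite rule (letters, space, and the toy tags).
sigma_star = pynini.union(*" abcdefghijklmnopqrstuvwxyz+HL").closure()

# Toy expansion: refine each accent tag '+' into a high or low accent
# ('+H' or '+L'), a hypothetical stand-in for prosody expansion.
expand = pynini.cdrewrite(
    pynini.union(pynini.cross("+", "+H"), pynini.cross("+", "+L")),
    "", "", sigma_star,
)

expanded = network @ expand  # one expansion step, via composition
print("states before minimization:", expanded.num_states())
expanded.optimize()          # one general-purpose toolbox call
print("states after minimization:", expanded.num_states())
```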
In this paper, we propose a general architecture based on: (i) representing a response in terms of an annotated word network rather than a single word sequence, and (ii) using a symbolic representation of prosodic structure for annotation. We also describe a specific implementation using a template-based language generator and a variable-length unit selection synthesis system. The paper is organized as follows. We begin in Section 2 with a review of previous work on language generation and speech synthesis in dialog systems. The architecture of our system is described in Section 3, followed by details of how generation and synthesis can be integrated using WFSTs in Section 4. Experiments are described in Section 5, and we conclude by summarizing the key advances and laying out directions for future work in Section 6.
Section snippets
Background
This section provides an overview of recent work related to language generation and speech synthesis in dialog systems. We start by describing different approaches to coupling generation and synthesis within dialog systems (Section 2.1), followed by a summary of the key differences in our approach in Section 2.2. Then, in Section 2.3, we review recent developments in limited domain synthesis that we build on in this work.
Response generation components
Our response generation system is part of a mixed-initiative human–computer dialog system that is designed to help people solve travel planning problems, where users can get information about flights, hotels and rental cars. The complete dialog system is based on the University of Colorado (CU) Communicator system (Pellom et al., 2000), with changes only to the response generation components. The CU system uses a client-server architecture developed in the DARPA Communicator program.
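Although the CU system's actual templates are not reproduced here, a minimal sketch of template-based generation for the travel domain might look like the following; the template wordings, slot names, and costs are all hypothetical.

```python
# A sketch of template-based response generation, assuming pynini.
# Templates and costs are hypothetical, not the CU Communicator's.
import pynini

def build_response_network(frame, templates):
    """Instantiate each template from the dialog frame and union the
    alternative wordings into one weighted word network."""
    paths = [
        pynini.accep(template.format(**frame), weight=cost)
        for template, cost in templates
    ]
    return pynini.union(*paths).optimize()

# Hypothetical dialog frame and ranked templates for a flight query.
frame = {"n": "two", "city": "boston"}
templates = [
    ("there are {n} flights to {city}", 0.0),
    ("i found {n} flights to {city}", 0.5),
    ("{n} flights are available to {city}", 1.0),
]

network = build_response_network(frame, templates)
print(pynini.shortestpath(network).string())
```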
Integrating generation and synthesis with WFSTs
A WFST architecture provides a framework for a flexible and efficient implementation of the response generation components. The flexibility of WFSTs accommodates the use of variable size units and different forms of prosody and text generation. The computational efficiency of WFST composition and finding the best path allows real-time synthesis, particularly for constrained domain applications. This section gives details about our approach to integrating language generation and speech synthesis.
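The actual transducers are developed in the paper itself; as a shape-of-the-computation sketch (all machines, tags, and costs below are hypothetical), the integrated search composes the word network W with a prosody-prediction transducer P and a unit-scoring machine U, and a single shortest-path call then makes the joint decision:

```python
# Sketch of the integrated WFST search, assuming pynini. W, P, and U
# are toy stand-ins for the word network, prosody predictor, and
# unit-selection scorer; only the composition pipeline is the point.
import pynini

sigma_star = pynini.union(*" abcdefghijklmnopqrstuvwxyz+").closure()

# W: the generator's word network (two acceptable wordings).
W = pynini.union(
    pynini.accep("two flights to boston", weight=0.0),
    pynini.accep("i found two flights to boston", weight=0.2),
).optimize()

# P: prosody prediction as a transducer. Accenting 'boston' is free;
# leaving it unaccented is allowed but penalized.
accent = pynini.cross("boston", "boston+")
no_accent = pynini.accep("boston") + pynini.accep("", weight=0.6)
P = pynini.cdrewrite(pynini.union(accent, no_accent), "", "", sigma_star)

# U: a toy stand-in for unit selection, scoring annotated strings by
# how well the (hypothetical) unit database covers them.
U = pynini.union(
    pynini.accep("i found two flights to boston+", weight=0.1),
    pynini.accep("two flights to boston", weight=0.3),
).optimize()

# One shortest-path search over the composed space decides wording,
# prosody, and units jointly; no hard intermediate choice is made.
best = pynini.shortestpath(W @ P @ U)
print(best.project("output").string())
```

In this toy example the winning path uses the wording that is more expensive in W, because its accented realization is better covered by U; only the combined path cost matters, which is exactly the benefit of deferring the decision.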
Experiments
This section provides details on our experiments. First, in Section 5.1 we introduce the corpora that we used. Then, Section 5.2 describes the prosody prediction results. Finally, Section 5.3 covers our perceptual experiment demonstrating the benefits of synthesizing from multiple targets.
Discussion
In summary, we have demonstrated that by expanding the space of candidate responses in a dialog system we can achieve higher quality speech output. Specifically, alternative word sequences in addition to multiple prosodic targets were used to diversify the output of the language generator, taking advantage of the natural (allowable) variability in spoken language. Instead of specifying a single desired utterance, we make a “soft” decision about the word sequence and the prosodic target and let the synthesizer's search select the realization with the lowest overall cost.
Acknowledgements
We thank Bryan Pellom and the Center for Spoken Language Research at the University of Colorado for providing the speech corpus, and Lesley Carmichael for prosodic labeling of the corpus. This material is based upon work supported by the National Science Foundation under Grant No. IIS-9528990. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References (42)
- A hidden Markov-model-based trainable speech synthesizer. Computer Speech and Language (1999).
- Pitch accent in context: Predicting prominence from text. Artificial Intelligence (1993).
- Weighted finite-state transducers in speech recognition. Computer Speech and Language (2002).
- Specifying intonation from context for speech synthesis. Speech Communication (1994).
- Predicting abstract prosodic labels for speech synthesis. Computer Speech and Language (1996).
- Automatic classification of intonational phrase boundaries. Computer Speech and Language (1992).
- Natural language generation in dialog systems.
- The AT&T Next-Gen TTS system.
- Limited domain synthesis.
- Joint prosody prediction and unit selection for concatenative speech synthesis.
- Unit selection for speech synthesis using splicing costs with weighted finite state transducers.
- A computational memory and processing model for prosody.
- Segment selection in the L&H Realspeak laboratory TTS system.
- Assigning intonational features in synthesized spoken directions.
- Hybrid natural language generation for spoken dialog systems.
- On the use of automatically generated discourse-level information in a concept-to-speech synthesis system.
- Automatic generation of synthesis units for trainable text-to-speech systems.
- Unit selection in a concatenative speech synthesis system using a large speech database.
- A comparison of classification techniques for the automatic detection of error corrections in human–computer dialogues.