Computer Speech & Language

Volume 16, Issues 3–4, July–October 2002, Pages 533-550

Efficient integrated response generation from multiple targets using weighted finite state transducers

https://doi.org/10.1016/S0885-2308(02)00023-2

Abstract

In this paper, we describe how language generation and speech synthesis for spoken dialog systems can be efficiently integrated under a weighted finite state transducer architecture. Taking advantage of this efficiency, we show that introducing flexible targets in generation leads to more natural-sounding synthesis. Specifically, we allow multiple wordings of the response and multiple prosodic realizations of the different wordings. The choice of wording and prosodic structure is then jointly optimized with unit selection for waveform generation in speech synthesis. Results of perceptual experiments show that by integrating the steps of language generation and speech synthesis, we are able to achieve improved naturalness of synthetic speech compared to a sequential implementation.

Introduction

Improvements in automatic speech recognition (ASR) have led to many new deployments of speech-enabled computer interfaces, particularly in telephone-based applications. In such systems, the quality of speech output impacts user acceptance, and many commercial human–computer dialog systems are constrained so that they can use pre-recorded voice prompts. While applications with very limited capabilities can make do with pre-recorded speech, many applications require more dynamic response generation, which in turn requires speech synthesis. The simple solution is to use a general text-to-speech synthesis system, but higher quality synthesized speech can be obtained with a system tailored to the specific domain. Further, the fact that the language is generated automatically provides an opportunity to pass more information to the waveform generation module than would be available in unrestricted text-to-speech synthesis. One frequently cited example is the annotation of automatically generated syntactic and semantic structure, as well as dialog context, for improved prosody prediction. While this is important, our focus is on a different opportunity that takes advantage of advances in concatenative speech synthesis: providing the synthesizer with flexibility in terms of possible responses.

The key idea is that there is more than one acceptable response at any point in a dialog. In particular, we take advantage of two main areas of flexibility: the choice of wording and the prosodic realization of an utterance. Instead of passing a single text string (or prosodically annotated text string) to a synthesizer, we pass an annotated network. Thus, the search over wordings, prosodic realizations, and speech units is carried out jointly. In other words, instead of first predicting a specific word sequence and prosodic realization and then searching for units to match that target, our approach effectively makes a "soft" decision about the target words and prosody and evaluates alternative realizations of a given utterance.
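
To make the idea of an annotated network concrete, the sketch below shows, in plain Python, one way such a structure could be represented and enumerated; the slot layout, ToBI-style prosodic tags, and costs are illustrative assumptions, not the authors' actual data structures.

    from itertools import product

    # Each slot offers alternative (words, prosodic tag, generation cost)
    # triples; tags such as "H*" (pitch accent) and "L-L%" (boundary tone)
    # are ToBI-style labels, and all values here are invented for illustration.
    network = [
        [("your flight", "H*", 0.0), ("the flight", "H*", 0.2)],
        [("leaves", "none", 0.0), ("departs", "none", 0.1)],
        [("at three p.m.", "L-L%", 0.0)],
    ]

    def candidate_targets(net):
        """Enumerate every (word sequence, tag sequence, cost) the network allows."""
        for path in product(*net):
            words = " ".join(w for w, _, _ in path)
            tags = [t for _, t, _ in path]
            cost = sum(c for _, _, c in path)
            yield words, tags, cost

    for words, tags, cost in candidate_targets(network):
        print(f"{cost:.1f}  {words}  {tags}")

Rather than committing to one path up front, the generator hands the whole network to the synthesizer, which resolves the remaining choices during unit selection.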

Of course, providing more flexibility increases the search space, and with it the computational cost and, potentially, the implementation complexity. Thus, a key to making this approach practical is an implementation based on weighted finite-state transducers (WFSTs). Each step of network expansion is then followed by minimization, and both operations can be implemented using a general-purpose toolbox.
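
As an illustration of how such a pipeline can be assembled from a general-purpose toolbox, the sketch below uses pynini, an open-source Python wrapper around OpenFst; the paper predates this particular toolkit, so both the tooling and the toy costs are assumptions rather than the authors' implementation.

    import pynini

    # Generator side: alternative wordings, weighted by generation cost.
    wordings = pynini.union(
        pynini.accep("your flight leaves at three p m", weight=0.0),
        pynini.accep("the flight departs at three p m", weight=0.3),
    ).optimize()  # epsilon removal, determinization, minimization

    # Synthesizer side: cost of realizing each wording from the unit database
    # (dummy numbers standing in for unit-selection costs).
    unit_costs = pynini.union(
        pynini.accep("your flight leaves at three p m", weight=0.5),
        pynini.accep("the flight departs at three p m", weight=0.1),
    ).optimize()

    # Composition adds the costs; the shortest path makes the joint decision.
    best = pynini.shortestpath(wordings @ unit_costs)
    print(best.string())  # -> "the flight departs at three p m"

Minimizing after each composition keeps the network small, which is what makes the joint search over all alternatives affordable in practice.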

In this paper, we propose a general architecture based on: (i) representing a response in terms of an annotated word network rather than a single word sequence, and (ii) using a symbolic representation of prosodic structure for annotation. We also describe a specific implementation using a template-based language generator and a variable-length unit selection synthesis system. The paper is organized as follows. We begin in Section 2 with a review of previous work on language generation and speech synthesis in dialog systems. The architecture of our system is described in Section 3, followed by details of how generation and synthesis can be integrated using WFSTs in Section 4. Experiments are described in Section 5, and we conclude by summarizing the key advances and laying out directions for future work in Section 6.

Background

This section provides an overview of recent work related to language generation and speech synthesis in dialog systems. We start by describing different approaches to coupling generation and synthesis within dialog systems (Section 2.1), followed by a summary of the key differences in our approach in Section 2.2. Then, in Section 2.3, we review recent developments in limited domain synthesis that we build on in this work.

Response generation components

Our response generation system is part of a mixed-initiative human–computer dialog system that is designed to help people solve travel planning problems, where users can get information about flights, hotels and rental cars. The complete dialog system is based on the University of Colorado (CU) Communicator system (Pellom et al., 2000), with changes only to the response generation components. The CU system uses a client–server architecture, developed in the DARPA Communicator program, with the Galaxy hub coordinating communication among the servers.
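
As a sketch of what template-based generation with multiple wordings might look like, the following minimal Python example emits every candidate wording for a dialog act; the act name, templates, and slot values are invented for illustration and are not the CU Communicator's actual templates.

    # Templates map a dialog act to several acceptable wordings; slots in
    # braces are filled from the dialog manager's frame.
    TEMPLATES = {
        "confirm_flight": [
            "your flight to {city} leaves at {time}",
            "the flight to {city} departs at {time}",
        ],
    }

    def generate(act, slots):
        """Return every candidate wording for a dialog act, not just one."""
        return [template.format(**slots) for template in TEMPLATES[act]]

    print(generate("confirm_flight", {"city": "Denver", "time": "three p.m."}))

The essential difference from conventional template generation is only that the generator returns all acceptable wordings, leaving the final choice to the synthesizer.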

Integrating generation and synthesis with WFSTs

A WFST architecture provides a framework for a flexible and efficient implementation of the response generation components. The flexibility of WFSTs accommodates the use of variable-size units and different forms of prosody and text generation. The computational efficiency of WFST composition and best-path search allows real-time synthesis, particularly for constrained domain applications. This section gives details about our approach to integrating language generation and speech synthesis using WFSTs.
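
The joint decision that the WFST composition implements can also be viewed as a Viterbi search over candidate units, trading a target cost (match to the requested word and prosody) against a concatenation cost (smoothness of the join). The following self-contained sketch uses toy units and cost functions, not the paper's actual ones.

    def unit_selection(targets, candidates, target_cost, concat_cost):
        """targets: per-slot (word, prosody) specs; candidates[i]: unit ids for slot i."""
        # trellis[i][u] = (cost of best unit sequence ending in u at slot i, backpointer)
        trellis = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
        for i in range(1, len(targets)):
            layer = {}
            for u in candidates[i]:
                cost, prev = min(
                    (trellis[-1][p][0] + concat_cost(p, u), p) for p in trellis[-1]
                )
                layer[u] = (cost + target_cost(targets[i], u), prev)
            trellis.append(layer)
        # Trace back the cheapest unit sequence.
        u = min(trellis[-1], key=lambda x: trellis[-1][x][0])
        path = [u]
        for i in range(len(trellis) - 1, 0, -1):
            u = trellis[i][u][1]
            path.append(u)
        return list(reversed(path))

    # Toy example: "_01" units match the requested prosody better, and joins
    # between units from the same take are smoother.
    targets = [("flight", "H*"), ("leaves", "L-L%")]
    candidates = [["flight_01", "flight_02"], ["leaves_01", "leaves_02"]]
    tcost = lambda tgt, u: 0.1 if u.endswith("_01") else 0.4
    ccost = lambda a, b: 0.0 if a[-2:] == b[-2:] else 0.3
    print(unit_selection(targets, candidates, tcost, ccost))  # ['flight_01', 'leaves_01']

When the input is a network of alternative targets rather than a single sequence, the same search simply runs over a larger trellis, which is exactly what the WFST composition and shortest-path computation provide.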

Experiments

This section provides details on our experiments. First, in Section 5.1 we introduce the corpora that we used. Then, Section 5.2 describes the prosody prediction results. Finally, Section 5.3 covers our perceptual experiment demonstrating the benefits of synthesizing from multiple targets.

Discussion

In summary, we have demonstrated that by expanding the space of candidate responses in a dialog system we can achieve higher quality speech output. Specifically, alternative word sequences in addition to multiple prosodic targets were used to diversify the output of the language generator, taking advantage of the natural (allowable) variability in spoken language. Instead of specifying a single desired utterance, we make a "soft" decision about the word sequence and the prosodic target and let the unit selection search choose the combination that the synthesizer can render most naturally.

Acknowledgements

We thank Bryan Pellom and the Center for Spoken Language Research at the University of Colorado for providing the speech corpus, and Lesley Carmichael for prosodic labeling of the corpus. This material is based upon work supported by the National Science Foundation under Grant No. IIS-9528990. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References (42)

  • I. Bulyko et al., Unit selection for speech synthesis using splicing costs with weighted finite state transducers.
  • J. Cahn, A computational memory and processing model for prosody.
  • M. Collins (1996), A new statistical parser based on bigram lexical dependencies, Proceedings of the 34th Annual...
  • G. Coorman et al., Segment selection in the L&H Realspeak laboratory TTS system.
  • R.J. Davis et al., Assigning intonational features in synthesized spoken directions.
  • R. Donovan et al. (1999), Phrase splicing and variable substitution using the IBM...
  • M. Galley et al., Hybrid natural language generation for spoken dialog systems.
  • J. Hitzeman et al., On the use of automatically generated discourse-level information in a concept-to-speech synthesis system.
  • H. Hon et al., Automatic generation of synthesis units for trainable text-to-speech systems.
  • A. Hunt et al., Unit selection in a concatenative speech synthesis system using a large speech database.
  • K. Kirchhoff, A comparison of classification techniques for the automatic detection of error corrections in human–computer dialogues.
