Partially observable Markov decision processes for spoken dialog systems

https://doi.org/10.1016/j.csl.2006.06.008

Abstract

In a spoken dialog system, determining which action a machine should take in a given situation is a difficult problem because automatic speech recognition is unreliable and hence the state of the conversation can never be known with certainty. Much of the research in spoken dialog systems centres on mitigating this uncertainty and recent work has focussed on three largely disparate techniques: parallel dialog state hypotheses, local use of confidence scores, and automated planning. While in isolation each of these approaches can improve action selection, taken together they currently lack a unified statistical framework that admits global optimization. In this paper we cast a spoken dialog system as a partially observable Markov decision process (POMDP). We show how this formulation unifies and extends existing techniques to form a single principled framework. A number of illustrations are used to show qualitatively the potential benefits of POMDPs compared to existing techniques, and empirical results from dialog simulations are presented which demonstrate significant quantitative gains. Finally, some of the key challenges to advancing this method – in particular scalability – are briefly outlined.

Introduction

Spoken dialog systems (SDS) help people accomplish a task using spoken language. For example, a person might use an SDS to buy a train ticket over the phone, to direct a robot to clean a bedroom, or to control a music player in an automobile. Building SDSs is a challenging engineering problem in large part because automatic speech recognition (ASR) and understanding technology are error-prone. More specifically, speech recognition accuracy is relatively good for constrained speech limited to, for example, digits, place-names, or short commands, but accuracy degrades rapidly as the domain language becomes less constrained. Furthermore, as spoken dialog systems become more complex, not only do the demands on the speech recognition and understanding components increase, but also user behaviour becomes less predictable. Thus, as task complexity increases, overall there is a rapid increase in uncertainty, and principled methods of dealing with this uncertainty are needed in order to make progress in this research area.

As an illustration of the effects of speech recognition errors, consider the example conversation shown in Table 1, taken from Bohus and Rudnicky (2002). The system shown here allows the user to take control of the conversation wherever reasonably possible. In turn 3, the machine asks “What’s your full name?” and in turn 4, the user replies with their name but is misrecognized as saying “Athens in Akron”. Since the machine does not insist on knowing the user’s name, it infers that the user is taking control of the conversation and asking about a flight. Hence, the system interprets “Athens in Akron” as the starting point of a flight booking dialog. This misinterpretation sends the whole conversation off track, and it is not until turn 13, nine turns later, that the conversation begins progressing again.

This interaction illustrates the motivation for the three main approaches that have been developed in order to minimize the effects of errors and uncertainty in a spoken dialog system.

First, systems can attempt to identify errors locally using a confidence score: when a recognition hypothesis has a low confidence score, it can be ignored to reduce the risk of entering bad information into the dialog state. In the example above, if “Athens in Akron” were associated with a poor confidence score, then it could have been identified as an error and the system might have recovered sooner.
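To make this first approach concrete, a minimal sketch of confidence-based rejection follows; the function name, threshold value, and dialog-state representation are hypothetical illustrations, not details from the paper.

```python
# A minimal sketch of local confidence-score rejection. The threshold of
# 0.6 and the dialog-state dictionary are invented for illustration.

def handle_recognition(hypothesis, confidence, dialog_state, threshold=0.6):
    """Accept or reject an ASR hypothesis based on its confidence score."""
    if confidence < threshold:
        # Low confidence: discard the hypothesis rather than risk
        # entering bad information into the dialog state.
        dialog_state["next_action"] = "ask-repeat"
    else:
        # Sufficient confidence: commit the hypothesis to the state.
        dialog_state["last_utterance"] = hypothesis
        dialog_state["next_action"] = "continue"
    return dialog_state

# A poorly scored "Athens in Akron" is rejected rather than being
# misinterpreted as the start of a flight booking.
print(handle_recognition("Athens in Akron", 0.31, {}))
```

Note that the threshold here is exactly the kind of hand-tuned parameter criticized later in this section.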

Second, even accepting that misrecognitions will occur, their consequences can be difficult for human designers to anticipate. Systems can therefore perform automated planning to explore the effects of misrecognitions and determine which sequence of actions is most useful in the long run. Consider turn 5 in the example above: the handcrafted dialog manager chose to disambiguate “Athens”, but automated planning might have revealed that it was better in the long term to first confirm that the user really did say “Athens”, even though in the short term this might waste a turn.
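The following toy calculation illustrates why a planner might prefer the confirmation: with invented costs and error probabilities, confirming sacrifices a turn now to avoid a much larger expected recovery cost later.

```python
# Toy planning calculation (all numbers invented for illustration):
# compare the expected cost of acting on "Athens" immediately versus
# confirming it first.

P_ERROR = 0.3          # assumed probability "Athens" was misrecognized
COST_TURN = 1.0        # cost of one extra dialog turn
COST_RECOVERY = 9.0    # assumed cost of recovering from a bad value

# Proceed directly: pay the recovery cost whenever the value was wrong.
expected_cost_proceed = P_ERROR * COST_RECOVERY            # 2.7

# Confirm first: always spend one turn, plus one re-ask when the user
# rejects the misrecognized value.
expected_cost_confirm = COST_TURN + P_ERROR * COST_TURN    # 1.3

print(expected_cost_proceed, expected_cost_confirm)
# Under these assumptions, confirming is cheaper in the long run even
# though it wastes a turn in the short term.
```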

Finally, accepting that some bad information will inevitably enter the dialog state maintained by the system, it seems unwise to maintain just one hypothesis for the current dialog state. A more robust approach maintains parallel state hypotheses at each time-step. In turn 4 of the example above, the system could have maintained a second hypothesis for the current state – for example, one in which the user said their name but was not understood. The system could then have exploited this information when the non-understanding occurred in turn 7.
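A minimal sketch of this idea, assuming a small discrete set of hypotheses and invented observation likelihoods, is the Bayesian reweighting below.

```python
# Sketch of maintaining parallel dialog-state hypotheses as a probability
# distribution updated by Bayes' rule. State names and likelihoods are
# invented for illustration.

def update_beliefs(beliefs, obs_likelihood):
    """Reweight each hypothesis by how well it explains the latest
    recognition result, then renormalize."""
    unnormalized = {state: p * obs_likelihood.get(state, 0.0)
                    for state, p in beliefs.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        return beliefs  # observation uninformative; keep prior beliefs
    return {state: p / total for state, p in unnormalized.items()}

# Turn 4: either the user is asking about a flight, or the user gave
# their name and it was misrecognized.
beliefs = {"asking-flight": 0.5, "name-misrecognized": 0.5}

# "Athens in Akron" is an odd flight query, so (under these invented
# likelihoods) it is better explained by the misrecognition hypothesis.
beliefs = update_beliefs(beliefs, {"asking-flight": 0.1,
                                   "name-misrecognized": 0.6})
print(beliefs)  # the misrecognition hypothesis now dominates
```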

These three methods of coping with speech recognition errors – local use of confidence scores, automated planning, and parallel dialog hypotheses – can lead to improved performance, and confidence scores in particular are now routinely used in deployed systems. However, these existing methods typically focus on just a small part of the system and rely on ad hoc parameter setting (for example, hand-tuned confidence thresholds) and pre-programmed heuristics. Most seriously, when these techniques are combined in modern systems, they lack an overall statistical framework that can support global optimization and on-line adaptation.

In this paper, we will argue that a partially observable Markov decision process (POMDP) provides such a framework. We will explain how a POMDP can be developed to encompass a complete dialog system, how a POMDP serves as a basis for optimization, and how a POMDP can integrate uncertainty in the form of statistical distributions with heuristics in the form of manually specified rules. To illustrate the power of the POMDP formalism, we will show how each of the three approaches above represents a special case of the more general POMDP model. Further, we provide evidence of the potential benefits of POMDPs through experimental results obtained from simulated dialogs. Finally, we address scalability and argue that whilst the computational issues are certainly demanding, tractable implementations of POMDP-based dialog systems are feasible.

The paper is organized as follows. Section 2 begins by reviewing POMDPs and then shows how the state space of a POMDP can be factored to represent a spoken dialog system in a way which explicitly represents the major sources of uncertainty. Next, Section 3 shows how each of the three techniques mentioned above – parallel dialog hypotheses, local confidence scoring, and automated planning – is naturally subsumed by the POMDP architecture. Section 4 discusses the advantages of POMDPs using a combination of illustrative dialogs and experimental simulation, including simulations with user models estimated from real dialog data. Finally, Section 4.4 concludes by highlighting the key challenge of scalability and suggesting two methods for advancing POMDP-based spoken dialog systems.


Casting a spoken dialog system as a POMDP

In this section we will cast a spoken dialog system as a POMDP. We start by briefly reviewing POMDPs. Then, we analyze the typical architecture of a spoken dialog system and identify the major sources of uncertainty. Finally, we show how to represent a spoken dialog system as a POMDP. In this discussion, extensive use is made of influence diagrams and Bayesian inference – readers unfamiliar with these topics are referred to texts such as Jensen (2001).
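For reference, the belief update at the heart of this formulation can be stated compactly in standard POMDP notation (a sketch of the general form, not necessarily the paper's exact symbols):

```latex
% After taking action a and receiving observation o, the belief b over
% hidden dialog states is updated by Bayes' rule:
\[
  b'(s') \;=\; \eta \, P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s)
\]
% where \eta is a normalizing constant. Dialog management then amounts
% to choosing a policy \pi over belief states that maximizes the
% expected discounted sum of rewards:
\[
  \pi^{*} \;=\; \arg\max_{\pi} \; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r\big(s_{t}, \pi(b_{t})\big)\right]
\]
```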

POMDPs and existing architectures

As described in the previous section, the SDS-POMDP model allows the dialog management problem to be cast in a statistical framework. It is therefore particularly well-suited to coping with the uncertainty inherent in spoken dialog systems. In this section, three existing techniques for handling uncertainty in an SDS will be reviewed: maintaining multiple dialog states, local use of confidence scores, and automated planning. In each case, it will be shown that the SDS-POMDP model provides a principled generalization of the existing technique.

Empirical support for the SDS-POMDP framework

Section 2 has shown how POMDPs can be viewed as a principled theoretical approach to dialog management under uncertainty and Section 3 has demonstrated that existing approaches to handling uncertainty are subsumed and generalized by the SDS-POMDP framework. In this section, the practical advantages of utilising the SDS-POMDP framework are demonstrated through example interactions and simulation experiments.

Acknowledgements

The authors thank Pascal Poupart for many helpful discussions and comments. This work was supported in part by the European Union “Tools for Ambient Linguistic Knowledge (TALK)” project.

References

  • Bohus, D., Carpenter, P., Jin, C., Wilson, D., Zhang, R., Rudnicky, A.I., 2001. Is this conversation on track? In: Proc....
  • Bohus, D., Rudnicky, A.I., 2002. Integrating multiple knowledge sources for utterance-level confidence annotation in...
  • Bohus, D., Rudnicky, A.I., 2005a. Sorry, I didn’t catch that! – An investigation of non-understanding errors and...
  • Bohus, D., Rudnicky, A.I., 2005b. A principled approach for rejection threshold optimization in spoken dialog systems....
  • Cassandra, A.R., Kaelbling, L.P., Littman, M.L., 1994. Acting optimally in partially observable stochastic domains. In:...
  • Denecke, M., Dohsaka, K., Nakano, M., 2004. Learning dialogue policies using state aggregation in reinforcement...
  • Deng, Y., Mahajan, M., Acero, A., 2003. Estimating speech recognition error rate without acoustic test data. In: Proc....
  • Doran, C., Aberdeen, J., Damianos, L., Hirschman, L., 2001. Comparing several aspects of human–computer and human–human...
  • Evermann, G., Woodland, P., 2000. Posterior probability decoding, confidence estimation and system combination. In:...
  • Gabsdil, M., Lemon, O., 2004. Combining acoustic and pragmatic features to predict recognition performance in spoken...
  • Glass, J., 1999. Challenges for spoken dialogue systems. In: Proc. IEEE Workshop on Automatic Speech Recognition and...
  • Goddeau, D., Pineau, J., 2000. Fast reinforcement learning of dialog strategies. In: Proc. IEEE Int. Conf. on...
  • Hansen, E.A., 1998. Solving POMDPs by searching in policy space. In: Proc Uncertainty in Artificial Intelligence (UAI),...
  • Henderson, J., Lemon, O., Georgila, K., 2005. Hybrid reinforcement/supervised learning for dialogue policies from...
  • Higashinaka, H., Nakano, M., Aikawa, K., 2003. Corpus-based discourse understanding in spoken dialogue systems. In:...
  • Hirschberg, J., Litman, D., Swerts, M., 2001. Detecting misrecognitions and corrections in spoken dialogue systems from...
  • Hoey, J., Poupart, P., 2005. Solving POMDPs with continuous or large observation spaces. In: Proc. Int. Joint Conf. on...
  • Horvitz, E., Paek, T., 2000. DeepListener: harnessing expected utility to guide clarification dialog in spoken language...
  • Jensen, F., 2001. Bayesian Networks and Decision Graphs.
  • Jurafsky, D., Martin, J.H., 2000. Speech and Language Processing.
  • Kaelbling, L.P., Littman, M.L., Cassandra, A.R., 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence.
  • Kemp, T., Schaff, T., 1997. Estimating confidence using word lattices. In: Proc. Eurospeech, Rhodes, Greece, pp....
  • Krahmer, E., Swerts, M., Theune, M., Weegels, M., 1999. Problem spotting in human–machine interaction. In: Proc....
  • Krahmer, E., Swerts, M., Theune, M., Weegels, M., 2001. Error detection in spoken human–machine interaction. International Journal of Speech Technology.
  • Lane, I.R., Ueno, S., Kawahara, T., 2004. Cooperative dialogue planning with user and situation models via...
  • Langkilde, I., Walker, M.A., Wright, J., Gorin, A., Litman, D., 1999. Automatic prediction of problematic...
  • Larsson, S., Traum, D., 2000. Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering.
  • Levin, E., Pieraccini, R., 1997. A stochastic model of computer–human interaction for learning dialog strategies. In:...
  • Levin, E., Pieraccini, R., Eckert, W., 1998. Using Markov decision process for learning dialogue strategies. In: Proc....
  • Levin, E., Pieraccini, R., Eckert, W., 2000. A stochastic model of human–machine interaction for learning dialogue strategies. IEEE Transactions on Speech and Audio Processing.