Partially observable Markov decision processes for spoken dialog systems
Introduction
Spoken dialog systems (SDS) help people accomplish a task using spoken language. For example, a person might use an SDS to buy a train ticket over the phone, to direct a robot to clean a bedroom, or to control a music player in an automobile. Building SDSs is a challenging engineering problem in large part because automatic speech recognition (ASR) and understanding technology are error-prone. More specifically, speech recognition accuracy is relatively good for constrained speech limited to, for example, digits, place-names, or short commands, but accuracy degrades rapidly as the domain language becomes less constrained. Furthermore, as spoken dialog systems become more complex, not only do the demands on the speech recognition and understanding components increase, but also user behaviour becomes less predictable. Thus, as task complexity increases, overall there is a rapid increase in uncertainty, and principled methods of dealing with this uncertainty are needed in order to make progress in this research area.
As an illustration of the effects of speech recognition errors, consider the example conversation shown in Table 1, taken from Bohus and Rudnicky (2002). The system shown here allows the user to take control of the conversation wherever reasonably possible. In turn 3, the machine asks “What’s your full name?” and in turn 4 the user replies with their name, but the reply is misrecognized as “Athens in Akron”. Since the machine does not insist on knowing the user’s name, it infers that the user is taking control of the conversation and is asking about a flight. Hence, the system interprets “Athens in Akron” as the starting point of a flight booking dialog. This choice of interpretation sends the whole conversation off track, and it is not until turn 13, nine turns later, that the conversation gets back on track.
This interaction illustrates the motivation for the three main approaches that have been developed in order to minimize the effects of errors and uncertainty in a spoken dialog system.
First, systems can attempt to identify errors locally using a confidence score: when a recognition hypothesis has a low confidence score, it can be ignored to reduce the risk of entering bad information into the dialog state. In the example above, if “Athens in Akron” were associated with a poor confidence score, then it could have been identified as an error and the system might have recovered sooner.
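The local use of a confidence score can be sketched as a simple rejection rule. The following is a minimal illustration, not the mechanism of any particular system; the threshold value and function names are invented for the example.

```python
def filter_hypothesis(hypothesis, confidence, threshold=0.4):
    """Return the hypothesis if its confidence score clears the
    threshold, otherwise None so the dialog state is left unchanged.
    The threshold of 0.4 is purely illustrative; in practice such
    thresholds are hand-tuned per domain."""
    if confidence >= threshold:
        return hypothesis
    return None  # treat the turn as a non-understanding

# Had the misrecognition in the example dialog scored poorly, it would
# be rejected rather than entered into the dialog state:
rejected = filter_hypothesis("Athens in Akron", 0.22)
accepted = filter_hypothesis("John Smith", 0.91)
```

Note that such a rule acts only locally, on a single recognition result; it cannot by itself reason about the downstream consequences of accepting or rejecting a hypothesis.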
Second, misrecognitions will inevitably occur, and their consequences can be difficult for human designers to anticipate. Systems can therefore perform automated planning to explore the effects of misrecognitions and determine which sequences of actions are most useful in the long run. Consider turn 5 in the example above: the handcrafted dialog manager chose to disambiguate “Athens”, but automated planning might have revealed that it was better in the long term to first confirm that the user really did say “Athens”, even though in the short term this might waste a turn.
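The confirm-versus-accept trade-off can be made concrete with a one-step lookahead over expected reward. All rewards and costs below are invented solely to illustrate the planning calculation; a real system would learn or estimate these quantities.

```python
# Illustrative one-step lookahead: given probability p that the last
# recognition was correct, compare accepting it outright with first
# asking a confirmation question. The numbers are invented.
REWARD_TASK_OK = 10.0    # dialog completes with correct information
REWARD_TASK_BAD = -20.0  # dialog goes off track, as in the example
COST_CONFIRM = -1.0      # a confirmation question wastes a turn

def value_accept(p):
    """Expected return of accepting the recognition result as-is."""
    return p * REWARD_TASK_OK + (1 - p) * REWARD_TASK_BAD

def value_confirm(p):
    """Expected return of confirming first: assume any error is then
    caught and repaired, so the task completes correctly either way,
    minus the cost of the extra turn."""
    return COST_CONFIRM + REWARD_TASK_OK

def best_action(p):
    return "accept" if value_accept(p) > value_confirm(p) else "confirm"
```

Under these toy numbers, confirmation is worthwhile unless the recognizer is very reliable, which is precisely the kind of conclusion a planner can reach but a short-sighted handcrafted rule may miss.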
Finally, accepting that some bad information will be entered into the dialog state maintained by the system, it seems unwise to maintain just one hypothesis for the current dialog state. A more robust approach would maintain parallel state hypotheses at each time-step. In turn 4 in the example above, the system could have maintained a second hypothesis for the current state – for example, in which the user said their name but was not understood. The system could have later exploited this information when a non-understanding happened in turn 7.
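Maintaining parallel state hypotheses can be sketched as keeping a probability over candidate dialog states and reweighting it after each user turn. The state names and likelihood values below are invented for illustration.

```python
# A minimal sketch of parallel dialog-state hypotheses. Each hypothesis
# carries a probability; after every user turn the probabilities are
# reweighted by how well each hypothesis explains the recognition
# result, then renormalised.

def update_hypotheses(hypotheses, likelihood):
    """hypotheses: dict mapping state name -> probability.
    likelihood: dict mapping state name -> P(observation | state)."""
    updated = {s: p * likelihood.get(s, 0.0) for s, p in hypotheses.items()}
    total = sum(updated.values())
    if total == 0.0:
        return hypotheses  # observation explains nothing; keep the prior
    return {s: p / total for s, p in updated.items()}

# Turn 4: "Athens in Akron" is somewhat plausible as a flight query,
# but also plausible as a misrecognised name (invented likelihoods).
beliefs = {"asked-flight": 0.5, "gave-name-misrecognised": 0.5}
beliefs = update_hypotheses(
    beliefs, {"asked-flight": 0.3, "gave-name-misrecognised": 0.2})
```

Because the second hypothesis retains non-zero mass, a later non-understanding (as in turn 7) can shift belief back toward it rather than forcing the system to recover from scratch.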
These three methods of coping with speech recognition errors – local use of confidence scores, automated planning, and parallel dialog hypotheses – can lead to improved performance, and confidence scores in particular are now routinely used in deployed systems. However, these existing methods typically focus on just a small part of the system and rely on ad hoc parameter setting (for example, hand-tuned thresholds) and pre-programmed heuristics. Most seriously, when these techniques are combined in modern systems, there is no overall statistical framework to support global optimization and on-line adaptation.
In this paper, we will argue that a partially observable Markov decision process (POMDP) provides such a framework. We will explain how a POMDP can be developed to encompass a complete dialog system, how a POMDP serves as a basis for optimization, and how a POMDP can integrate uncertainty in the form of statistical distributions with heuristics in the form of manually specified rules. To illustrate the power of the POMDP formalism, we will show how each of the three approaches above represents a special case of the more general POMDP model. Further, we provide evidence of the potential benefits of POMDPs through experimental results obtained from simulated dialogs. Finally, we address scalability and argue that whilst the computational issues are certainly demanding, tractable implementations of POMDP-based dialog systems are feasible.
The paper is organized as follows. Section 2 begins by reviewing POMDPs and then shows how the state space of a POMDP can be factored to represent a spoken dialog system in a way which explicitly represents the major sources of uncertainty. Next Section 3 shows how each of the three techniques mentioned above – parallel dialog hypotheses, local confidence scoring, and automated planning – are naturally subsumed by the POMDP architecture. Section 4 discusses the advantages of POMDPs using a combination of illustrative dialogs and experimental simulation, including simulations with user models estimated from real dialog data. Finally, Section 4.4 concludes by highlighting the key challenge of scalability and suggests two methods for advancing POMDP-based spoken dialog systems.
Section snippets
Casting a spoken dialog system as a POMDP
In this section we will cast a spoken dialog system as a POMDP. We start by briefly reviewing POMDPs. Then, we analyze the typical architecture of a spoken dialog system and identify the major sources of uncertainty. Finally, we show how to represent a spoken dialog system as a POMDP. In this discussion extensive use is made of influence diagrams and Bayesian inference – readers unfamiliar with these topics are referred to texts such as (Jensen, 2001).
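The core inference step that the SDS-POMDP model builds on is the standard POMDP belief update, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s). The following sketch implements that equation directly; the two-state example and its transition and observation probabilities are toy values invented for illustration.

```python
# Generic POMDP belief update over a discrete state space.
def belief_update(b, a, o, T, O):
    """b: dict state -> prior probability; T[a][s][s2]: transition
    probabilities; O[a][s2][o]: observation probabilities.
    Returns the normalised posterior belief."""
    new_b = {}
    for s2 in b:
        predicted = sum(T[a][s][s2] * b[s] for s in b)  # prediction step
        new_b[s2] = O[a][s2][o] * predicted             # observation step
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}

# Toy example: two hidden user goals, one machine action, noisy hearing.
T = {"ask": {"want-ticket": {"want-ticket": 0.9, "want-info": 0.1},
             "want-info":   {"want-ticket": 0.1, "want-info": 0.9}}}
O = {"ask": {"want-ticket": {"heard-ticket": 0.8, "heard-info": 0.2},
             "want-info":   {"heard-ticket": 0.3, "heard-info": 0.7}}}

belief = {"want-ticket": 0.5, "want-info": 0.5}
belief = belief_update(belief, "ask", "heard-ticket", T, O)
```

After hearing “ticket” once, the belief shifts toward the ticket goal without discarding the alternative, which is exactly the behaviour exploited in the remainder of this section.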
POMDPs and existing architectures
As described in the previous section, the SDS-POMDP model allows the dialog management problem to be cast in a statistical framework. It is therefore particularly well-suited to coping with the uncertainty inherent in spoken dialog systems. In this section, three existing techniques for handling uncertainty in an SDS will be reviewed: maintaining multiple dialog states, local use of confidence scores, and automated planning. In each case, it will be shown that the SDS-POMDP model provides an …
Empirical support for the SDS-POMDP framework
Section 2 has shown how POMDPs can be viewed as a principled theoretical approach to dialog management under uncertainty and Section 3 has demonstrated that existing approaches to handling uncertainty are subsumed and generalized by the SDS-POMDP framework. In this section, the practical advantages of utilising the SDS-POMDP framework are demonstrated through example interactions and simulation experiments.
Acknowledgements
The authors thank Pascal Poupart for many helpful discussions and comments. This work was supported in part by the European Union “Tools for Ambient Linguistic Knowledge (TALK)” project.
References (60)
- Bohus, D., Carpenter, P., Jin, C., Wilson, D., Zhang, R., Rudnicky, A.I., 2001. Is this conversation on track? In: Proc....
- Bohus, D., Rudnicky, A.I., 2002. Integrating multiple knowledge sources for utterance-level confidence annotation in...
- Bohus, D., Rudnicky, A.I., 2005a. Sorry, I didn’t catch that! – An investigation of non-understanding errors and...
- Bohus, D., Rudnicky, A.I., 2005b. A principled approach for rejection threshold optimization in spoken dialog systems....
- Cassandra, A.R., Kaelbling, L.P., Littman, M.L., 1994. Acting optimally in partially observable stochastic domains. In:...
- Denecke, M., Dohsaka, K., Nakano, M., 2004. Learning dialogue policies using state aggregation in reinforcement...
- Deng, Y., Mahajan, M., Acero, A., 2003. Estimating speech recognition error rate without acoustic test data. In: Proc....
- Doran, C., Aberdeen, J., Damianos, L., Hirschman, L., 2001. Comparing several aspects of human–computer and human–human...
- Evermann, G., Woodland, P., 2000. Posterior probability decoding, confidence estimation and system combination. In:...
- Gabsdil, M., Lemon, O., 2004. Combining acoustic and pragmatic features to predict recognition performance in spoken...
- Bayesian Networks and Decision Graphs
- Speech and Language Processing
- Planning and acting in partially observable stochastic domains. Artificial Intelligence
- Error detection in spoken human–machine interaction. International Journal of Speech Technology
- Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering
- A stochastic model of human–machine interaction for learning dialogue strategies. IEEE Transactions on Speech and Audio Processing
1. Work carried out while at Cambridge University, Engineering Department.