Dialogue POMDP components (Part II): learning the reward function

International Journal of Speech Technology

Abstract

The partially observable Markov decision process (POMDP) framework has been applied in dialogue systems as a formal framework for representing uncertainty explicitly while remaining robust to noise. In this context, estimating the dialogue POMDP model components (states, observations, and reward) is a significant challenge, as they have a direct impact on the optimized dialogue POMDP policy. Learning the states and observations underlying a POMDP was covered in the first part (Part I), whereas this part (Part II) covers learning the reward function required by the POMDP. To this end, we propose two algorithms based on inverse reinforcement learning (IRL). The first, called POMDP-IRL-BT (BT for belief transition), approximates a belief transition model analogous to the transition models of Markov decision processes. The second, a point-based POMDP-IRL algorithm denoted PB-POMDP-IRL (PB for point-based), approximates the values of new beliefs, which arise during the computation of policy values, using a linear approximation of expert beliefs. Finally, we apply the two algorithms to healthcare dialogue management in order to learn a dialogue POMDP from dialogues collected by SmartWheeler (an intelligent wheelchair).
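
The abstract describes both IRL algorithms only at a high level. As a rough illustration of the belief-transition idea behind POMDP-IRL-BT, the Python sketch below builds a belief-to-belief transition model over the expert's belief points and computes the expert's discounted feature expectations, the quantity a linear reward is fitted to match in IRL (in the spirit of Ng and Russell 2000 and Abbeel and Ng 2004). The function names, the layout of episodes as (belief, action, observation) triples, and the nearest-point discretization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Standard POMDP belief update: b'(s') ∝ Z[a, o, s'] * sum_s T[a, s, s'] * b(s)."""
    b_next = Z[a, o] * (b @ T[a])
    total = b_next.sum()
    return b_next / total if total > 0 else b_next

def nearest_belief(b, belief_points):
    """Index of the expert belief point closest (in L2 distance) to b."""
    return int(np.argmin(np.linalg.norm(belief_points - b, axis=1)))

def estimate_belief_transitions(episodes, belief_points, T, Z, n_actions):
    """Approximate a belief transition model BT[a, i, j] ≈ Pr(b_j | b_i, a) over the
    finite set of expert belief points: propagate each expert belief through the
    POMDP dynamics and snap the successor to the nearest expert belief point."""
    n_b = len(belief_points)
    counts = np.zeros((n_actions, n_b, n_b))
    for episode in episodes:                # episode: list of (belief, action, observation)
        for b, a, o in episode:
            i = nearest_belief(b, belief_points)
            j = nearest_belief(belief_update(b, a, o, T, Z), belief_points)
            counts[a, i, j] += 1.0
    # Normalize visit counts into conditional distributions; unvisited rows stay uniform.
    row_sums = counts.sum(axis=2, keepdims=True)
    return np.where(row_sums > 0, counts / np.clip(row_sums, 1e-12, None), 1.0 / n_b)

def expert_feature_expectations(episodes, phi, gamma=0.95):
    """Discounted feature expectations of the expert, mu_E = E[sum_t gamma^t phi(b_t, a_t)],
    which a reward that is linear in phi is fitted to match."""
    mu = 0.0
    for episode in episodes:
        for t, (b, a, _) in enumerate(episode):
            mu = mu + (gamma ** t) * phi(b, a)
    return mu / len(episodes)
```

Given a reward that is linear in the features, the belief MDP induced by `BT` can be solved for candidate weight vectors, and the weights adjusted until the learned policy's feature expectations approach `mu_E`. The PB-POMDP-IRL variant described in the abstract instead approximates the values of newly encountered beliefs via a linear approximation over the expert beliefs.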


References

  • Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML’04). Banff, AB, Canada.

  • Boularias, A., Chinaei, H. R., & Chaib-draa, B. (2010). Learning the reward model of dialogue POMDPs from data. In NIPS 2010 Workshop on Machine Learning for Assistive Technologies. Vancouver, BC, Canada.

  • Boularias, A., Kober, J., & Peters, J. (2011). Relative entropy inverse reinforcement learning. Journal of Machine Learning Research—Proceedings Track, 15, 182–189.

  • Chandramohan, S., Geist, M., Lefèvre, F., & Pietquin, O. (2012). Behavior specific user simulation in spoken dialogue systems. In Proceedings of the IEEE ITG Conference on Speech Communication. Braunschweig, Germany.

  • Chinaei, H. R., & Chaib-draa, B. (2011). Learning dialogue POMDP models from data. In Proceedings of the 24th Canadian Conference on Advances in Artificial Intelligence (Canadian AI’11). St. John’s, NL, Canada.

  • Chinaei, H. R., & Chaib-draa, B. (2014). Dialogue POMDP components (Part I): Learning states and observations. International Journal of Speech Technology (this issue).

  • Chinaei, H. R., Chaib-draa, B., & Lamontagne, L. (2012). Learning observation models for dialogue POMDPs. In Proceedings of the 25th Canadian Conference on Advances in Artificial Intelligence (Canadian AI’12). Toronto, ON, Canada.

  • Choi, J., & Kim, K.-E. (2011). Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12, 691–730.

  • Gašić, M. (2011). Statistical Dialogue Modelling. PhD thesis, Department of Engineering, University of Cambridge.

  • Ji, S., Parr, R., Li, H., Liao, X., & Carin, L. (2007). Point-based policy iteration. In Proceedings of the 22nd National Conference on Artificial Intelligence (vol. 2) (AAAI’07). Vancouver, BC, Canada.

  • Kim, D., Kim, J., & Kim, K. (2011). Robust performance evaluation of POMDP-based dialogue systems. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 1029–1040.

  • Neu, G., & Szepesvári, C. (2007). Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI’07). Vancouver, BC, Canada.

  • Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML’00). Stanford, CA, USA.

  • Paek, T., & Pieraccini, R. (2008). Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, 50(8), 716–729.

  • Pinault, F., & Lefèvre, F. (2011). Semantic graph clustering for POMDP-based spoken dialog systems. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH’11). Florence, Italy.

  • Pineau, J., Gordon, G., & Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI’03). Acapulco, Mexico.

  • Pineau, J., West, R., Atrash, A., Villemure, J., & Routhier, F. (2011). On the feasibility of using a standardized test for evaluating a speech-controlled smart wheelchair. International Journal of Intelligent Control and Systems, 16(2), 124–131.

  • Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07). Hyderabad, India.

  • Roy, N., Pineau, J., & Thrun, S. (2000). Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (ACL’00). Hong Kong.

  • Spaan, M., & Vlassis, N. (2005). Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24(1), 195–220.

  • Syed, U., & Schapire, R. (2008). A game-theoretic approach to apprenticeship learning. In Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada.

  • Thomson, B. (2009). Statistical Methods for Spoken Dialogue Management. PhD thesis, Department of Engineering, University of Cambridge.

  • Williams, J. D. (2006). Partially Observable Markov Decision Processes for Spoken Dialogue Management. PhD thesis, Department of Engineering, University of Cambridge.

  • Williams, J. D., & Young, S. (2005). The SACTI-1 corpus: Guide for research users. Technical Report. Department of Engineering, University of Cambridge.

  • Williams, J. D., & Young, S. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.

  • Zhang, B., Cai, Q., Mao, J., Chang, E., & Guo, B. (2001a). Spoken dialogue management as planning and acting under uncertainty. In Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech’01). Aalborg, Denmark.

  • Zhang, B., Cai, Q., Mao, J., & Guo, B. (2001b). Planning and acting under uncertainty: A new model for spoken dialogue system. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI’01), Seattle, WA, USA.

  • Ziebart, B., Maas, A., Bagnell, J., & Dey, A. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI’08). Chicago, IL, USA.

Author information

Corresponding author

Correspondence to B. Chaib-draa.

Cite this article

Chinaei, H., Chaib-draa, B. Dialogue POMDP components (Part II): learning the reward function. Int J Speech Technol 17, 325–340 (2014). https://doi.org/10.1007/s10772-014-9224-x
