Dialogue POMDP components (Part II): learning the reward function

International Journal of Speech Technology

Abstract

The partially observable Markov decision process (POMDP) framework has been applied in dialogue systems as a formal framework for representing uncertainty explicitly while remaining robust to noise. In this context, estimating the dialogue POMDP model components (states, observations, and reward) is a significant challenge, as they have a direct impact on the optimized dialogue POMDP policy. Learning the states and observations underlying a POMDP was covered in the first part (Part I), whereas this part (Part II) covers learning the reward function required by the POMDP. To this end, we propose two algorithms based on inverse reinforcement learning (IRL). The first, called POMDP-IRL-BT (BT for belief transition), approximates a belief transition model analogous to the transition models of Markov decision processes. The second, a point-based POMDP-IRL algorithm denoted PB-POMDP-IRL (PB for point-based), approximates the values of new beliefs, which arise during the computation of policy values, using a linear approximation of expert beliefs. Finally, we apply the two algorithms to healthcare dialogue management in order to learn a dialogue POMDP from dialogues collected by SmartWheeler (an intelligent wheelchair).
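
The abstract describes both IRL algorithms only at a high level. As a rough illustration of the belief-transition idea behind POMDP-IRL-BT, the Python sketch below builds a belief-to-belief transition model over the expert's belief points and computes the expert's discounted feature expectations, the quantity a linear reward is fitted to match in IRL (in the spirit of Ng and Russell 2000 and Abbeel and Ng 2004). The function names, the layout of episodes as (belief, action, observation) triples, and the nearest-point discretization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Standard POMDP belief update: b'(s') ∝ Z[a, o, s'] * sum_s T[a, s, s'] * b(s)."""
    b_next = Z[a, o] * (b @ T[a])
    total = b_next.sum()
    return b_next / total if total > 0 else b_next

def nearest_belief(b, belief_points):
    """Index of the expert belief point closest (in L2 distance) to b."""
    return int(np.argmin(np.linalg.norm(belief_points - b, axis=1)))

def estimate_belief_transitions(episodes, belief_points, T, Z, n_actions):
    """Approximate a belief transition model BT[a, i, j] ≈ Pr(b_j | b_i, a) over the
    finite set of expert belief points: propagate each expert belief through the
    POMDP dynamics and snap the successor to the nearest expert belief point."""
    n_b = len(belief_points)
    counts = np.zeros((n_actions, n_b, n_b))
    for episode in episodes:                # episode: list of (belief, action, observation)
        for b, a, o in episode:
            i = nearest_belief(b, belief_points)
            j = nearest_belief(belief_update(b, a, o, T, Z), belief_points)
            counts[a, i, j] += 1.0
    # Normalize visit counts into conditional distributions; unvisited rows stay uniform.
    row_sums = counts.sum(axis=2, keepdims=True)
    return np.where(row_sums > 0, counts / np.clip(row_sums, 1e-12, None), 1.0 / n_b)

def expert_feature_expectations(episodes, phi, gamma=0.95):
    """Discounted feature expectations of the expert, mu_E = E[sum_t gamma^t phi(b_t, a_t)],
    which a reward that is linear in phi is fitted to match."""
    mu = 0.0
    for episode in episodes:
        for t, (b, a, _) in enumerate(episode):
            mu = mu + (gamma ** t) * phi(b, a)
    return mu / len(episodes)
```

Given a reward that is linear in the features, the belief MDP induced by `BT` can be solved for candidate weight vectors, and the weights adjusted until the learned policy's feature expectations approach `mu_E`. The PB-POMDP-IRL variant described in the abstract instead approximates the values of newly encountered beliefs via a linear approximation over the expert beliefs.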


References

  • Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML’04). Banff, AB, Canada.

  • Boularias, A., Chinaei, H. R., & Chaib-draa, B. (2010). Learning the reward model of dialogue POMDPs from data. In NIPS 2010 Workshop on Machine Learning for Assistive Technologies. Vancouver, BC, Canada.

  • Boularias, A., Kober, J., & Peters, J. (2011). Relative entropy inverse reinforcement learning. Journal of Machine Learning Research—Proceedings Track, 15, 182–189.

  • Chandramohan, S., Geist, M., Lefèvre, F., & Pietquin, O. (2012). Behavior specific user simulation in spoken dialogue systems. In Proceedings of the IEEE ITG Conference on Speech Communication. Braunschweig, Germany.

  • Chinaei, H. R., & Chaib-draa, B. (2011). Learning dialogue POMDP models from data. In Proceedings of the 24th Canadian Conference on Advances in Artificial Intelligence (Canadian AI’11). St. John’s, NL, Canada.

  • Chinaei, H. R., & Chaib-draa, B. (2014). Dialogue POMDP components (Part I): Learning states and observations. International Journal of Speech Technology (this issue).

  • Chinaei, H. R., Chaib-draa, B., & Lamontagne, L. (2012). Learning observation models for dialogue POMDPs. In Proceedings of the 25th Canadian Conference on Advances in Artificial Intelligence (Canadian AI’12). Toronto, ON, Canada.

  • Choi, J., & Kim, K.-E. (2011). Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12, 691–730.

  • Gašić, M. (2011). Statistical Dialogue Modelling. PhD thesis, Department of Engineering, University of Cambridge.

  • Ji, S., Parr, R., Li, H., Liao, X., & Carin, L. (2007). Point-based policy iteration. In Proceedings of the 22nd National Conference on Artificial Intelligence (vol. 2) (AAAI’07). Vancouver, BC, Canada.

  • Kim, D., Kim, J., & Kim, K. (2011). Robust performance evaluation of POMDP-based dialogue systems. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 1029–1040.

  • Neu, G., & Szepesvári, C. (2007). Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI’07). Vancouver, BC, Canada.

  • Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML’00). Stanford, CA, USA.

  • Paek, T., & Pieraccini, R. (2008). Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, 50(8), 716–729.

  • Pinault, F., & Lefèvre, F. (2011). Semantic graph clustering for POMDP-based spoken dialog systems. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH’11). Florence, Italy.

  • Pineau, J., Gordon, G., & Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI’03). Acapulco, Mexico.

  • Pineau, J., West, R., Atrash, A., Villemure, J., & Routhier, F. (2011). On the feasibility of using a standardized test for evaluating a speech-controlled smart wheelchair. International Journal of Intelligent Control and Systems, 16(2), 124–131.

  • Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07). Hyderabad, India.

  • Roy, N., Pineau, J., & Thrun, S. (2000). Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (ACL’00). Hong Kong.

  • Spaan, M., & Vlassis, N. (2005). Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24(1), 195–220.

  • Syed, U., & Schapire, R. (2008). A game-theoretic approach to apprenticeship learning. In Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada.

  • Thomson, B. (2009). Statistical Methods for Spoken Dialogue Management. PhD thesis, Department of Engineering, University of Cambridge.

  • Williams, J. D. (2006). Partially Observable Markov Decision Processes for Spoken Dialogue Management. PhD thesis, Department of Engineering, University of Cambridge.

  • Williams, J. D., & Young, S. (2005). The SACTI-1 corpus: Guide for research users. Technical Report. Department of Engineering, University of Cambridge.

  • Williams, J. D., & Young, S. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.

  • Zhang, B., Cai, Q., Mao, J., Chang, E., & Guo, B. (2001a). Spoken dialogue management as planning and acting under uncertainty. In Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech’01). Aalborg, Denmark.

  • Zhang, B., Cai, Q., Mao, J., & Guo, B. (2001b). Planning and acting under uncertainty: A new model for spoken dialogue system. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI’01), Seattle, WA, USA.

  • Ziebart, B., Maas, A., Bagnell, J., & Dey, A. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI’08). Chicago, IL, USA.

Author information

Corresponding author

Correspondence to B. Chaib-draa.

Cite this article

Chinaei, H., Chaib-draa, B. Dialogue POMDP components (Part II): learning the reward function. Int J Speech Technol 17, 325–340 (2014). https://doi.org/10.1007/s10772-014-9224-x
