Research Article

Sample-efficient batch reinforcement learning for dialogue management optimization

Published: 06 June 2011

Abstract

Spoken Dialogue Systems (SDS) are systems that interact with human beings using natural language as the medium of interaction. The dialogue policy plays a crucial role in determining the behavior of the dialogue management module. Handcrafting this policy is not always an option, given the complexity of the dialogue task and the stochastic behavior of users. In recent years, Reinforcement Learning (RL) has proven to be an efficient approach to dialogue policy optimization. Yet most conventional RL algorithms are data intensive and therefore rely on techniques such as user simulation, which introduce additional modeling errors. This paper explores the use of a set of approximate dynamic programming algorithms for policy optimization in SDS, combined with a method for learning a sparse representation of the value function. Experimental results show that these algorithms are particularly sample efficient when applied to dialogue management optimization, learning from only a few hundred dialogue examples. Because they learn in an off-policy manner, they can learn optimal policies from dialogue examples generated with a quite simple strategy. They can therefore learn good dialogue policies directly from data, avoiding user modeling errors.
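
To make the batch, off-policy setting concrete, here is a minimal Python sketch of least-squares policy iteration (LSPI), a representative approximate dynamic programming algorithm of the kind described above: it estimates an action-value function from a fixed set of transitions and greedily improves the policy. The feature map, reward, and toy task below are illustrative assumptions, not the paper's dialogue domain or its exact algorithms.

```python
import numpy as np

def lstd_q(transitions, phi, policy, n_features, gamma=0.95, reg=1e-3):
    """LSTD-Q: estimate weights w such that Q^pi(s, a) ~ w . phi(s, a)
    from a fixed batch of transitions.

    transitions: iterable of (state, action, reward, next_state, done) tuples
    phi: feature map phi(state, action) -> length-n_features array
    policy: the policy being evaluated, state -> action
    """
    A = reg * np.eye(n_features)  # regularized to keep the system invertible
    b = np.zeros(n_features)
    for s, a, r, s_next, done in transitions:
        f = phi(s, a)
        f_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(transitions, phi, actions, n_features, gamma=0.95, n_iter=20):
    """Least-Squares Policy Iteration on a fixed (off-policy) batch."""
    w = np.zeros(n_features)
    # Greedy policy w.r.t. the current weights (closure sees updated w).
    greedy = lambda s: max(actions, key=lambda a: phi(s, a) @ w)
    for _ in range(n_iter):
        w_new = lstd_q(transitions, phi, greedy, n_features, gamma)
        if np.linalg.norm(w_new - w) < 1e-6:  # policy iteration has converged
            w = w_new
            break
        w = w_new
    return w, greedy

if __name__ == "__main__":
    # Hypothetical two-state toy problem, NOT the paper's dialogue task:
    # taking action a moves the system to state a; reward only for (s=1, a=1).
    actions = [0, 1]
    n = 4
    phi = lambda s, a: np.eye(n)[2 * s + a]  # one-hot over (state, action)
    rng = np.random.default_rng(0)
    batch = []
    for _ in range(500):  # batch collected with a uniform-random strategy
        s, a = rng.integers(2), rng.integers(2)
        batch.append((s, a, float(s == 1 and a == 1), a, False))
    w, policy = lspi(batch, phi, actions, n)
    print([policy(s) for s in (0, 1)])  # learned greedy policy: [1, 1]
```

In the paper's setting the transitions would be (dialogue state, system act, reward, next dialogue state) tuples collected with a simple handcrafted strategy; because the learning is off-policy, the batch need not be generated by the policy being improved, which is what allows learning directly from a few hundred logged dialogues without a user simulator.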



Published In

ACM Transactions on Speech and Language Processing, Volume 7, Issue 3 (May 2011), 155 pages
ISSN: 1550-4875
EISSN: 1550-4883
DOI: 10.1145/1966407

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Received: 01 July 2010
Revised: 01 November 2010
Accepted: 01 February 2011
Published: 06 June 2011
Published in TSLP Volume 7, Issue 3


    Author Tags

    1. Spoken dialogue systems
    2. reinforcement learning

