Research Article

Sample-efficient batch reinforcement learning for dialogue management optimization

Published: 06 June 2011

Abstract

Spoken Dialogue Systems (SDS) are systems that interact with human beings using natural language as the medium of interaction. The dialogue policy plays a crucial role in determining the behavior of the dialogue management module. Handcrafting this policy is not always an option, given the complexity of the dialogue task and the stochastic behavior of users. In recent years, Reinforcement Learning (RL) has proven to be an efficient approach to dialogue policy optimization. Yet most conventional RL algorithms are data intensive and therefore rely on techniques such as user simulation, which introduce additional modeling errors. This paper explores the use of a set of approximate dynamic programming algorithms for policy optimization in SDS, combined with a method for learning a sparse representation of the value function. Experimental results show that these algorithms are particularly sample efficient when applied to dialogue management optimization, learning from only a few hundred dialogue examples. Because they learn in an off-policy manner, they can learn optimal policies from dialogue examples generated with a quite simple strategy. They can therefore learn good dialogue policies directly from data, avoiding user modeling errors.
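
To make the batch, off-policy setting concrete, here is a minimal Python sketch of least-squares policy iteration (LSPI), a representative approximate dynamic programming algorithm of the kind described above: it estimates an action-value function from a fixed set of transitions and greedily improves the policy. The feature map, reward, and toy task below are illustrative assumptions, not the paper's dialogue domain or its exact algorithms.

```python
import numpy as np

def lstd_q(transitions, phi, policy, n_features, gamma=0.95, reg=1e-3):
    """LSTD-Q: estimate weights w such that Q^pi(s, a) ~ w . phi(s, a)
    from a fixed batch of transitions.

    transitions: iterable of (state, action, reward, next_state, done) tuples
    phi: feature map phi(state, action) -> length-n_features array
    policy: the policy being evaluated, state -> action
    """
    A = reg * np.eye(n_features)  # regularized to keep the system invertible
    b = np.zeros(n_features)
    for s, a, r, s_next, done in transitions:
        f = phi(s, a)
        f_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(transitions, phi, actions, n_features, gamma=0.95, n_iter=20):
    """Least-Squares Policy Iteration on a fixed (off-policy) batch."""
    w = np.zeros(n_features)
    # Greedy policy w.r.t. the current weights (closure sees updated w).
    greedy = lambda s: max(actions, key=lambda a: phi(s, a) @ w)
    for _ in range(n_iter):
        w_new = lstd_q(transitions, phi, greedy, n_features, gamma)
        if np.linalg.norm(w_new - w) < 1e-6:  # policy iteration has converged
            w = w_new
            break
        w = w_new
    return w, greedy

if __name__ == "__main__":
    # Hypothetical two-state toy problem, NOT the paper's dialogue task:
    # taking action a moves the system to state a; reward only for (s=1, a=1).
    actions = [0, 1]
    n = 4
    phi = lambda s, a: np.eye(n)[2 * s + a]  # one-hot over (state, action)
    rng = np.random.default_rng(0)
    batch = []
    for _ in range(500):  # batch collected with a uniform-random strategy
        s, a = rng.integers(2), rng.integers(2)
        batch.append((s, a, float(s == 1 and a == 1), a, False))
    w, policy = lspi(batch, phi, actions, n)
    print([policy(s) for s in (0, 1)])  # learned greedy policy: [1, 1]
```

In the paper's setting the transitions would be (dialogue state, system act, reward, next dialogue state) tuples collected with a simple handcrafted strategy; because the learning is off-policy, the batch need not be generated by the policy being improved, which is what allows learning directly from a few hundred logged dialogues without a user simulator.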



Published In

ACM Transactions on Speech and Language Processing, Volume 7, Issue 3 (May 2011), 155 pages
ISSN: 1550-4875
EISSN: 1550-4883
DOI: 10.1145/1966407

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Received: 01 July 2010
Revised: 01 November 2010
Accepted: 01 February 2011
Published: 06 June 2011
Published in TSLP Volume 7, Issue 3


    Author Tags

    1. Spoken dialogue systems
    2. reinforcement learning

