ABSTRACT
Establishing a robust dialogue policy at low computational cost is challenging, especially for multi-domain task-oriented dialogue management, owing to the high complexity of the state and action spaces. Previous works, which mostly rely on deterministic policy optimization, attain only moderate performance, while the state-of-the-art end-to-end approach is computationally demanding because it builds on a large-scale language model, the generative pre-trained transformer-2 (GPT-2). In this study, a new learning procedure consisting of three stages is presented to improve multi-domain dialogue management with corrective guidance. First, behavior cloning with an auxiliary task is developed to build a robust pre-trained model by mitigating the causal confusion problem in imitation learning. Next, the pre-trained model is rectified by reinforcement learning via proximal policy optimization. Finally, a human-in-the-loop learning strategy enhances agent performance by providing corrective feedback directly from a rule-based agent, preventing the agent from becoming trapped in confounded states. Experiments on end-to-end evaluation show that the proposed learning method achieves a state-of-the-art result, performing nearly identically to the rule-based agent, and outperforms the second-place system of the 9th dialog system technology challenge (DSTC9) track 2, which uses GPT-2 as its core dialogue-management model.
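The three-stage procedure described above can be sketched in miniature. The toy policy, the clipped update standing in for PPO, and the drift threshold used as a proxy for a "confounded state" are all illustrative assumptions, not the authors' implementation; the sketch only shows how the stages compose: supervised imitation of a rule-based expert, a clipped RL refinement step, and a corrective override when the agent strays too far from the expert.

```python
# Hypothetical sketch of the three-stage learning procedure: behavior
# cloning, a PPO-style clipped refinement, then corrective feedback from
# a rule-based expert. All names and constants here are illustrative.

EXPERT_ACTION = 1.0  # the rule-based agent's action for the toy state


def expert_policy(state):
    # Stand-in for the rule-based agent used as the corrective teacher.
    return EXPERT_ACTION


class ToyPolicy:
    """One-parameter policy: predicts a scalar action for any state."""

    def __init__(self):
        self.w = 0.0

    def act(self, state):
        return self.w

    def update(self, target, lr=0.5):
        # Gradient step on squared error toward the target action.
        self.w += lr * (target - self.w)


policy = ToyPolicy()

# Stage 1: behavior cloning -- supervised imitation of expert actions.
for _ in range(10):
    policy.update(expert_policy(state=None))


# Stage 2: RL refinement -- a clipped update standing in for PPO, which
# bounds each policy change to keep it close to the pre-trained model.
def clipped_step(policy, advantage, eps=0.2):
    step = max(-eps, min(eps, advantage))
    policy.w += step


clipped_step(policy, advantage=0.05)


# Stage 3: human-in-the-loop correction -- when the agent's action drifts
# from the expert's (a proxy for a confounded state), the expert overrides.
def corrected_action(policy, state, tol=0.1):
    a = policy.act(state)
    e = expert_policy(state)
    return e if abs(a - e) > tol else a


final = corrected_action(policy, state=None)
```

After behavior cloning the toy weight sits near the expert's action, the clipped step nudges it by at most `eps`, and the correction stage only overrides the agent when it drifts beyond the tolerance, mirroring how corrective guidance is applied only in problematic states.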
Index Terms
- Corrective Guidance and Learning for Dialogue Management