Abstract
Task-based dialogue systems aim to help users satisfy their dialogue goals in as few dialogue turns as possible. As demand grows, dialogue tasks increasingly span multiple domains and become more complex and diverse, so achieving high performance at low computational cost has become an essential requirement for multi-domain task-based dialogue systems. This paper proposes a new approach to guided dialogue policy learning. The method introduces a conditional diffusion model into the reinforcement learning Q-learning algorithm to regularise the policy in the manner of diffusion Q-learning: the conditional diffusion model is trained alongside the action-value function, its denoising objective regularises the actions, actions sampled from it are used in the policy update, and an additional loss term that maximises the value of the sampled actions is added to the policy update to improve learning efficiency. The proposed method combines this conditional diffusion model with the reinforcement learning TD3 algorithm as the dialogue policy, and uses an inverse reinforcement learning approach to construct a reward estimator that provides rewards for policy updates, thereby completing multi-domain dialogue tasks.
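For concreteness, the policy objective described above can be written in the general form used by diffusion Q-learning; the notation below is ours and the exact formulation in the paper may differ:

\[
\mathcal{L}(\theta) = \mathcal{L}_d(\theta) - \alpha \, \mathbb{E}_{s \sim \mathcal{D},\, a^0 \sim \pi_\theta(\cdot \mid s)} \big[ Q_\phi(s, a^0) \big],
\]
\[
\mathcal{L}_d(\theta) = \mathbb{E}_{i \sim \mathcal{U}(1,N),\, \epsilon \sim \mathcal{N}(0,I),\, (s,a) \sim \mathcal{D}} \Big[ \big\| \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_i}\, a + \sqrt{1-\bar{\alpha}_i}\, \epsilon,\; s,\; i \big) \big\|^2 \Big],
\]

where \(\mathcal{L}_d\) is the denoising (behaviour-regularisation) term of the conditional diffusion policy, \(\epsilon_\theta\) is the conditional noise-prediction network, \(Q_\phi\) is the learned action-value function, \(a^0\) is an action sampled from the diffusion policy, and \(\alpha\) weights the value-maximisation term that improves learning efficiency.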
Data Availability and Access
Some or all data, models, or code generated or used during the study are available from the corresponding author by request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant no. 60971088), and in part by the Natural Science Foundation of Shandong Province (Grant no. ZR2020MF149). Special thanks to Hui Zhang from the School of Computer Science, Qufu Normal University for her work and suggestions during the manuscript revision process.
Author information
Authors and Affiliations
Contributions
Zhibin Liu: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing - Original Draft, Writing - Review & Editing; Rucai Pang: Data curation, Formal analysis, Methodology, Project administration, Software, Visualization, Writing - Original Draft, Writing - Review & Editing; Zhaoan Dong: Funding acquisition, Investigation, Supervision, Resources.
Corresponding author
Ethics declarations
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and Informed Consent for Data Used
All authors have read this manuscript and agree to its publication. All authors approve of the data used in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, Z., Pang, R. & Dong, Z. Task-based dialogue policy learning based on diffusion models. Appl Intell 54, 11752–11764 (2024). https://doi.org/10.1007/s10489-024-05810-6