Abstract
Task-based dialogue systems aim to help users satisfy their dialogue goals in as few dialogue turns as possible. As demand grows, dialogue tasks increasingly span multiple domains and become more complex and diverse, so achieving high performance at low computational cost has become an essential requirement for multi-domain task-based dialogue systems. This paper proposes a new approach to guided dialogue policy learning. The method introduces a conditional diffusion model into the reinforcement learning Q-learning algorithm to regularise the policy in the manner of diffusion Q-learning: the conditional diffusion model is trained alongside the action-value function, its denoising objective regularises the actions, actions sampled from it are used in the policy update, and an additional loss term that maximises the value of the sampled actions is added to the policy update to improve learning efficiency. The proposed method combines this conditional diffusion model with the reinforcement learning TD3 algorithm as the dialogue policy, and uses an inverse reinforcement learning approach to construct a reward estimator that provides rewards for policy updates, thereby completing multi-domain dialogue tasks.
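For concreteness, the policy objective described above can be written in the general form used by diffusion Q-learning; the notation below is ours and the exact formulation in the paper may differ:

\[
\mathcal{L}(\theta) = \mathcal{L}_d(\theta) - \alpha \, \mathbb{E}_{s \sim \mathcal{D},\, a^0 \sim \pi_\theta(\cdot \mid s)} \big[ Q_\phi(s, a^0) \big],
\]
\[
\mathcal{L}_d(\theta) = \mathbb{E}_{i \sim \mathcal{U}(1,N),\, \epsilon \sim \mathcal{N}(0,I),\, (s,a) \sim \mathcal{D}} \Big[ \big\| \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_i}\, a + \sqrt{1-\bar{\alpha}_i}\, \epsilon,\; s,\; i \big) \big\|^2 \Big],
\]

where \(\mathcal{L}_d\) is the denoising (behaviour-regularisation) term of the conditional diffusion policy, \(\epsilon_\theta\) is the conditional noise-prediction network, \(Q_\phi\) is the learned action-value function, \(a^0\) is an action sampled from the diffusion policy, and \(\alpha\) weights the value-maximisation term that improves learning efficiency.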
Data Availability and Access
Some or all data, models, or code generated or used during the study are available from the corresponding author by request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant no. 60971088), and in part by the Natural Science Foundation of Shandong Province (Grant no. ZR2020MF149). Special thanks to Hui Zhang from the School of Computer Science, Qufu Normal University for her work and suggestions during the manuscript revision process.
Author information
Authors and Affiliations
Contributions
Zhibin Liu: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing - Original Draft, Writing - Review & Editing; Rucai Pang: Data curation, Formal analysis, Methodology, Project administration, Software, Visualization, Writing - Original Draft, Writing - Review & Editing; Zhaoan Dong: Funding acquisition, Investigation, Supervision, Resources.
Corresponding author
Ethics declarations
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and Informed Consent for Data Used
All authors have read this manuscript and agree to its publication. All authors approve of the data used in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, Z., Pang, R. & Dong, Z. Task-based dialogue policy learning based on diffusion models. Appl Intell 54, 11752–11764 (2024). https://doi.org/10.1007/s10489-024-05810-6