
Task-based dialogue policy learning based on diffusion models


Abstract

Task-based dialogue systems aim to help users fulfil their dialogue goals in as few dialogue turns as possible. As demand grows, dialogue tasks increasingly involve multiple domains and become more complex and diverse, so achieving high performance at low computational cost has become an essential requirement for multi-domain task-based dialogue systems. This paper proposes a new approach to guided dialogue policy learning. The method introduces a conditional diffusion model into the reinforcement-learning Q-learning algorithm and regularises the policy in the manner of diffusion Q-learning: the conditional diffusion model is used to learn the action-value function, regularise the actions, and sample the actions used in the policy update, and an additional loss term that maximises the value of the sampled actions is added to the policy update to improve learning efficiency. Our method combines this conditional diffusion model with the TD3 reinforcement-learning algorithm as the dialogue policy and uses an inverse reinforcement learning approach to construct a reward estimator that provides rewards for policy updates, thereby completing multi-domain dialogue tasks.
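To make the objective concrete, the following is a minimal PyTorch sketch of a diffusion Q-learning policy update of the kind the abstract describes: a DDPM-style denoising loss regularises the policy towards the behaviour data, and a second term maximises a critic's value of actions sampled from the diffusion model. The network architecture, the linear noise schedule, the step count `T`, and the weight `eta` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a diffusion Q-learning policy objective (assumed DDPM
# conventions); all names and hyperparameters here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 50                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)    # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class NoisePredictor(nn.Module):
    """eps_theta(a_t, t, s): predicts the noise injected into an action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_t, t, s):
        t_emb = t.float().unsqueeze(-1) / T          # crude timestep embedding
        return self.net(torch.cat([a_t, t_emb, s], dim=-1))

def sample_action(model, s, action_dim):
    """Reverse diffusion chain: draw an action conditioned on the state.
    Gradients flow through the chain so the Q-maximisation term below
    can update the policy parameters."""
    a = torch.randn(s.shape[0], action_dim)
    for i in reversed(range(T)):
        t = torch.full((s.shape[0],), i, dtype=torch.long)
        eps = model(a, t, s)
        a = (a - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:                                    # add noise except at the last step
            a = a + betas[i].sqrt() * torch.randn_like(a)
    return a

def policy_loss(model, critic, s, expert_a, eta=1.0):
    """Denoising (behaviour-regularisation) loss plus Q-maximisation term."""
    t = torch.randint(0, T, (s.shape[0],))
    noise = torch.randn_like(expert_a)
    ab = alpha_bars[t].unsqueeze(-1)
    # Forward diffusion: corrupt the demonstrated action with noise.
    noisy_a = ab.sqrt() * expert_a + (1 - ab).sqrt() * noise
    bc_loss = F.mse_loss(model(noisy_a, t, s), noise)
    # Sample an action from the diffusion policy and push its value up.
    new_a = sample_action(model, s, expert_a.shape[-1])
    q_loss = -critic(s, new_a).mean()
    return bc_loss + eta * q_loss
```

Here `critic` stands for any action-value network Q(s, a); in the setting described above it would correspond to the TD3 critic, trained with rewards supplied by the inverse-reinforcement-learning reward estimator rather than hand-crafted task rewards.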


Data Availability and Access

Some or all data, models, or code generated or used during the study are available from the corresponding author upon request.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant no. 60971088), and in part by the Natural Science Foundation of Shandong Province (Grant no. ZR2020MF149). Special thanks to Hui Zhang from the School of Computer Science, Qufu Normal University for her work and suggestions during the manuscript revision process.

Author information


Contributions

Zhibin Liu: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing - original draft, Writing - review & editing; Rucai Pang: Data curation, Formal analysis, Methodology, Project administration, Software, Visualization, Writing - original draft, Writing - review & editing; Zhaoan Dong: Funding acquisition, Investigation, Supervision, Resources.

Corresponding author

Correspondence to Zhibin Liu.

Ethics declarations

Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and Informed Consent for Data Used

All authors have read this manuscript and agree to its publication. All authors approve of the data used in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, Z., Pang, R. & Dong, Z. Task-based dialogue policy learning based on diffusion models. Appl Intell 54, 11752–11764 (2024). https://doi.org/10.1007/s10489-024-05810-6

