
Model gradient: unified model and policy learning in model-based reinforcement learning

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Model-based reinforcement learning is a promising direction for improving the sample efficiency of reinforcement learning by learning a model of the environment. Previous model learning methods aim at fitting the transition data and commonly employ supervised learning to minimize the distance between the predicted state and the real state. Such supervised model learning, however, diverges from the ultimate goal of model learning, i.e., optimizing the policy that is learned in the model. In this work, we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment. We find that model learning towards this objective leads to a target of enhancing the similarity between the gradient computed on model-generated data and the gradient computed on real data. We thus derive the gradient of the model with respect to this target and propose the Model Gradient algorithm (MG), which integrates this model learning approach with policy-gradient-based policy optimization. Experiments on multiple locomotion control tasks show that MG not only achieves high sample efficiency but also attains better convergence performance than traditional model-based reinforcement learning approaches.
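To make the gradient-matching idea concrete, below is a minimal sketch of how such a model-learning loss could look in PyTorch. It assumes REINFORCE-style policy-gradient estimates and cosine similarity as the similarity measure; the function names and these choices are illustrative assumptions, not the exact objective derived in the paper.

```python
# Illustrative sketch only: train the model so that the policy gradient
# estimated on model-generated rollouts aligns with the policy gradient
# estimated on real rollouts.
import torch


def policy_gradient(log_probs, returns, policy_params, create_graph=False):
    """REINFORCE-style estimate of the gradient of the expected return
    with respect to the policy parameters."""
    objective = (log_probs * returns).mean()
    return torch.autograd.grad(objective, policy_params, create_graph=create_graph)


def gradient_matching_loss(real_grads, model_grads):
    """Negative cosine similarity between the flattened gradient estimates.
    Minimizing this with respect to the model parameters pushes the gradient
    on generated data towards the gradient on real data."""
    g_real = torch.cat([g.reshape(-1) for g in real_grads]).detach()
    g_model = torch.cat([g.reshape(-1) for g in model_grads])
    return -torch.nn.functional.cosine_similarity(g_real, g_model, dim=0)
```

In a training loop, `real_grads` would come from real transitions and be detached, while `model_grads` would be computed from model rollouts with `create_graph=True`, so that the loss can be backpropagated into the model parameters before the policy is updated on model-generated data.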



Acknowledgements

This work was supported by the National Key R&D Program of China (2020AAA0107200) and the National Natural Science Foundation of China (Grant No. 61921006). The authors would like to thank Lei Yuan and Xiong-Hui Chen for their helpful discussions.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yang Yu.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Chengxing Jia received a BSc degree from Shandong University, China in 2020. He is working towards a PhD in the National Key Lab for Novel Software Technology at the School of Artificial Intelligence, Nanjing University, China. His research interests include model-based reinforcement learning, offline reinforcement learning, and meta reinforcement learning. He has served as a reviewer for NeurIPS, ICML, etc.

Fuxiang Zhang received a BSc degree from Nanjing University, China in 2021. He is pursuing a master's degree in the National Key Lab for Novel Software Technology, Nanjing University, China. His research interests include offline reinforcement learning and multi-agent reinforcement learning.

Tian Xu received a BSc degree from Northwestern Polytechnical University, China in 2019. He is working towards a PhD in the National Key Lab for Novel Software Technology at the School of Artificial Intelligence, Nanjing University, China. His research interests include imitation learning and model-based RL.

Jing-Cheng Pang received a BSc degree from the University of Electronic Science and Technology of China, China in 2019. He has been pursuing his PhD with the School of Artificial Intelligence at Nanjing University, China, since 2019. His current research interests include reinforcement learning and machine learning. He has served as a reviewer for TETCI, ICML, UAI, etc. He is also a student committee member of the RLChina Community.

Zongzhang Zhang received his PhD degree in computer science from University of Science and Technology of China, China in 2012. He was a research fellow at the School of Computing, National University of Singapore, Singapore from 2012 to 2014, and a visiting scholar at the Department of Aeronautics and Astronautics, Stanford University, USA from 2018 to 2019. He is currently an associate professor at the School of Artificial Intelligence, Nanjing University, China. He has co-authored more than 60 research papers. His research interests include reinforcement learning, intelligent planning, and multi-agent learning.

Yang Yu received his BSc and PhD degrees in computer science from Nanjing University, China in 2004 and 2011, respectively. Currently, he is a professor at the School of Artificial Intelligence, Nanjing University, China. His research interests include machine learning, and he is currently working on real-world reinforcement learning. His work has been published in Artificial Intelligence, IJCAI, AAAI, NIPS, KDD, etc. He received several conference best paper awards, including those of IDEAL'16, GECCO'11, and PAKDD'08. He also received the CCF-IEEE CS Young Scientist Award in 2020, was recognized as one of the AI's 10 to Watch by IEEE Intelligent Systems in 2018, and received the PAKDD Early Career Award in 2018. He was invited to give an Early Career Spotlight Talk at IJCAI'18.



About this article


Cite this article

Jia, C., Zhang, F., Xu, T. et al. Model gradient: unified model and policy learning in model-based reinforcement learning. Front. Comput. Sci. 18, 184339 (2024). https://doi.org/10.1007/s11704-023-3150-5

