Abstract
Model-based reinforcement learning is a promising direction for improving the sample efficiency of reinforcement learning by learning a model of the environment. Previous model learning methods aim at fitting the transition data and commonly employ supervised learning to minimize the distance between the predicted state and the real state. Supervised model learning, however, diverges from the ultimate goal of model learning, i.e., optimizing the policy learned in the model. In this work, we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment. We find that model learning towards this objective yields the target of making the policy gradient estimated on model-generated data similar to the policy gradient estimated on real data. We thus derive the gradient of the model from this target and propose the Model Gradient algorithm (MG) to integrate this novel model learning approach with policy-gradient-based policy optimization. We conduct experiments on multiple locomotion control tasks and find that MG not only achieves high sample efficiency but also leads to better convergence performance than traditional model-based reinforcement learning approaches.
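For intuition, the gradient-matching idea behind MG can be sketched as follows: the dynamics model is updated so that the policy gradient computed from model-generated data resembles the policy gradient computed from real data. The snippet below is a minimal, hypothetical PyTorch illustration of that idea under simplifying assumptions (toy networks, REINFORCE-style gradients, a placeholder reward on predicted next states, cosine similarity as the gradient-matching measure); the names policy, model, policy_gradient, and model_returns are illustrative, and this is not the exact MG objective or implementation from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 4, 2  # toy dimensions for illustration

# Hypothetical toy networks standing in for the policy and the dynamics model.
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
model = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, obs_dim))

def policy_gradient(states, actions, returns):
    # Flattened REINFORCE-style policy gradient on a batch of transitions.
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).mean()
    grads = torch.autograd.grad(loss, tuple(policy.parameters()), create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def model_returns(states, actions):
    # Placeholder: a toy reward computed on the model's predicted next state,
    # standing in for returns obtained from model rollouts.
    next_states = model(torch.cat([states, F.one_hot(actions, act_dim).float()], dim=-1))
    return next_states.pow(2).sum(dim=-1)

# Placeholder batch standing in for real environment transitions.
states = torch.randn(64, obs_dim)
actions = torch.randint(0, act_dim, (64,))
real_returns = torch.randn(64)

grad_real = policy_gradient(states, actions, real_returns).detach()
grad_model = policy_gradient(states, actions, model_returns(states, actions))

# Model loss: push the policy gradient induced by the model towards the
# policy gradient estimated on real data.
model_loss = 1.0 - F.cosine_similarity(grad_model, grad_real, dim=0)
model_loss.backward()  # gradients flow back into the dynamics model's parameters

Other similarity measures (e.g., an L2 distance between the two gradient vectors) could serve the same illustrative purpose; the cosine form here is only one possible choice.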
Acknowledgements
This work was supported by the National Key R&D Program of China (2020AAA0107200) and the National Natural Science Foundation of China (Grant No. 61921006). The authors would like to thank Lei Yuan and Xiong-Hui Chen for their helpful discussions.
Ethics declarations
Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.
Additional information
Chengxing Jia received a BSc degree from Shandong University, China in 2020. He is working towards a PhD in the National Key Lab for Novel Software Technology at the School of Artificial Intelligence, Nanjing University, China. His research interests include model-based reinforcement learning, offline reinforcement learning, and meta reinforcement learning. He has served as a reviewer for NeurIPS, ICML, etc.
Fuxiang Zhang received a BSc degree from Nanjing University, China in 2021. He is pursuing a master's degree in the National Key Lab for Novel Software Technology, Nanjing University, China. His research interests include offline reinforcement learning and multi-agent reinforcement learning.
Tian Xu received a BSc degree from Northwestern Polytechnical University, China in 2019. He is working towards a PhD in the National Key Lab for Novel Software Technology at the School of Artificial Intelligence, Nanjing University, China. His research interests include imitation learning and model-based reinforcement learning.
Jing-Cheng Pang received a BSc degree from the University of Electronic Science and Technology of China, China in 2019. He has been pursuing his PhD with the School of Artificial Intelligence at Nanjing University, China, since 2019. His current research interests include reinforcement learning and machine learning. He has served as a reviewer for TETCI, ICML, UAI, etc. He is also a student committee member of the RLChina Community.
Zongzhang Zhang received his PhD degree in computer science from the University of Science and Technology of China, China in 2012. He was a research fellow at the School of Computing, National University of Singapore, Singapore from 2012 to 2014, and a visiting scholar at the Department of Aeronautics and Astronautics, Stanford University, USA from 2018 to 2019. He is currently an associate professor at the School of Artificial Intelligence, Nanjing University, China. He has co-authored more than 60 research papers. His research interests include reinforcement learning, intelligent planning, and multi-agent learning.
Yang Yu received his BSc and PhD degrees in computer science from Nanjing University, China in 2004 and 2011, respectively. Currently, he is a professor at the School of Artificial Intelligence, Nanjing University, China. His research interests include machine learning, and he is currently working on real-world reinforcement learning. His work has been published in Artificial Intelligence, IJCAI, AAAI, NIPS, KDD, etc. He received several conference best paper awards, including at IDEAL 2016, GECCO 2011, and PAKDD 2008. He also received the CCF-IEEE CS Young Scientist Award in 2020, was recognized as one of the AI's 10 to Watch by IEEE Intelligent Systems in 2018, and received the PAKDD Early Career Award in 2018. He was invited to give an Early Career Spotlight Talk at IJCAI 2018.
Cite this article
Jia, C., Zhang, F., Xu, T. et al. Model gradient: unified model and policy learning in model-based reinforcement learning. Front. Comput. Sci. 18, 184339 (2024). https://doi.org/10.1007/s11704-023-3150-5