
Offline model-based reinforcement learning with causal structured world models

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Model-based methods have recently shown promise for offline reinforcement learning (RL), which aims to learn good policies from historical data without interacting with the environment. Previous model-based offline RL methods learn a straightforward prediction model that maps states and actions directly to next-step states. However, such a prediction model tends to capture spurious relations induced by the preferences of the sampling policy behind the offline data. Instead, the environment model should focus on causal influences, which facilitates learning an effective policy that generalizes well to unseen states. In this paper, we first provide theoretical results showing that causal environment models can outperform plain environment models in offline RL, by incorporating the causal structure into the generalization error bound. We also propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structured World Models (FOCUS), to illustrate the feasibility of learning and leveraging causal structure in offline RL. Experimental results on two benchmarks show that FOCUS recovers the underlying causal structure accurately and robustly and, as a result, outperforms both plain model-based offline RL algorithms and other causal model-based offline RL algorithms.
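The abstract describes the core idea at a high level: constrain the learned dynamics model so that each next-state variable depends only on its causal parents among the state and action dimensions, rather than on every input the behavior policy happened to correlate with it. The sketch below is a minimal illustration of such a causally masked dynamics model, not the authors' FOCUS implementation; the class name CausalDynamicsModel, the per-dimension MLP heads, and the binary mask are illustrative assumptions, and the mask is assumed to come from a separate causal discovery step on the offline data (e.g., conditional independence tests).

```python
# Minimal sketch (assumed, not the paper's code) of a causal structured
# world model: each next-state dimension is predicted only from the
# state/action dimensions marked as its causal parents in a binary mask.
# The mask itself would be estimated from offline data, e.g. via
# conditional independence tests; here it is taken as given.
import torch
import torch.nn as nn


class CausalDynamicsModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int,
                 causal_mask: torch.Tensor, hidden: int = 256):
        """causal_mask: (state_dim, state_dim + action_dim) binary matrix;
        entry [j, i] = 1 means input dimension i is a causal parent of
        next-state dimension j."""
        super().__init__()
        assert causal_mask.shape == (state_dim, state_dim + action_dim)
        self.register_buffer("mask", causal_mask.float())
        # One small MLP per next-state dimension, fed only by its parents.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(state_dim)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state, action], dim=-1)        # (batch, state_dim + action_dim)
        preds = []
        for j, head in enumerate(self.heads):
            preds.append(head(x * self.mask[j]))      # zero out non-parent inputs
        return torch.cat(preds, dim=-1)               # predicted next state


if __name__ == "__main__":
    # Illustrative dimensions and a placeholder (fully connected) mask.
    s_dim, a_dim = 4, 2
    mask = torch.ones(s_dim, s_dim + a_dim)
    model = CausalDynamicsModel(s_dim, a_dim, mask)
    s, a = torch.randn(8, s_dim), torch.randn(8, a_dim)
    next_s_pred = model(s, a)
    loss = nn.functional.mse_loss(next_s_pred, torch.randn(8, s_dim))
    loss.backward()
```

Once fit by regression on offline transitions, such a masked model could be plugged into a standard model-based offline RL loop (for example, pessimistic short-horizon rollouts in the MOPO style); the point of the mask is that non-causal inputs, however strongly correlated under the behavior policy, never enter the prediction.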



Author information


Corresponding author

Correspondence to Yang Yu.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Zhengmao Zhu received the BSc degree from the Department of Mathematical Science, Zhejiang University, China in June 2018. He is currently pursuing the PhD degree with the School of Artificial Intelligence, Nanjing University, China. His current research interests mainly include reinforcement learning and causal learning. His work has been accepted at top artificial intelligence conferences, including NeurIPS and AAAI. He has served as a reviewer for NeurIPS, ICML, AAAI, etc.

Honglong Tian received the BSc degree from the Software Institute of Nanjing University, China in June 2022. He is currently a graduate student at the Software Institute of Nanjing University, China. His current research interests mainly include reinforcement learning. His work has been accepted at top artificial intelligence conferences, including NeurIPS and AAAI.

Xionghui Chen received the BSc degree from Southeast University, China in 2018. He is currently working towards the PhD degree in the National Key Lab for Novel Software Technology, School of Artificial Intelligence, Nanjing University, China. His research focuses on the challenges of applying reinforcement learning in real-world applications. His work has been accepted at top artificial intelligence conferences, including NeurIPS, AAMAS, DAI, and KDD. He has served as a reviewer for NeurIPS, IJCAI, KDD, DAI, etc.

Kun Zhang received the BS degree in automation from the University of Science and Technology of China, China in 2001, and the PhD degree in computer science from The Chinese University of Hong Kong, China in 2005. He is currently an associate professor in the Philosophy Department and an affiliate faculty member in the Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA. His research interests lie in causality, machine learning, and artificial intelligence, especially causal discovery, hidden causal representation learning, transfer learning, and general-purpose artificial intelligence.

Yang Yu received the PhD degree in computer science from Nanjing University, China in 2011, and is currently a professor at the School of Artificial Intelligence, Nanjing University, China. His research interests include machine learning, mainly reinforcement learning and derivative-free optimization for learning. Prof. Yu was granted the CCF-IEEE CS Young Scientist Award in 2020, recognized as one of AI's 10 to Watch by IEEE Intelligent Systems, and received the PAKDD Early Career Award in 2018. His team won the championship of the 2018 OpenAI Retro Contest on transfer reinforcement learning and the 2021 ICAPS Learning to Run a Power Network Challenge with Trust. He has served as an area chair for NeurIPS, ICML, IJCAI, AAAI, etc.

Electronic supplementary material


About this article


Cite this article

Zhu, Z., Tian, H., Chen, X. et al. Offline model-based reinforcement learning with causal structured world models. Front. Comput. Sci. 19, 194347 (2025). https://doi.org/10.1007/s11704-024-3946-y


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11704-024-3946-y

Keywords