Abstract
Model-based methods have recently shown promise for offline reinforcement learning (RL), which aims to learn good policies from historical data without interacting with the environment. Previous model-based offline RL methods learn environment models that map states and actions directly to next-step states. However, such prediction models tend to capture spurious relations induced by the preferences of the sampling policy behind the offline data. Ideally, the environment model should focus on causal influences, which facilitates learning an effective policy that generalizes well to unseen states. In this paper, we first provide theoretical results showing that causal environment models can outperform plain environment models in offline RL, by incorporating the causal structure into the generalization error bound. We also propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structured World Models (FOCUS), to illustrate the feasibility of learning and leveraging causal structure in offline RL. Experimental results on two benchmarks show that FOCUS reconstructs the underlying causal structure accurately and robustly and, as a result, outperforms both model-based offline RL algorithms and causal model-based offline RL algorithms.
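To make the idea of a causal structured world model concrete, the sketch below shows one minimal way such a model could be realized: each next-state dimension is predicted only from the state and action dimensions marked as its causal parents by a binary mask. This is an illustrative, assumption-laden example, not the authors' FOCUS implementation; the CausalDynamicsModel class, the hand-specified mask, and the dimension choices are hypothetical, and FOCUS itself learns the structure from offline data rather than taking it as given.

```python
# A minimal sketch (assumed example, not the FOCUS implementation) of a
# causal-structured world model: next-state dimension i is predicted only from
# the (state, action) inputs marked as its causal parents by a binary mask M.
import torch
import torch.nn as nn


class CausalDynamicsModel(nn.Module):
    """Predicts s_{t+1} from (s_t, a_t), masking non-parent inputs per output dim."""

    def __init__(self, state_dim: int, action_dim: int, mask: torch.Tensor, hidden: int = 64):
        super().__init__()
        assert mask.shape == (state_dim, state_dim + action_dim)
        self.register_buffer("mask", mask.float())
        # One small MLP per next-state dimension, applied to its masked inputs.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(state_dim)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state, action], dim=-1)           # (batch, state_dim + action_dim)
        preds = [head(x * self.mask[i]) for i, head in enumerate(self.heads)]
        return torch.cat(preds, dim=-1)                   # (batch, state_dim)


if __name__ == "__main__":
    state_dim, action_dim = 3, 1
    # Hypothetical structure: dim 0 depends on (s0, a), dim 1 on (s0, s1), dim 2 on (s2, a).
    mask = torch.tensor([[1, 0, 0, 1],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1]])
    model = CausalDynamicsModel(state_dim, action_dim, mask)
    s, a = torch.randn(8, state_dim), torch.randn(8, action_dim)
    print(model(s, a).shape)  # torch.Size([8, 3])
```

In this sketch the mask simply zeroes out non-parent inputs before each per-dimension predictor, which is one common way to restrict a learned dynamics model to a given causal graph; in the paper's setting the graph itself would be recovered from the offline data rather than specified by hand.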
Ethics declarations
Competing interests The authors declare that they have no competing interests or financial conflicts to disclose.
Additional information
Zhengmao Zhu received the BSc degree from the Department of Mathematical Sciences, Zhejiang University, China in June 2018. He is currently pursuing the PhD degree with the School of Artificial Intelligence, Nanjing University, China. His current research interests mainly include reinforcement learning and causal learning. His works have been accepted at top conferences in artificial intelligence, including NeurIPS, AAAI, etc. He has served as a reviewer for NeurIPS, ICML, AAAI, etc.
Honglong Tian received the BSc degree from the Software Institute of Nanjing University, China in June 2022. He is currently a graduate student at the Software Institute of Nanjing University, China. His current research interests mainly include reinforcement learning. His works have been accepted at top conferences in artificial intelligence, including NeurIPS, AAAI, etc.
Xionghui Chen received the BSc degree from Southeast University, China in 2018. He is currently working towards the PhD degree in the National Key Lab for Novel Software Technology, School of Artificial Intelligence, Nanjing University, China. His research focuses on handling the challenges of reinforcement learning in real-world applications. His works have been accepted at top conferences in artificial intelligence, including NeurIPS, AAMAS, DAI, KDD, etc. He has served as a reviewer for NeurIPS, IJCAI, KDD, DAI, etc.
Kun Zhang received the BS degree in automation from the University of Science and Technology of China, China in 2001, and the PhD degree in computer science from The Chinese University of Hong Kong, China in 2005. He is currently an associate professor with the Philosophy Department and an Affiliate Faculty Member with the Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA. His research interests lie in causality, machine learning, and artificial intelligence, especially in causal discovery, hidden causal representation learning, transfer learning, and general-purpose artificial intelligence.
Yang Yu received the PhD degree in computer science from Nanjing University, China in 2011, and is currently a professor at the School of Artificial Intelligence, Nanjing University, China. His research interests include machine learning, mainly reinforcement learning and derivative-free optimization for learning. Prof. Yu was granted the CCF-IEEE CS Young Scientist Award in 2020, recognized as one of the AI's 10 to Watch by IEEE Intelligent Systems, and received the PAKDD Early Career Award in 2018. His team won the championship of the 2018 OpenAI Retro Contest on transfer reinforcement learning and the 2021 ICAPS Learning to Run a Power Network Challenge with Trust. He has served as an area chair for NeurIPS, ICML, IJCAI, AAAI, etc.
Cite this article
Zhu, Z., Tian, H., Chen, X. et al. Offline model-based reinforcement learning with causal structured world models. Front. Comput. Sci. 19, 194347 (2025). https://doi.org/10.1007/s11704-024-3946-y