Abstract
Offline Policy Evaluation (OPE) estimates and compares the values of complex decision-making policies in reinforcement learning using large, previously collected offline datasets. Recently, Model-Based Offline Policy Evaluation (MBOPE) methods have become popular because they are easy to implement and perform well. The model-based approach approximates the value of a given policy directly, using estimated transition and reward functions of the environment. A challenge remains, however, in selecting which of the trained models is appropriate for further use. We begin by analyzing an upper bound on the difference between the true value of a policy and the value approximated with a learned model. Our theoretical results show that this difference depends on the trajectories the given policy generates in the learned model and on the prediction errors of the transition and reward functions at those generated data points. Inspired by these results, we propose a novel criterion for determining which trained model is better suited to evaluating the given policy. Finally, we demonstrate the effectiveness of the proposed method on both simulated and benchmark offline datasets.
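As a concrete illustration, below is a minimal sketch in Python of the two ingredients the abstract describes: a Monte Carlo MBOPE value estimate obtained by rolling the policy out in a learned model, and a selection score that accumulates the model's prediction error along exactly those model-generated trajectories. All interfaces here (model(s, a), reward_fn(s, a), policy(s), pred_error(s, a)) are hypothetical placeholders, not the authors' code, and the paper's actual criterion, derived from its theoretical bound, differs in its details.

import numpy as np

def mbope_value_estimate(model, reward_fn, policy, init_states,
                         horizon=100, gamma=0.99):
    # Monte Carlo MBOPE: roll the target policy out in the learned
    # dynamics model and average the discounted returns.
    returns = []
    for s in init_states:
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                      # action from the target policy
            ret += discount * reward_fn(s, a)  # learned reward model
            s = model(s, a)                    # learned transition model
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))

def selection_score(model, policy, pred_error, init_states,
                    horizon=100, gamma=0.99):
    # Score a candidate model by the discounted prediction error it
    # accumulates along trajectories the policy induces in that model.
    # pred_error(s, a) is a placeholder for an estimate of the model's
    # one-step transition/reward error at (s, a), e.g. computed from
    # held-out offline transitions; lower scores are better.
    scores = []
    for s in init_states:
        err, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            err += discount * pred_error(s, a)
            s = model(s, a)
            discount *= gamma
        scores.append(err)
    return float(np.mean(scores))

In this sketch, choosing the candidate with the lowest selection score prefers the model whose errors are smallest where the evaluated policy actually visits it, rather than on the offline data distribution as a whole, which is roughly the intuition the bound formalizes.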
Acknowledgments
This work was supported in part by the Beijing Natural Science Foundation (L222051) and in part by the Fundamental Research Funds for the Central Universities (2022JBMC049).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Li, C., Wang, Y., Ma, ZM., Liu, Y. (2024). How to Select the Appropriate One from the Trained Models for Model-Based OPE. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14474. Springer, Singapore. https://doi.org/10.1007/978-981-99-9119-8_26
DOI: https://doi.org/10.1007/978-981-99-9119-8_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9118-1
Online ISBN: 978-981-99-9119-8