How to Select the Appropriate One from the Trained Models for Model-Based OPE

  • Conference paper
  • In: Artificial Intelligence (CICAI 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14474)

Abstract

Offline Policy Evaluation (OPE) is a method for evaluating and selecting complex policies in reinforcement learning for decision-making using large, offline datasets. Recently, Model-Based Offline Policy Evaluation (MBOPE) methods have become popular because they are easy to implement and perform well. The model-based approach provides a mechanism for approximating the value of a given policy directly using estimated transition and reward functions of the environment. However, a challenge remains in selecting an appropriate model from those trained for further use. We begin by analyzing the upper bound of the difference between the true value and the approximated value calculated using the model. Theoretical results show that this difference is related to the trajectories generated by the given policy on the learned model and the prediction error of the transition and reward functions at these generated data points. We then propose a novel criterion inspired by the theoretical results to determine which trained model is better suited for evaluating the given policy. Finally, we demonstrate the effectiveness of the proposed method on both simulated and benchmark offline datasets.
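The abstract describes the selection criterion only at a high level: roll out the evaluation policy inside each trained model and judge the model by the prediction error of its transition and reward functions on the data generated that way. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's actual algorithm; the class and function names (ToyLinearModel, rollout_in_model, error_fn, and so on) are assumptions introduced for illustration, and in a purely offline setting the prediction error itself would have to be estimated, for example from held-out data or ensemble disagreement.

    # Hypothetical sketch of the model-selection idea described in the abstract.
    # Each candidate model is scored by (an estimate of) its transition/reward
    # prediction error on trajectories that the evaluation policy generates
    # inside that same model; the model with the lowest score is selected.
    # All names below are illustrative assumptions, not the paper's interfaces.
    import numpy as np


    class ToyLinearModel:
        """Stand-in 'trained model': s' = A s + B a, r = c . s."""

        def __init__(self, A, B, c):
            self.A, self.B, self.c = A, B, c

        def predict(self, states, actions):
            next_states = states @ self.A.T + actions @ self.B.T
            rewards = states @ self.c
            return next_states, rewards


    def rollout_in_model(model, policy, init_states, horizon):
        """Collect (state, action) pairs by simulating `policy` inside `model`."""
        pairs = []
        states = np.asarray(init_states, dtype=float)
        for _ in range(horizon):
            actions = policy(states)
            pairs.append((states.copy(), actions.copy()))
            states, _ = model.predict(states, actions)
        return pairs


    def selection_score(model, policy, init_states, horizon, error_fn):
        """Average estimated prediction error of `model` on its own policy rollouts."""
        pairs = rollout_in_model(model, policy, init_states, horizon)
        return float(np.mean([error_fn(model, s, a) for s, a in pairs]))


    def select_model(models, policy, init_states, horizon, error_fn):
        """Return the candidate model with the smallest selection score."""
        scores = [selection_score(m, policy, init_states, horizon, error_fn)
                  for m in models]
        return models[int(np.argmin(scores))], scores


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        A, B, c = 0.9 * np.eye(2), np.ones((2, 1)), np.array([1.0, 0.0])
        candidates = [ToyLinearModel(A, B, c), ToyLinearModel(0.5 * A, B, c)]
        policy = lambda s: -0.1 * s[:, :1]        # toy deterministic policy
        init_states = rng.normal(size=(8, 2))     # e.g. initial states from the offline data

        # Placeholder error estimate; in practice this would come from held-out
        # data, ensemble disagreement, or another uncertainty proxy, since the
        # true transition and reward functions are unavailable offline.
        error_fn = lambda model, s, a: float(np.mean(np.abs(model.predict(s, a)[0] - s)))

        best, scores = select_model(candidates, policy, init_states, 10, error_fn)
        print("scores:", scores)

Because the rollouts are generated under the candidate model itself, the score is computed on exactly the state-action distribution that model-based OPE would use to evaluate the policy, which is the quantity the abstract's upper bound ties the value gap to.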

Acknowledgments

This work was supported in part by the Beijing Natural Science Foundation (L222051) and in part by the Fundamental Research Funds for the Central Universities (2022JBMC049).

Author information

Corresponding author

Correspondence to Yuting Liu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, C., Wang, Y., Ma, Z.M., Liu, Y. (2024). How to Select the Appropriate One from the Trained Models for Model-Based OPE. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science (LNAI), vol 14474. Springer, Singapore. https://doi.org/10.1007/978-981-99-9119-8_26

  • DOI: https://doi.org/10.1007/978-981-99-9119-8_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-9118-1

  • Online ISBN: 978-981-99-9119-8

  • eBook Packages: Computer Science, Computer Science (R0)
