Safe Policy Improvement Approaches and Their Limitations

  • Conference paper
Agents and Artificial Intelligence (ICAART 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13786)

Abstract

Safe Policy Improvement (SPI) is an important technique for offline reinforcement learning in safety-critical applications, as it improves the behavior policy with high probability. We classify various SPI approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with Soft Baseline Bootstrapping) algorithms, we show that their claim of being provably safe does not hold. Based on this finding, we develop adaptations, the Adv-Soft-SPIBB algorithms, and show that they are provably safe. A heuristic adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all SPIBB algorithms in extensive experiments on two benchmarks. We also check the safety guarantees of the provably safe algorithms and show that huge amounts of data are necessary for the safety bounds to become useful in practice.
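To make the notion of "uncertainty of state-action pairs" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a count-based, Hoeffding-style error term and a per-state soft constraint of the form sum_a e(s,a) |pi(a|s) - pi_b(a|s)| <= epsilon, as commonly stated for Soft-SPIBB in Nadjahi et al. (2019). The function names, the example counts, and the values of delta and epsilon are hypothetical and chosen only for illustration.

```python
import math


def hoeffding_error(count, n_states, n_actions, delta):
    """Illustrative count-based uncertainty for one state-action pair.

    Assumed form: sqrt(2 / N(s, a) * log(2 |S| |A| / delta)).
    Unseen pairs (count == 0) get infinite uncertainty.
    """
    if count == 0:
        return math.inf
    return math.sqrt(2.0 / count * math.log(2.0 * n_states * n_actions / delta))


def satisfies_soft_constraint(pi, pi_b, errors, epsilon):
    """Check sum_a e(s, a) * |pi(a|s) - pi_b(a|s)| <= epsilon for one state."""
    total = 0.0
    for p, pb, e in zip(pi, pi_b, errors):
        diff = abs(p - pb)
        if diff > 0.0:  # skip unchanged actions so that inf * 0 never occurs
            total += e * diff
    return total <= epsilon


# Hypothetical single state with three actions and their observation counts.
counts = [100, 4, 0]
errors = [hoeffding_error(n, n_states=10, n_actions=3, delta=0.05) for n in counts]

pi_b = [0.5, 0.3, 0.2]       # behavior policy in this state
candidate = [0.6, 0.2, 0.2]  # shifts mass only between the observed actions

loose_ok = satisfies_soft_constraint(candidate, pi_b, errors, epsilon=0.5)
tight_ok = satisfies_soft_constraint(candidate, pi_b, errors, epsilon=0.1)
```

Because the assumed error term shrinks only with the square root of the count, rarely visited pairs dominate the sum, which gives some intuition for why bounds of this kind need large amounts of data before they permit meaningful deviations from the behavior policy.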

Notes

  1. https://github.com/Philipp238/Safe-Policy-Improvement-Approaches-on-Discrete-Markov-Decision-Processes

  2. https://github.com/Philipp238/Safe-Policy-Improvement-Approaches-on-Discrete-Markov-Decision-Processes/blob/master/auxiliary_tests/assumption_test.py

Acknowledgements

FD was partly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project 468830823. PS, CO and SU were partly funded by the German Federal Ministry of Education and Research, project 01IS18049A (ALICE III).

Author information

Corresponding author

Correspondence to Philipp Scholl.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Scholl, P., Dietrich, F., Otte, C., Udluft, S. (2022). Safe Policy Improvement Approaches and Their Limitations. In: Rocha, A.P., Steels, L., van den Herik, J. (eds) Agents and Artificial Intelligence. ICAART 2022. Lecture Notes in Computer Science, vol 13786. Springer, Cham. https://doi.org/10.1007/978-3-031-22953-4_4

  • DOI: https://doi.org/10.1007/978-3-031-22953-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22952-7

  • Online ISBN: 978-3-031-22953-4

  • eBook Packages: Computer Science (R0)
