Abstract
Safe Policy Improvement (SPI) is an important technique for offline reinforcement learning in safety-critical applications, as it improves the behavior policy with high probability. We classify various SPI approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with Soft Baseline Bootstrapping) algorithms, we show that their claim of being provably safe does not hold. Based on this finding, we develop adaptations, the Adv-Soft-SPIBB algorithms, and prove that they are safe. A heuristic adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all SPIBB algorithms in extensive experiments on two benchmarks. We also examine the safety guarantees of the provably safe algorithms and show that large amounts of data are necessary for the safety bounds to become useful in practice.
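To make the soft-baseline-bootstrapping idea concrete, the following is a minimal sketch, not the authors' implementation, of the budgeted greedy improvement step in the spirit of Approx-Soft-SPIBB for a tabular MDP. Per state, probability mass is moved from low-value to high-value actions, but the total change is limited by the uncertainty-weighted constraint Σ_a |π(a|s) − π_b(a|s)| · e(s,a) ≤ ε, so the policy stays close to the baseline exactly where the data is scarce. The function name `soft_spibb_step` and the simplified Hoeffding-style error proxy e(s,a) = sqrt(2 / N(s,a)), which drops the confidence- and size-dependent log factor used in the literature, are assumptions made here for illustration.

```python
import numpy as np

def soft_spibb_step(q, pi_b, counts, eps):
    """Budgeted greedy improvement over a baseline policy (illustrative sketch).

    q:      (S, A) action-value estimates from the offline dataset
    pi_b:   (S, A) behavior (baseline) policy
    counts: (S, A) state-action visit counts N_D(s, a)
    eps:    per-state budget for the uncertainty-weighted policy change
    """
    n_states, _ = q.shape
    # Hoeffding-style error proxy: rarely observed pairs get a large e(s, a).
    e = np.sqrt(2.0 / np.maximum(counts, 1))
    pi = pi_b.copy()
    for s in range(n_states):
        budget = eps
        order = np.argsort(q[s])  # actions from lowest to highest value
        a_best = order[-1]
        for a in order[:-1]:
            if budget <= 0 or pi[s, a] == 0:
                continue
            # Moving one unit of mass from a to a_best changes the
            # constraint term by e(s, a) + e(s, a_best).
            cost = e[s, a] + e[s, a_best]
            move = min(pi[s, a], budget / cost)
            pi[s, a] -= move
            pi[s, a_best] += move
            budget -= move * cost
    return pi

# Toy usage: 2 states, 3 actions, uniform baseline.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3))
pi_b = np.full((2, 3), 1 / 3)
counts = rng.integers(1, 50, size=(2, 3))
pi = soft_spibb_step(q, pi_b, counts, eps=0.5)
print(pi.sum(axis=1))  # rows remain valid distributions
```

Iterating this step with re-evaluation of q under the new policy gives a policy-iteration scheme; the safety analysis in the paper concerns how the choice of e(s,a) and ε controls the probability that the resulting policy underperforms the baseline.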
Acknowledgements
FD was partly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project 468830823. PS, CO, and SU were partly funded by the German Federal Ministry of Education and Research, project 01IS18049A (ALICE III).