Safe Reinforcement Learning Through Regret and State Restorations in Evaluation Stages

  • Chapter in: Principles of Verification: Cycling the Probabilistic Landscape

Abstract

Deep reinforcement learning (DRL) has succeeded tremendously in many complex decision-making tasks. However, for many real-world applications, standard DRL training results in agents with brittle performance because, in particular for safety-critical problems, the discovery of strategies that are both safe and successful is very challenging. Various exploration strategies have been proposed to address this problem. However, they do not take information about the current safety performance into account; thus, they fail to systematically focus on the parts of the state space most relevant for training. Here, we propose regret and state restoration in evaluation-based deep reinforcement learning (RARE), a framework that introduces two innovations: (i) it combines safety evaluation stages with state restorations, i.e., restarting episodes in formerly visited states, and (ii) it exploits estimations of the regret, i.e., the gap between the policy's current and optimal performance. We show that both innovations are beneficial and that RARE outperforms baselines such as deep Q-learning and Go-Explore in an empirical evaluation.
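To make innovation (ii) concrete, the following is a minimal sketch, not the authors' implementation, of how per-state regret estimates could be turned into a restart distribution over archived states; `estimate_value` and the return bound `v_max` are illustrative assumptions.

```python
import numpy as np

def regret_restart_distribution(archive, estimate_value, v_max):
    """Turn regret estimates into restart probabilities over archived states.

    estimate_value(s): estimated return of the current policy from state s
    v_max:             assumed upper bound on the achievable return, used as
                       a stand-in for the unknown optimal value
    """
    # Regret = gap between (a proxy for) the optimal return and the current one.
    regrets = np.array([max(v_max - estimate_value(s), 0.0) for s in archive])
    if regrets.sum() == 0.0:
        # No estimated regret anywhere: fall back to uniform restarts.
        return np.full(len(archive), 1.0 / len(archive))
    return regrets / regrets.sum()

# Usage (hypothetical value_fn): sample restart states in proportion to regret.
# rng = np.random.default_rng(0)
# idx = rng.choice(len(archive), p=regret_restart_distribution(archive, value_fn, v_max=1.0))
# start_state = archive[idx]
```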

Notes

  1. The goal-conditioned policy is tasked with reaching the sampled state s, after which the training continues with the regular policy \(\pi _\theta \). We follow the concept introduced by the Go-Explore algorithm [12] (see the sketch after these notes).

  2. Note that the learning stage uses the states from the current archive \(\mathscr {A}_j\) for restarting episodes, whereas archive \(\mathscr {A}_{j+1}\) collects the states stored for the next evaluation stage.

  3. The application of the Euclidean distance assumes that the state representation is based on physical attributes, such as coordinates or velocities, as is often the case in RL benchmarks; the sketch after these notes illustrates such a check. This method would also be effective for image-based state representations. Alternative distance metrics might need to be considered if this assumption is not fulfilled.

  4. Otherwise, we report the training as failed.
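To illustrate notes 1 and 3 together (as referenced there), here is a minimal sketch of the restoration step, assuming a Gymnasium-style environment; the optional `restore` method, the goal-conditioned policy `pi_goal`, and the tolerance are hypothetical names, not the chapter's interface.

```python
import numpy as np

def close(a, b, tol=1e-3):
    """Euclidean closeness check on physical state vectors (note 3)."""
    return np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) < tol

def restore_state(env, target_state, pi_goal, max_steps=200):
    """Restart an episode in (or near) a previously visited state (note 1).

    Either the simulator exposes a deterministic restore, or a goal-conditioned
    policy is rolled out until it reaches the target, as in Go-Explore [12];
    afterwards training continues with the regular policy pi_theta.
    """
    if hasattr(env, "restore"):                   # assumed simulator feature
        return env.restore(target_state)

    state, _ = env.reset()
    for _ in range(max_steps):
        if close(state, target_state):
            break
        action = pi_goal(state, target_state)     # policy conditioned on the goal
        state, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            state, _ = env.reset()
    return state
```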

References

  1. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), pp. 2669–2678. AAAI Press (2018)

  2. Amit, R., Meir, R., Ciosek, K.: Discount factor as a regularizer in reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 269–278. PMLR (2020)

  3. Anderson, G., Chaudhuri, S., Dillig, I.: Guiding safe exploration with weakest preconditions. In: The Eleventh International Conference on Learning Representations (2022)

  4. Andrychowicz, M., et al.: Hindsight experience replay. In: Advances in Neural Information Processing Systems, pp. 5048–5058 (2017)

  5. Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger, S.: Run-time optimization for learned controllers through quantitative games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36

  6. Azar, M.G., Osband, I., Munos, R.: Minimax regret bounds for reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 263–272. PMLR (2017)

  7. Baier, C., Christakis, M., Gros, T.P., Groß, D., Gumhold, S., Hermanns, H., Hoffmann, J., Klauck, M.: Lab conditions for research on explainable automated decisions. In: Heintz, F., Milano, M., O’Sullivan, B. (eds.) TAILOR 2020. LNCS (LNAI), vol. 12641, pp. 83–90. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73959-1_8

  8. Bharadhwaj, H., Kumar, A., Rhinehart, N., Levine, S., Shkurti, F., Garg, A.: Conservative safety critics for exploration. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)

  9. Burda, Y., Edwards, H., Storkey, A.J., Klimov, O.: Exploration by random network distillation. In: Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview (2019)

  10. Campero, A., Raileanu, R., Küttler, H., Tenenbaum, J.B., Rocktäschel, T., Grefenstette, E.: Learning with AMIGo: adversarially motivated intrinsic goals. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)

  11. Chevalier-Boisvert, M., et al.: BabyAI: a platform to study the sample efficiency of grounded language learning. In: Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview (2019)

  12. Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K.O., Clune, J.: First return, then explore. Nature 590(7847), 580–586 (2021)

  13. Flet-Berliac, Y., Ferret, J., Pietquin, O., Preux, P., Geist, M.: Adversarially guided actor-critic. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)

  14. Fujita, Y., Nagarajan, P., Kataoka, T., Ishikawa, T.: ChainerRL: a deep reinforcement learning library. J. Mach. Learn. Res. 22, 77:1–77:14 (2021)

  15. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)

  16. Gros, T.P., et al.: DSMC evaluation stages: fostering robust and safe behavior in deep reinforcement learning - extended version. ACM Trans. Model. Comput. Simulat. 33(4), 17:1–17:28 (2023). https://doi.org/10.1145/3607198

  17. Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Köhl, M.A., Wolf, V.: MoGym: using formal models for training and verifying decision-making agents. In: Shoham, S., Vizel, Y. (eds.) CAV 2022, Part II, pp. 430–443. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13188-2_21

  18. Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Steinmetz, M.: Deep statistical model checking. In: Gotsman, A., Sokolova, A. (eds.) FORTE 2020. LNCS, vol. 12136, pp. 96–114. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50086-3_6

  19. Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Steinmetz, M.: Analyzing neural network behavior through deep statistical model checking. Int. J. Softw. Tools Technol. Transfer 25(3), 407–426 (2023)

  20. Gros, T.P., Höller, D., Hoffmann, J., Klauck, M., Meerkamp, H., Wolf, V.: DSMC evaluation stages: fostering robust and safe behavior in deep reinforcement learning. In: Abate, A., Marin, A. (eds.) QEST 2021. LNCS, vol. 12846, pp. 197–216. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85172-9_11

  21. Gros, T.P., Höller, D., Hoffmann, J., Wolf, V.: Tracking the race between deep reinforcement learning and imitation learning. In: Gribaudo, M., Jansen, D.N., Remke, A. (eds.) QEST 2020. LNCS, vol. 12289, pp. 11–17. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59854-9_2

  22. Gu, S., Holly, E., Lillicrap, T.P., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. IEEE Press (2017)

  23. Hare, J.: Dealing with sparse rewards in reinforcement learning. arXiv preprint arXiv:1910.09281 (2019)

  24. Hasanbeig, M., Abate, A., Kroening, D.: Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099 (2018)

  25. Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields. In: Proceedings of the 31st International Conference on Concurrency Theory (CONCUR), pp. 3:1–3:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

  26. Jegourel, C., Legay, A., Sedwards, S.: Importance splitting for statistical model checking rare properties. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 576–591. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39799-8_38

  27. Jiang, M., Dennis, M., Parker-Holder, J., Foerster, J.N., Grefenstette, E., Rocktäschel, T.: Replay-guided adversarial environment design. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 1884–1897 (2021)

  28. Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. arXiv preprint arXiv:1612.00796 (2016)

  29. Knox, W.B., Stone, P.: Reinforcement learning from human reward: discounting in episodic tasks. In: Proceedings of the 21st IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 878–885. IEEE Press (2012)

  30. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)

  31. Morio, J., Pastel, R., Le Gland, F.: An overview of importance splitting for rare event simulation. Eur. J. Phys. 31(5), 1295 (2010)

  32. Nazari, M., Oroojlooy, A., Snyder, L.V., Takác, M.: Reinforcement learning for solving the vehicle routing problem. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 9861–9871 (2018)

  33. Parker-Holder, J., et al.: Evolving curricula with regret-based environment design. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 17473–17498. PMLR (2022)

  34. Raileanu, R., Rocktäschel, T.: RIDE: rewarding impact-driven exploration for procedurally-generated environments. In: Proceedings of the 8th International Conference on Learning Representations (ICLR). OpenReview (2020)

  35. Riedmiller, M.A., et al.: Learning by playing - solving sparse reward tasks from scratch. In: Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 4341–4350. PMLR (2018)

  36. Sallab, A.E., Abdou, M., Perot, E., Yogamani, S.: Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017(19), 70–76 (2017)

  37. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. In: Proceedings of the 4th International Conference on Learning Representations (ICLR) (2016)

  38. Schwartz, A.: A reinforcement learning method for maximizing undiscounted rewards. In: Proceedings of the 10th International Conference on Machine Learning (ICML), pp. 298–305. Morgan Kaufmann (1993)

  39. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  40. Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144 (2018)

  41. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  42. Stooke, A., Abbeel, P.: rlpyt: a research code base for deep reinforcement learning in PyTorch. arXiv preprint arXiv:1909.01500 (2019)

  43. Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction, Adaptive Computation and Machine Learning. MIT Press (1998)

Acknowledgments

This work was partially funded by the European Union’s Horizon Europe Research and Innovation program under the grant agreement TUPLES No 101070149, by the German Research Foundation (DFG) under grant No. 389792660, as part of TRR 248, see https://perspicuous-computing.science, by the German Research Foundation (DFG) - GRK 2853/1 “Neuroexplicit Models of Language, Vision, and Action” - project number 471607914, and by the European Regional Development Fund (ERDF) and the Saarland within the scope of (To)CERTAIN.

We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions.

Author information

Corresponding author

Correspondence to Timo P. Gros.

Appendices

A Benchmark Statistics

Racetrack. In Racetrack, the number of states is determined by all possible positions on the map together with the maximal achievable velocity. As most of the states cannot be reached with maximal velocity, this is an upper bound rather than an exact number.

[Figure a: Racetrack benchmark statistics]

MiniGrid. In MiniGrid, the number of states is given by all possible combinations of the agent's position, its direction, whether the door is open, and the positions of the moving obstacles. The latter is responsible for the huge state space even for relatively small maps. For DynObsDoor, we have \(\sim 3.04 \cdot 10^{10}\) states.
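For intuition, the counting argument is a product of independent factors; the following sketch uses made-up map dimensions rather than the actual DynObsDoor layout, so it illustrates the structure of the computation, not the reported \(\sim 3.04 \cdot 10^{10}\) figure.

```python
from math import comb

def minigrid_state_bound(free_cells, num_obstacles, door_states=2, directions=4):
    """Upper bound on MiniGrid states as a product of independent factors:
    agent position x agent direction x door open/closed x obstacle placements.

    free_cells and num_obstacles are illustrative inputs; obstacle placements
    are counted here as unordered cell combinations, which is what blows up
    the state space even on relatively small maps.
    """
    agent_configs = free_cells * directions
    obstacle_configs = comb(free_cells, num_obstacles)
    return agent_configs * door_states * obstacle_configs

# Hypothetical example: 100 free cells and 6 moving obstacles
# minigrid_state_bound(100, 6) == 100 * 4 * 2 * comb(100, 6)
```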

B Pseudo-code

The components of the RARE algorithm are highlighted in the pseudo-code listing below.

[Figure c: RARE pseudo-code with the RARE-specific components highlighted]
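Since the listing itself is only available as a figure, the following is a hedged sketch of the loop structure described in the abstract and in notes 1 and 2; all callables (`reset_env`, `restore`, `train_steps`, `estimate_regret`) are placeholder interfaces, not the chapter's API.

```python
import random

def rare_style_loop(reset_env, restore, train_steps, estimate_regret,
                    num_stages=10, restarts_per_stage=50):
    """Alternate learning stages (restarting from archived states) with
    evaluation stages that refresh the archive and the regret estimates.

    reset_env()            -> an initial state
    restore(state)         -> restart an episode in an archived state (note 1)
    train_steps(state)     -> run one training episode, return visited states
    estimate_regret(state) -> regret estimate obtained in the evaluation stage
    """
    archive = [reset_env()]          # A_j: restart states for this learning stage
    weights = [1.0]

    for _ in range(num_stages):
        archive_next = []            # A_{j+1}: collected for the next stage (note 2)

        # Learning stage: restart episodes from regret-weighted archive states.
        for _ in range(restarts_per_stage):
            start = random.choices(archive, weights=weights, k=1)[0]
            archive_next.extend(train_steps(restore(start)))

        # Evaluation stage: estimate regret for the newly collected states and
        # hand them over as the archive of the next learning stage.
        new_weights = [estimate_regret(s) for s in archive_next]
        if archive_next and sum(new_weights) > 0:
            archive, weights = archive_next, new_weights
    return archive, weights
```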

C Hyperparameters

Hyperparameters that are used in multiple algorithms but listed only once have the same value in all instances.

[Figure d: hyperparameter table]

DQN:

[Figure e: DQN hyperparameters]

DQNPR:

[Figure f: DQNPR hyperparameters]

RAREID and RAREPR:

[Figure g: RAREID and RAREPR hyperparameters]

Go-Explore:

[Figure h: Go-Explore hyperparameters]

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Gros, T.P., Müller, N.J., Höller, D., Wolf, V. (2025). Safe Reinforcement Learning Through Regret and State Restorations in Evaluation Stages. In: Jansen, N., et al. Principles of Verification: Cycling the Probabilistic Landscape. Lecture Notes in Computer Science, vol 15262. Springer, Cham. https://doi.org/10.1007/978-3-031-75778-5_2

  • DOI: https://doi.org/10.1007/978-3-031-75778-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-75777-8

  • Online ISBN: 978-3-031-75778-5

  • eBook Packages: Computer Science, Computer Science (R0)
