Safe Reinforcement Learning Through Regret and State Restorations in Evaluation Stages

  • Chapter in: Principles of Verification: Cycling the Probabilistic Landscape

Abstract

Deep reinforcement learning (DRL) has succeeded tremendously in many complex decision-making tasks. However, for many real-world applications, standard DRL training results in agents with brittle performance because, in particular for safety-critical problems, the discovery of strategies that are both safe and successful is very challenging. Various exploration strategies have been proposed to address this problem. However, they do not take information about the current safety performance into account; thus, they fail to systematically focus on the parts of the state space most relevant for training. Here, we propose regret and state restoration in evaluation-based deep reinforcement learning (RARE), a framework that introduces two innovations: (i) it combines safety evaluation stages with state restorations, i.e., restarting episodes in formerly visited states, and (ii) it exploits estimations of the regret, i.e., the gap between the policy's current and optimal performance. We show that both innovations are beneficial and that RARE outperforms baselines such as deep Q-learning and Go-Explore in an empirical evaluation.
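To make innovation (ii) concrete, the following is a minimal sketch, not the authors' implementation, of how per-state regret estimates could be turned into a restart distribution over archived states; `estimate_value` and the return bound `v_max` are illustrative assumptions.

```python
import numpy as np

def regret_restart_distribution(archive, estimate_value, v_max):
    """Turn regret estimates into restart probabilities over archived states.

    estimate_value(s): estimated return of the current policy from state s
    v_max:             assumed upper bound on the achievable return, used as
                       a stand-in for the unknown optimal value
    """
    # Regret = gap between (a proxy for) the optimal return and the current one.
    regrets = np.array([max(v_max - estimate_value(s), 0.0) for s in archive])
    if regrets.sum() == 0.0:
        # No estimated regret anywhere: fall back to uniform restarts.
        return np.full(len(archive), 1.0 / len(archive))
    return regrets / regrets.sum()

# Usage (hypothetical value_fn): sample restart states in proportion to regret.
# rng = np.random.default_rng(0)
# idx = rng.choice(len(archive), p=regret_restart_distribution(archive, value_fn, v_max=1.0))
# start_state = archive[idx]
```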

Notes

  1. The goal-conditioned policy is tasked with reaching the sampled state s, after which the training continues with the regular policy \(\pi _\theta \). We follow the concept introduced by the Go-Explore algorithm [12] (see the sketch after these notes).

  2. Note that the learning stage uses the states from the current archive \(\mathscr {A}_j\) for restarting episodes, whereas archive \(\mathscr {A}_{j+1}\) collects the states stored for the next evaluation stage.

  3. The application of the Euclidean distance assumes that the state representation is based on physical attributes, such as coordinates or velocities, as is often the case in RL benchmarks; the sketch after these notes illustrates such a check. This method would also be effective for image-based state representations. Alternative distance metrics might need to be considered if this assumption is not fulfilled.

  4. Otherwise, we report the training as failed.
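To illustrate notes 1 and 3 together (as referenced there), here is a minimal sketch of the restoration step, assuming a Gymnasium-style environment; the optional `restore` method, the goal-conditioned policy `pi_goal`, and the tolerance are hypothetical names, not the chapter's interface.

```python
import numpy as np

def close(a, b, tol=1e-3):
    """Euclidean closeness check on physical state vectors (note 3)."""
    return np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) < tol

def restore_state(env, target_state, pi_goal, max_steps=200):
    """Restart an episode in (or near) a previously visited state (note 1).

    Either the simulator exposes a deterministic restore, or a goal-conditioned
    policy is rolled out until it reaches the target, as in Go-Explore [12];
    afterwards training continues with the regular policy pi_theta.
    """
    if hasattr(env, "restore"):                   # assumed simulator feature
        return env.restore(target_state)

    state, _ = env.reset()
    for _ in range(max_steps):
        if close(state, target_state):
            break
        action = pi_goal(state, target_state)     # policy conditioned on the goal
        state, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            state, _ = env.reset()
    return state
```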

References

  1. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), pp. 2669–2678. AAAI Press (2018)

  2. Amit, R., Meir, R., Ciosek, K.: Discount factor as a regularizer in reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 269–278. PMLR (2020)

  3. Anderson, G., Chaudhuri, S., Dillig, I.: Guiding safe exploration with weakest preconditions. In: The Eleventh International Conference on Learning Representations (2022)

  4. Andrychowicz, M., et al.: Hindsight experience replay. In: Advances in Neural Information Processing Systems, pp. 5048–5058 (2017)

  5. Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger, S.: Run-time optimization for learned controllers through quantitative games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36

  6. Azar, M.G., Osband, I., Munos, R.: Minimax regret bounds for reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 263–272. PMLR (2017)

  7. Baier, C., Christakis, M., Gros, T.P., Groß, D., Gumhold, S., Hermanns, H., Hoffmann, J., Klauck, M.: Lab conditions for research on explainable automated decisions. In: Heintz, F., Milano, M., O’Sullivan, B. (eds.) TAILOR 2020. LNCS (LNAI), vol. 12641, pp. 83–90. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73959-1_8

  8. Bharadhwaj, H., Kumar, A., Rhinehart, N., Levine, S., Shkurti, F., Garg, A.: Conservative safety critics for exploration. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)

  9. Burda, Y., Edwards, H., Storkey, A.J., Klimov, O.: Exploration by random network distillation. In: Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview (2019)

  10. Campero, A., Raileanu, R., Küttler, H., Tenenbaum, J.B., Rocktäschel, T., Grefenstette, E.: Learning with AMIGo: adversarially motivated intrinsic goals. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)

  11. Chevalier-Boisvert, M., et al.: BabyAI: a platform to study the sample efficiency of grounded language learning. In: Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview (2019)

  12. Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K.O., Clune, J.: First return, then explore. Nature 590(7847), 580–586 (2021)

  13. Flet-Berliac, Y., Ferret, J., Pietquin, O., Preux, P., Geist, M.: Adversarially guided actor-critic. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)

  14. Fujita, Y., Nagarajan, P., Kataoka, T., Ishikawa, T.: ChainerRL: a deep reinforcement learning library. J. Mach. Learn. Res. 22, 77:1–77:14 (2021)

  15. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)

  16. Gros, T.P., et al.: DSMC evaluation stages: fostering robust and safe behavior in deep reinforcement learning - extended version. ACM Trans. Model. Comput. Simulat. 33(4), 17:1–17:28 (2023). https://doi.org/10.1145/3607198

  17. Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Köhl, M.A., Wolf, V.: MoGym: using formal models for training and verifying decision-making agents. In: Shoham, S., Vizel, Y. (eds.) CAV 2022, Part II, pp. 430–443. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13188-2_21

  18. Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Steinmetz, M.: Deep statistical model checking. In: Gotsman, A., Sokolova, A. (eds.) FORTE 2020. LNCS, vol. 12136, pp. 96–114. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50086-3_6

  19. Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Steinmetz, M.: Analyzing neural network behavior through deep statistical model checking. Int. J. Softw. Tools Technol. Transfer 25(3), 407–426 (2023)

  20. Gros, T.P., Höller, D., Hoffmann, J., Klauck, M., Meerkamp, H., Wolf, V.: DSMC evaluation stages: fostering robust and safe behavior in deep reinforcement learning. In: Abate, A., Marin, A. (eds.) QEST 2021. LNCS, vol. 12846, pp. 197–216. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85172-9_11

  21. Gros, T.P., Höller, D., Hoffmann, J., Wolf, V.: Tracking the race between deep reinforcement learning and imitation learning. In: Gribaudo, M., Jansen, D.N., Remke, A. (eds.) QEST 2020. LNCS, vol. 12289, pp. 11–17. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59854-9_2

  22. Gu, S., Holly, E., Lillicrap, T.P., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. IEEE Press (2017)

  23. Hare, J.: Dealing with sparse rewards in reinforcement learning. arXiv preprint arXiv:1910.09281 (2019)

  24. Hasanbeig, M., Abate, A., Kroening, D.: Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099 (2018)

  25. Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields. In: Proceedings of the 31st International Conference on Concurrency Theory (CONCUR), pp. 3:1–3:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

  26. Jegourel, C., Legay, A., Sedwards, S.: Importance splitting for statistical model checking rare properties. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 576–591. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39799-8_38

  27. Jiang, M., Dennis, M., Parker-Holder, J., Foerster, J.N., Grefenstette, E., Rocktäschel, T.: Replay-guided adversarial environment design. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 1884–1897 (2021)

  28. Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. arXiv preprint arXiv:1612.00796 (2016)

  29. Knox, W.B., Stone, P.: Reinforcement learning from human reward: discounting in episodic tasks. In: Proceedings of the 21st IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 878–885. IEEE Press (2012)

  30. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)

  31. Morio, J., Pastel, R., Le Gland, F.: An overview of importance splitting for rare event simulation. Eur. J. Phys. 31(5), 1295 (2010)

  32. Nazari, M., Oroojlooy, A., Snyder, L.V., Takác, M.: Reinforcement learning for solving the vehicle routing problem. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 9861–9871 (2018)

  33. Parker-Holder, J., et al.: Evolving curricula with regret-based environment design. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 17473–17498. PMLR (2022)

  34. Raileanu, R., Rocktäschel, T.: RIDE: rewarding impact-driven exploration for procedurally-generated environments. In: Proceedings of the 8th International Conference on Learning Representations (ICLR). OpenReview (2020)

  35. Riedmiller, M.A., et al.: Learning by playing - solving sparse reward tasks from scratch. In: Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 4341–4350. PMLR (2018)

  36. Sallab, A.E., Abdou, M., Perot, E., Yogamani, S.: Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017(19), 70–76 (2017)

  37. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. In: Proceedings of the 4th International Conference on Learning Representations (ICLR) (2016)

  38. Schwartz, A.: A reinforcement learning method for maximizing undiscounted rewards. In: Proceedings of the 10th International Conference on Machine Learning (ICML), pp. 298–305. Morgan Kaufmann (1993)

  39. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  40. Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144 (2018)

  41. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  42. Stooke, A., Abbeel, P.: rlpyt: a research code base for deep reinforcement learning in PyTorch. arXiv preprint arXiv:1909.01500 (2019)

  43. Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction, Adaptive Computation and Machine Learning. MIT Press (1998)

Acknowledgments

This work was partially funded by the European Union’s Horizon Europe Research and Innovation program under the grant agreement TUPLES No 101070149, by the German Research Foundation (DFG) under grant No. 389792660, as part of TRR 248, see https://perspicuous-computing.science, by the German Research Foundation (DFG) - GRK 2853/1 “Neuroexplicit Models of Language, Vision, and Action” - project number 471607914, and by the European Regional Development Fund (ERDF) and the Saarland within the scope of (To)CERTAIN.

We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions.

Author information

Corresponding author

Correspondence to Timo P. Gros.

Appendices

A Benchmark Statistics

Racetrack. In Racetrack, the number of states is determined by all possible positions on the map together with the maximal achievable velocity. As most of the states cannot be reached with maximal velocity, this is an upper bound rather than an exact number.

[Figure a: Racetrack benchmark statistics]

MiniGrid. In MiniGrid, the number of states is given by all possible combinations of the agent's position, its direction, whether the door is open, and the positions of the moving obstacles. The latter is responsible for the huge state space even for relatively small maps. For DynObsDoor, we have \(\sim 3.04 \cdot 10^{10}\) states.
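For intuition, the counting argument is a product of independent factors; the following sketch uses made-up map dimensions rather than the actual DynObsDoor layout, so it illustrates the structure of the computation, not the reported \(\sim 3.04 \cdot 10^{10}\) figure.

```python
from math import comb

def minigrid_state_bound(free_cells, num_obstacles, door_states=2, directions=4):
    """Upper bound on MiniGrid states as a product of independent factors:
    agent position x agent direction x door open/closed x obstacle placements.

    free_cells and num_obstacles are illustrative inputs; obstacle placements
    are counted here as unordered cell combinations, which is what blows up
    the state space even on relatively small maps.
    """
    agent_configs = free_cells * directions
    obstacle_configs = comb(free_cells, num_obstacles)
    return agent_configs * door_states * obstacle_configs

# Hypothetical example: 100 free cells and 6 moving obstacles
# minigrid_state_bound(100, 6) == 100 * 4 * 2 * comb(100, 6)
```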

B Pseudo-code

The components of the RARE algorithm are highlighted in the pseudo-code listing below.

[Figure c: RARE pseudo-code with the RARE-specific components highlighted]
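Since the listing itself is only available as a figure, the following is a hedged sketch of the loop structure described in the abstract and in notes 1 and 2; all callables (`reset_env`, `restore`, `train_steps`, `estimate_regret`) are placeholder interfaces, not the chapter's API.

```python
import random

def rare_style_loop(reset_env, restore, train_steps, estimate_regret,
                    num_stages=10, restarts_per_stage=50):
    """Alternate learning stages (restarting from archived states) with
    evaluation stages that refresh the archive and the regret estimates.

    reset_env()            -> an initial state
    restore(state)         -> restart an episode in an archived state (note 1)
    train_steps(state)     -> run one training episode, return visited states
    estimate_regret(state) -> regret estimate obtained in the evaluation stage
    """
    archive = [reset_env()]          # A_j: restart states for this learning stage
    weights = [1.0]

    for _ in range(num_stages):
        archive_next = []            # A_{j+1}: collected for the next stage (note 2)

        # Learning stage: restart episodes from regret-weighted archive states.
        for _ in range(restarts_per_stage):
            start = random.choices(archive, weights=weights, k=1)[0]
            archive_next.extend(train_steps(restore(start)))

        # Evaluation stage: estimate regret for the newly collected states and
        # hand them over as the archive of the next learning stage.
        new_weights = [estimate_regret(s) for s in archive_next]
        if archive_next and sum(new_weights) > 0:
            archive, weights = archive_next, new_weights
    return archive, weights
```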

C Hyperparameters

Hyperparameters that are used in multiple algorithms but listed only once have the same value in all instances.

[Figure d: hyperparameter table]

DQN:

[Figure e: DQN hyperparameters]

DQNPR:

[Figure f: DQNPR hyperparameters]

RAREID and RAREPR:

[Figure g: RAREID and RAREPR hyperparameters]

Go-Explore:

[Figure h: Go-Explore hyperparameters]

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Gros, T.P., Müller, N.J., Höller, D., Wolf, V. (2025). Safe Reinforcement Learning Through Regret and State Restorations in Evaluation Stages. In: Jansen, N., et al. Principles of Verification: Cycling the Probabilistic Landscape. Lecture Notes in Computer Science, vol 15262. Springer, Cham. https://doi.org/10.1007/978-3-031-75778-5_2

  • DOI: https://doi.org/10.1007/978-3-031-75778-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-75777-8

  • Online ISBN: 978-3-031-75778-5

  • eBook Packages: Computer Science, Computer Science (R0)
