Abstract
Deep reinforcement learning (DRL) has succeeded tremendously in many complex decision-making tasks. However, for many real-world applications, standard DRL training results in agents with brittle performance because, in particular for safety-critical problems, discovering strategies that are both safe and successful is very challenging. Various exploration strategies have been proposed to address this problem. However, they do not take information about the current safety performance into account and thus fail to systematically focus on the parts of the state space most relevant for training. Here, we propose regret and state restoration in evaluation-based deep reinforcement learning (RARE), a framework that introduces two innovations: (i) it combines safety evaluation stages with state restorations, i.e., restarting episodes in formerly visited states, and (ii) it exploits estimations of the regret, i.e., the gap between the policy's current and optimal performance. We show that both innovations are beneficial and that RARE outperforms baselines such as deep Q-learning and Go-Explore in an empirical evaluation.
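For reference, the regret notion mentioned above is commonly formalized per state as the gap between the optimal value and the value achieved by the current policy; the following is this standard textbook definition, not necessarily the exact estimator used in the chapter: \(\mathrm{Regret}(s) = V^{*}(s) - V^{\pi_\theta}(s)\), where \(V^{*}\) denotes the optimal state value and \(V^{\pi_\theta}\) the value of the current policy \(\pi_\theta\). Since neither value is known exactly, RARE works with estimations of this gap.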
Notes
- 1.
The goal-conditioned policy is tasked with reaching the sampled state s, after which the training continues with the regular policy \(\pi _\theta \). We follow the concept introduced by the Go-Explore algorithm [12].
- 2.
Note that the learning stage uses the states from the current archive \(\mathscr {A}_j\) for restarting episodes. However, archive \(\mathscr {A}_{j+1}\) contains the states currently stored for the next evaluation stage.
- 3.
The application of the Euclidean distance assumes that the state representation is based on physical attributes, such as coordinates or velocities, which is often the case in RL benchmarks. This method would also be effective for image-based state representations. Alternative distance metrics might need to be considered if this assumption is not fulfilled. A minimal sketch of such a distance check is given after these notes.
- 4.
Otherwise we report the training as failed.
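The following minimal sketch illustrates the Euclidean-distance check referred to in note 3. The function name, the threshold value, and the archive layout are illustrative assumptions, not details from the chapter.

```python
import numpy as np

def closest_archive_state(state, archive, threshold=0.5):
    """Return the index of the archived state closest to `state` under the
    Euclidean distance, or None if no archived state is within `threshold`.

    Assumes states are vectors of physical attributes (coordinates,
    velocities); the threshold is a placeholder, not a value from the chapter.
    """
    if len(archive) == 0:
        return None
    archive_arr = np.asarray(archive, dtype=float)
    state_arr = np.asarray(state, dtype=float)
    dists = np.linalg.norm(archive_arr - state_arr, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= threshold else None
```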
References
Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), pp. 2669–2678. AAAI Press (2018)
Amit, R., Meir, R., Ciosek, K.: Discount factor as a regularizer in reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 269–278. PMLR (2020)
Anderson, G., Chaudhuri, S., Dillig, I.: Guiding safe exploration with weakest preconditions. In: The Eleventh International Conference on Learning Representations (2022)
Andrychowicz, M., et al.: Hindsight experience replay. In: Advances in Neural Information Processing Systems, pp. 5048–5058 (2017)
Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger, S.: Run-time optimization for learned controllers through quantitative games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36
Azar, M.G., Osband, I., Munos, R.: Minimax regret bounds for reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 263–272. PMLR (2017)
Baier, C., Christakis, M., Gros, T.P., Groß, D., Gumhold, S., Hermanns, H., Hoffmann, J., Klauck, M.: Lab conditions for research on explainable automated decisions. In: Heintz, F., Milano, M., O’Sullivan, B. (eds.) TAILOR 2020. LNCS (LNAI), vol. 12641, pp. 83–90. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73959-1_8
Bharadhwaj, H., Kumar, A., Rhinehart, N., Levine, S., Shkurti, F., Garg, A.: Conservative safety critics for exploration. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)
Burda, Y., Edwards, H., Storkey, A.J., Klimov, O.: Exploration by random network distillation. In: Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview (2019)
Campero, A., Raileanu, R., Küttler, H., Tenenbaum, J.B., Rocktäschel, T., Grefenstette, E.: Learning with AMIGo: adversarially motivated intrinsic goals. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)
Chevalier-Boisvert, M., et al.: BabyAI: a platform to study the sample efficiency of grounded language learning. In: Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview (2019)
Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K.O., Clune, J.: First return, then explore. Nature 590(7847), 580–586 (2021)
Flet-Berliac, Y., Ferret, J., Pietquin, O., Preux, P., Geist, M.: Adversarially guided actor-critic. In: Proceedings of the 9th International Conference on Learning Representations (ICLR). OpenReview (2021)
Fujita, Y., Nagarajan, P., Kataoka, T., Ishikawa, T.: ChainerRL: a deep reinforcement learning library. J. Mach. Learn. Res. 22, 77:1–77:14 (2021)
García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)
Gros, T.P., et al.: DSMC evaluation stages: fostering robust and safe behavior in deep reinforcement learning - extended version. ACM Trans. Model. Comput. Simulat. 33(4), 17:1–17:28 (2023). https://doi.org/10.1145/3607198
Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Köhl, M.A., Wolf, V.: MoGym: using formal models for training and verifying decision-making agents. In: Shoham, S., Vizel, Y. (eds.) CAV 2022, Part II, pp. 430–443. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13188-2_21
Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Steinmetz, M.: Deep statistical model checking. In: Gotsman, A., Sokolova, A. (eds.) FORTE 2020. LNCS, vol. 12136, pp. 96–114. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50086-3_6
Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Steinmetz, M.: Analyzing neural network behavior through deep statistical model checking. Int. J. Softw. Tools Technol. Transfer 25(3), 407–426 (2023)
Gros, T.P., Höller, D., Hoffmann, J., Klauck, M., Meerkamp, H., Wolf, V.: DSMC evaluation stages: fostering robust and safe behavior in deep reinforcement learning. In: Abate, A., Marin, A. (eds.) QEST 2021. LNCS, vol. 12846, pp. 197–216. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85172-9_11
Gros, T.P., Höller, D., Hoffmann, J., Wolf, V.: Tracking the race between deep reinforcement learning and imitation learning. In: Gribaudo, M., Jansen, D.N., Remke, A. (eds.) QEST 2020. LNCS, vol. 12289, pp. 11–17. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59854-9_2
Gu, S., Holly, E., Lillicrap, T.P., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. IEEE Press (2017)
Hare, J.: Dealing with sparse rewards in reinforcement learning. arXiv preprint arXiv:1910.09281 (2019)
Hasanbeig, M., Abate, A., Kroening, D.: Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099 (2018)
Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields. In: Proceedings of the 31st International Conference on Concurrency Theory (CONCUR), pp. 3:1–3:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
Jegourel, C., Legay, A., Sedwards, S.: Importance splitting for statistical model checking rare properties. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 576–591. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39799-8_38
Jiang, M., Dennis, M., Parker-Holder, J., Foerster, J.N., Grefenstette, E., Rocktäschel, T.: Replay-guided adversarial environment design. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 1884–1897 (2021)
Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. arXiv preprint arXiv:1612.00796 (2016)
Knox, W.B., Stone, P.: Reinforcement learning from human reward: discounting in episodic tasks. In: Proceedings of the 21st IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 878–885. IEEE Press (2012)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
Morio, J., Pastel, R., Le Gland, F.: An overview of importance splitting for rare event simulation. Eur. J. Phys. 31(5), 1295 (2010)
Nazari, M., Oroojlooy, A., Snyder, L.V., Takác, M.: Reinforcement learning for solving the vehicle routing problem. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 9861–9871 (2018)
Parker-Holder, J., et al.: Evolving curricula with regret-based environment design. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 17473–17498. PMLR (2022)
Raileanu, R., Rocktäschel, T.: RIDE: rewarding impact-driven exploration for procedurally-generated environments. In: Proceedings of the 8th International Conference on Learning Representations (ICLR). OpenReview (2020)
Riedmiller, M.A., et al.: Learning by playing solving sparse reward tasks from scratch. In: Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 4341–4350. PMLR (2018)
Sallab, A.E., Abdou, M., Perot, E., Yogamani, S.: Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017(19), 70–76 (2017)
Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. In: Proceedings of the 4th International Conference on Learning Representations (ICLR) (2016)
Schwartz, A.: A reinforcement learning method for maximizing undiscounted rewards. In: Proceedings of the 10th International Conference on Machine Learning (ICML), pp. 298–305. Morgan Kaufmann (1993)
Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144 (2018)
Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)
Stooke, A., Abbeel, P.: rlpyt: a research code base for deep reinforcement learning in PyTorch. arXiv preprint arXiv:1909.01500 (2019)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)
Acknowledgments
This work was partially funded by the European Union’s Horizon Europe Research and Innovation program under the grant agreement TUPLES No 101070149, by the German Research Foundation (DFG) under grant No. 389792660, as part of TRR 248, see https://perspicuous-computing.science, by the German Research Foundation (DFG) - GRK 2853/1 “Neuroexplicit Models of Language, Vision, and Action” - project number 471607914, and by the European Regional Development Fund (ERDF) and the Saarland within the scope of (To)CERTAIN.
We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions.
Appendices
A Benchmark Statistics
Racetrack. In Racetrack, the number of states is determined by all possible positions on the map combined with the maximal achievable velocity. Since most positions cannot be reached at maximal velocity, this count is an upper bound rather than an exact number.
MiniGrid. In MiniGrid, the number of states is given by all possible combinations of the agent's position, its direction, whether the door is open, and the positions of the moving obstacles. The latter are responsible for the huge state space even for relatively small maps. For DynObsDoor, we have \(\sim 3.04 \cdot 10^{10}\) states.
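As a rough illustration of this counting argument, the sketch below multiplies the individual factors. The map size and obstacle count passed in are placeholders rather than the actual DynObsDoor parameters, and treating the obstacle placements as unordered is an assumption.

```python
from math import comb

def minigrid_state_bound(free_cells, n_obstacles):
    """Upper bound on the number of MiniGrid states, following the counting
    argument above: agent position x 4 facing directions x door open/closed
    x placements of the moving obstacles (assumed indistinguishable here).
    """
    agent = free_cells * 4                      # position and facing direction
    door = 2                                    # door open or closed
    obstacles = comb(free_cells, n_obstacles)   # unordered obstacle placements
    return agent * door * obstacles
```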
B Pseudo-code
The components of the RARE algorithm are highlighted in the pseudo-code.
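A rough Python-style sketch of how such a stage loop could look is given below. It is reconstructed only from the abstract and notes 1–2, so the callable interfaces, the proportional sampling of restart states, and the alternation of evaluation and learning stages are assumptions rather than the authors' listing; the RAREID and RAREPR variants from Appendix C are not distinguished here.

```python
import random

def rare_stage_loop(initial_state, evaluate, train_from, harvest,
                    n_stages, episodes_per_stage):
    """Illustrative reconstruction of a RARE-style stage loop (assumptions only).

    evaluate(state)   -> non-negative regret proxy for restarting in `state`
    train_from(state) -> runs one training episode restarted in `state` and
                         returns the states visited during that episode
    harvest(states)   -> selects which visited states to keep as restart
                         candidates for the next evaluation stage
    """
    archive = [initial_state]                    # archive A_0 of restart candidates
    for _ in range(n_stages):
        # Evaluation stage: score every archived state with the regret proxy.
        regrets = [evaluate(s) for s in archive]
        # Small offset keeps sampling well-defined if all estimates are zero.
        weights = [r + 1e-8 for r in regrets]

        next_archive = []                        # archive A_{j+1} (cf. note 2)
        for _ in range(episodes_per_stage):
            # Learning stage: restart preferentially in high-regret states.
            (start,) = random.choices(archive, weights=weights, k=1)
            visited = train_from(start)
            next_archive.extend(harvest(visited))
        archive = next_archive or archive        # fall back if nothing was kept
    return archive
```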
C Hyperparameters
Hyperparameters that are used by multiple algorithms but listed only once take the same value in all instances.
DQN:
DQNPR:
RAREID and RAREPR:
Go-Explore:
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Gros, T.P., Müller, N.J., Höller, D., Wolf, V. (2025). Safe Reinforcement Learning Through Regret and State Restorations in Evaluation Stages. In: Jansen, N., et al. Principles of Verification: Cycling the Probabilistic Landscape. Lecture Notes in Computer Science, vol 15262. Springer, Cham. https://doi.org/10.1007/978-3-031-75778-5_2
DOI: https://doi.org/10.1007/978-3-031-75778-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-75777-8
Online ISBN: 978-3-031-75778-5
eBook Packages: Computer Science, Computer Science (R0)