Abstract
Safe exploration aims to address the limitations of Reinforcement Learning (RL) in safety-critical scenarios, where failures during trial-and-error learning may incur high costs. Several methods exist that incorporate external knowledge or use proximal sensor data to limit the exploration of unsafe states. However, reducing exploration risks in unknown environments, where an agent must discover safety threats during exploration, remains challenging.
In this paper, we target the problem of safe exploration by guiding the training with counterexamples of the safety requirement. Our method abstracts both continuous and discrete state-space systems into compact abstract models representing the safety-relevant knowledge acquired by the agent during exploration. We then exploit probabilistic counterexample generation to construct minimal simulation submodels eliciting safety requirement violations, where the agent can efficiently train offline to refine its policy towards minimising the risk of safety violations during the subsequent online exploration.
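The following is a minimal, self-contained Python sketch of this guidance loop under simplifying assumptions: a discrete abstract model, the Markov chain induced by the agent's current policy, and loop-free enumeration of the most probable violating paths in the spirit of smallest-counterexample generation. All identifiers are illustrative and do not come from our prototype, which is linked under Data availability below.

```python
# Simplified sketch of the counterexample-guidance loop (hypothetical names;
# see the repository linked under "Data availability" for the actual code).
import heapq
from collections import defaultdict

def induced_chain(transitions):
    """Abstract logged (state, next_state) pairs observed under the current
    policy into an empirical Markov chain with estimated probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, s2 in transitions:
        counts[s][s2] += 1
    return {s: {s2: n / sum(succ.values()) for s2, n in succ.items()}
            for s, succ in counts.items()}

def counterexample_submodel(chain, init, unsafe, bound):
    """Collect the most probable loop-free paths from `init` to an unsafe
    state until their combined probability exceeds `bound`, witnessing a
    violation of P(reach unsafe) <= bound; returns the witness's states."""
    heap = [(-1.0, (init,))]            # max-heap keyed on path probability
    mass, witness = 0.0, set()
    while heap and mass <= bound:
        neg_p, path = heapq.heappop(heap)
        p, s = -neg_p, path[-1]
        if s in unsafe:                 # path reaches the unsafe region
            mass += p
            witness.update(path)
            continue
        for s2, q in chain.get(s, {}).items():
            if s2 not in path:          # loop-free paths keep the search finite
                heapq.heappush(heap, (-(p * q), path + (s2,)))
    return witness if mass > bound else set()

# Example: the single most probable path 0 -> 1 -> 3 to the unsafe state 3
# has probability (2/3) * (1/2) = 1/3 > 0.3 and alone witnesses a violation.
logs = [(0, 1), (0, 1), (0, 2), (1, 3), (1, 0), (2, 3), (2, 0)]
submodel = counterexample_submodel(induced_chain(logs), 0, {3}, 0.3)
print(submodel)  # e.g. {0, 1, 3}: offline refinement is confined to these states
```

Restricting the search to loop-free paths keeps the enumeration finite, and since distinct collected paths induce disjoint cylinder sets, the accumulated mass is a lower bound on the true reachability probability, so any path set exceeding the bound is a sound witness of the violation.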
In preliminary experiments, our method reduces safety violations during online exploration by an average of 40.3% compared with standard Q-learning (QL) and deep Q-network (DQN) algorithms, and by 29.1% compared with previous related work, while achieving cumulative rewards comparable to unrestricted exploration and alternative approaches.
Data availability
A prototype Python implementation of our method is available on GitHub: https://github.com/xtji/CEX-guided-RL.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ji, X., Filieri, A. (2023). Probabilistic Counterexample Guidance for Safer Reinforcement Learning. In: Jansen, N., Tribastone, M. (eds) Quantitative Evaluation of Systems. QEST 2023. Lecture Notes in Computer Science, vol 14287. Springer, Cham. https://doi.org/10.1007/978-3-031-43835-6_22
DOI: https://doi.org/10.1007/978-3-031-43835-6_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43834-9
Online ISBN: 978-3-031-43835-6