Abstract
Safe exploration aims to address the limitations of Reinforcement Learning (RL) in safety-critical scenarios, where failures during trial-and-error learning may incur high costs. Several methods exist that incorporate external knowledge or use proximal sensor data to limit the exploration of unsafe states. However, reducing exploration risks in unknown environments, where an agent must discover safety threats during exploration, remains challenging.
In this paper, we target the problem of safe exploration by guiding the training with counterexamples of the safety requirement. Our method abstracts both continuous and discrete state-space systems into compact abstract models representing the safety-relevant knowledge acquired by the agent during exploration. We then exploit probabilistic counterexample generation to construct minimal simulation submodels eliciting safety requirement violations, where the agent can efficiently train offline to refine its policy towards minimising the risk of safety violations during the subsequent online exploration.
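The following is a minimal, self-contained Python sketch of this guidance loop under simplifying assumptions: a discrete abstract model, the Markov chain induced by the agent's current policy, and loop-free enumeration of the most probable violating paths in the spirit of smallest-counterexample generation. All identifiers are illustrative and do not come from our prototype, which is linked under Data availability below.

```python
# Simplified sketch of the counterexample-guidance loop (hypothetical names;
# see the repository linked under "Data availability" for the actual code).
import heapq
from collections import defaultdict

def induced_chain(transitions):
    """Abstract logged (state, next_state) pairs observed under the current
    policy into an empirical Markov chain with estimated probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, s2 in transitions:
        counts[s][s2] += 1
    return {s: {s2: n / sum(succ.values()) for s2, n in succ.items()}
            for s, succ in counts.items()}

def counterexample_submodel(chain, init, unsafe, bound):
    """Collect the most probable loop-free paths from `init` to an unsafe
    state until their combined probability exceeds `bound`, witnessing a
    violation of P(reach unsafe) <= bound; returns the witness's states."""
    heap = [(-1.0, (init,))]            # max-heap keyed on path probability
    mass, witness = 0.0, set()
    while heap and mass <= bound:
        neg_p, path = heapq.heappop(heap)
        p, s = -neg_p, path[-1]
        if s in unsafe:                 # path reaches the unsafe region
            mass += p
            witness.update(path)
            continue
        for s2, q in chain.get(s, {}).items():
            if s2 not in path:          # loop-free paths keep the search finite
                heapq.heappush(heap, (-(p * q), path + (s2,)))
    return witness if mass > bound else set()

# Example: the single most probable path 0 -> 1 -> 3 to the unsafe state 3
# has probability (2/3) * (1/2) = 1/3 > 0.3 and alone witnesses a violation.
logs = [(0, 1), (0, 1), (0, 2), (1, 3), (1, 0), (2, 3), (2, 0)]
submodel = counterexample_submodel(induced_chain(logs), 0, {3}, 0.3)
print(submodel)  # e.g. {0, 1, 3}: offline refinement is confined to these states
```

Restricting the search to loop-free paths keeps the enumeration finite, and since distinct collected paths induce disjoint cylinder sets, the accumulated mass is a lower bound on the true reachability probability, so any path set exceeding the bound is a sound witness of the violation.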
In preliminary experiments, our method reduces safety violations during online exploration by an average of 40.3% compared with standard Q-learning (QL) and deep Q-network (DQN) algorithms, and by 29.1% compared with previous related work, while achieving cumulative rewards comparable to unrestricted exploration and alternative approaches.
Data availability
A prototype Python implementation of our method is available on GitHub: https://github.com/xtji/CEX-guided-RL.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ji, X., Filieri, A. (2023). Probabilistic Counterexample Guidance for Safer Reinforcement Learning. In: Jansen, N., Tribastone, M. (eds) Quantitative Evaluation of Systems. QEST 2023. Lecture Notes in Computer Science, vol 14287. Springer, Cham. https://doi.org/10.1007/978-3-031-43835-6_22
DOI: https://doi.org/10.1007/978-3-031-43835-6_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43834-9
Online ISBN: 978-3-031-43835-6