
Probabilistic Counterexample Guidance for Safer Reinforcement Learning

  • Conference paper
  • First Online:
Quantitative Evaluation of Systems (QEST 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14287)


Abstract

Safe exploration aims to address the limitations of Reinforcement Learning (RL) in safety-critical scenarios, where failures during trial-and-error learning may incur high costs. Several methods exist to incorporate external knowledge or to use proximal sensor data to limit the exploration of unsafe states. However, reducing exploration risks in unknown environments, where an agent must discover safety threats during exploration, remains challenging.

In this paper, we target the problem of safe exploration by guiding the training with counterexamples of the safety requirement. Our method abstracts both continuous and discrete state-space systems into compact abstract models representing the safety-relevant knowledge acquired by the agent during exploration. We then exploit probabilistic counterexample generation to construct minimal simulation submodels eliciting safety requirement violations, where the agent can efficiently train offline to refine its policy towards minimising the risk of safety violations during the subsequent online exploration.

In preliminary experiments, we demonstrate our method's effectiveness in reducing safety violations during online exploration by an average of 40.3% compared with standard QL and DQN algorithms and by 29.1% compared with previous related work, while achieving cumulative rewards comparable to unrestricted exploration and alternative approaches.
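To make the workflow described in the abstract concrete, the sketch below outlines one possible shape of a counterexample-guided training loop in Python. It is an illustrative approximation under strong simplifying assumptions, not the authors' implementation (a prototype is linked under Data availability below): the toy chain environment, the identity abstraction, and the threshold-based `counterexample` helper are hypothetical stand-ins for the probabilistic model checking and minimal-submodel extraction the paper relies on.

```python
# Illustrative sketch only: a tabular Q-learning loop that, between episodes,
# inspects an empirical abstract model of the explored dynamics and penalises
# (state, action) pairs whose estimated probability of reaching an unsafe state
# exceeds the safety threshold. All names and the toy environment are
# hypothetical; the paper instead extracts minimal counterexample submodels
# via probabilistic model checking and trains on them offline.
import random
from collections import defaultdict

ACTIONS = [0, 1]   # 0 = cautious step, 1 = fast but risky step
UNSAFE = {9}       # abstract state encoding a safety violation
GOAL = 5
P_MAX = 0.2        # safety requirement: P(reach UNSAFE) <= 0.2

def toy_step(state, action):
    """Toy chain environment: the risky action advances faster but may fail."""
    if action == 1 and random.random() < 0.3:
        return 9, -10.0, True                # unsafe terminal state
    nxt = state + (2 if action == 1 else 1)
    return (GOAL, 1.0, True) if nxt >= GOAL else (nxt, 0.0, False)

def abstract(state):
    """Placeholder abstraction (identity); the paper maps continuous or large
    discrete state spaces to a compact, safety-relevant abstract model."""
    return state

def counterexample(counts):
    """Stand-in for counterexample generation: (state, action) pairs whose
    empirical probability of reaching UNSAFE exceeds P_MAX."""
    cex = []
    for (s, a), succ in counts.items():
        p_unsafe = sum(n for s2, n in succ.items() if s2 in UNSAFE) / sum(succ.values())
        if p_unsafe > P_MAX:
            cex.append((s, a, p_unsafe))
    return cex

def offline_refine(q, cex, penalty=-1.0):
    """Offline refinement: penalise risky pairs so that subsequent online
    exploration steers away from likely safety violations."""
    for s, a, p in cex:
        q[(s, a)] += penalty * p

def train(episodes=300, alpha=0.1, gamma=0.99, eps=0.1):
    q = defaultdict(float)
    counts = defaultdict(lambda: defaultdict(int))   # empirical abstract model
    violations = 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = (random.choice(ACTIONS) if random.random() < eps
                 else max(ACTIONS, key=lambda x: q[(s, x)]))
            s2, r, done = toy_step(s, a)
            counts[(abstract(s), a)][abstract(s2)] += 1
            target = r + (0.0 if done else gamma * max(q[(s2, x)] for x in ACTIONS))
            q[(s, a)] += alpha * (target - q[(s, a)])
            violations += int(s2 in UNSAFE)
            s = s2
        offline_refine(q, counterexample(counts))    # guidance between episodes
    return q, violations

if __name__ == "__main__":
    _, v = train()
    print("online safety violations observed:", v)
```

In the paper, the analogous guidance step operates on a minimal submodel of the abstract model that already witnesses the safety violation, so offline refinement concentrates on exactly the behaviours responsible for the unsafe probability mass rather than on a simple per-pair threshold as in this sketch.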



Author information


Corresponding authors

Correspondence to Xiaotong Ji or Antonio Filieri.


Ethics declarations

Data availability

A prototype Python implementation of our method is available on GitHub: https://github.com/xtji/CEX-guided-RL.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ji, X., Filieri, A. (2023). Probabilistic Counterexample Guidance for Safer Reinforcement Learning. In: Jansen, N., Tribastone, M. (eds) Quantitative Evaluation of Systems. QEST 2023. Lecture Notes in Computer Science, vol 14287. Springer, Cham. https://doi.org/10.1007/978-3-031-43835-6_22


  • DOI: https://doi.org/10.1007/978-3-031-43835-6_22

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43834-9

  • Online ISBN: 978-3-031-43835-6

  • eBook Packages: Computer Science, Computer Science (R0)
