
Verified Probabilistic Policies for Deep Reinforcement Learning

  • Conference paper
  • NASA Formal Methods (NFM 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13260)

Abstract

Deep reinforcement learning is an increasingly popular technique for synthesising policies to control an agent’s interaction with its environment. There is also growing interest in formally verifying that such policies are correct and execute safely. Progress has been made in this area by building on existing work for verification of deep neural networks and of continuous-state dynamical systems. In this paper, we tackle the problem of verifying probabilistic policies for deep reinforcement learning, which are used to, for example, tackle adversarial environments, break symmetries and manage trade-offs. We propose an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy’s execution, and present techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement and probabilistic model checking. We implement our approach and illustrate its effectiveness on a selection of reinforcement learning benchmarks.
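To give a concrete feel for the kind of guarantee being computed, the following is a minimal sketch of bounded-horizon "max-max" reachability analysis on a small interval MDP, using value iteration with the standard greedy resolution of the probability intervals. It is purely illustrative: the data structures and names are assumptions, not the implementation or benchmarks used in the paper.

```python
# Minimal sketch (illustrative only): upper bound on the probability of reaching
# a fail state within k steps of an interval MDP, maximising over both the action
# choices and the transition probabilities permitted by the intervals.
#
# transitions[s][a] is a list of (successor, lower, upper) probability intervals,
# assumed feasible (lower bounds sum to at most 1, upper bounds to at least 1).

def maximise_over_intervals(succ_values, intervals):
    """Pick a distribution within the intervals that maximises the expected value."""
    probs = [lo for (_, lo, _) in intervals]
    remaining = 1.0 - sum(probs)
    # Give the remaining probability mass to the most valuable successors first.
    order = sorted(range(len(intervals)), key=lambda i: -succ_values[i])
    for i in order:
        _, lo, hi = intervals[i]
        extra = min(hi - lo, remaining)
        probs[i] += extra
        remaining -= extra
    return sum(p * v for p, v in zip(probs, succ_values))

def max_max_fail_prob(transitions, fail_states, init, k):
    """Upper bound on Pr(reach a fail state within k steps), max-max semantics."""
    states = list(transitions)
    value = {s: (1.0 if s in fail_states else 0.0) for s in states}
    for _ in range(k):
        new_value = {}
        for s in states:
            if s in fail_states:
                new_value[s] = 1.0
                continue
            best = 0.0
            for a, intervals in transitions[s].items():
                succ_values = [value[s2] for (s2, _, _) in intervals]
                best = max(best, maximise_over_intervals(succ_values, intervals))
            new_value[s] = best
        value = new_value
    return value[init]

# Toy abstraction: from s0, the single choice fails with probability in [0.1, 0.3].
example = {
    "s0": {"j0": [("fail", 0.1, 0.3), ("safe", 0.7, 0.9)]},
    "safe": {"stay": [("safe", 1.0, 1.0)]},
    "fail": {"stay": [("fail", 1.0, 1.0)]},
}
print(max_max_fail_prob(example, {"fail"}, "s0", k=5))  # 0.3
```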



Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 834115, FUN2MODEL).

Author information

Corresponding author

Correspondence to David Parker.

Appendix: Proof of Theorem 1

We provide here a proof of Theorem 1, from Sect. 3, which states that:

Given a state \(s\in S\) of an RL execution model DTMP, and an abstract state \(\hat{s}\in \hat{S}\) of the corresponding controller abstraction IMDP for which \(s\in \hat{s}\), we have:

$$ \mathit{Pr}_{s}(\Diamond^{\leqslant k}\mathit{fail}) \ \leqslant \ \mathit{Pr}_{\hat{s}}^{\max\max}(\Diamond^{\leqslant k}\mathit{fail}) $$

By the definition of \(\mathit{Pr}_{\hat{s}}^{\max\max}(\cdot)\), it suffices to show that there is some policy \(\sigma\) and some environment policy \(\tau\) in the IMDP such that:

$$\begin{aligned} \mathit{Pr}_{s}(\Diamond^{\leqslant k}\mathit{fail}) \ \leqslant \ \mathit{Pr}_{\hat{s}}^{\sigma,\tau}(\Diamond^{\leqslant k}\mathit{fail}) \end{aligned}$$
(2)

Recall that, in the construction of the IMDP (see Definition 7), an abstract state \(\hat{s}\) is associated with a partition of \(\hat{s}\) into subsets \(\hat{s}_j\), each of which is used to define the \(j\)-labelled choice in state \(\hat{s}\). Let \(\sigma\) be the policy that picks in each state \(s\) (regardless of history) the unique index \(j_s\) such that \(s\in \hat{s}_{j_s}\). Then, let \(\tau\) be the environment policy that selects the upper bound of the interval for every transition probability. We use the function \(\hat{\mathbf{P}}_{\tau}\) to denote the chosen probabilities, i.e., we have \(\hat{\mathbf{P}}_{\tau}(\hat{s},j_s,\hat{s}') = \hat{\mathbf{P}}_{U}(\hat{s},j_s,\hat{s}')\) for any \(\hat{s},j_s,\hat{s}'\).
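As a small illustration of this construction (hypothetical helper names, not code from the paper), the two policies can be written down directly: \(\sigma\) looks up which partition element the concrete state falls into, and \(\tau\) replaces every probability interval by its upper bound.

```python
# Sketch of the two policies fixed in the proof (illustrative names only).

def sigma(partition, s):
    """Return the unique index j_s such that the concrete state s lies in subset s_hat_{j_s}."""
    for j, subset in enumerate(partition):
        if s in subset:
            return j
    raise ValueError("state not covered by the partition of its abstract state")

def tau(interval):
    """Environment policy: resolve a probability interval to its upper bound."""
    lower, upper = interval
    return upper
```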

The probabilities \(\mathit{Pr}_{\hat{s}}^{\sigma,\tau}(\Diamond^{\leqslant k}\mathit{fail})\) for these policies, starting in \(\hat{s}\), are defined similarly to those for discrete-time Markov processes (see Sect. 2):

$$\begin{aligned} \mathit{Pr}_{\hat{s}}^{\sigma,\tau}(\Diamond^{\leqslant k}\mathit{fail}) = \left\{ \begin{array}{cl} 1 & \text{if } \hat{s}\models\mathit{fail} \\ 0 & \text{if } \hat{s}\not\models\mathit{fail} \wedge k=0 \\ \sum\limits_{\hat{s}'\in\mathrm{supp}(\hat{\mathbf{P}}_{\tau}(\hat{s},j_s,\cdot))}\hat{\mathbf{P}}_{\tau}(\hat{s},j_s,\hat{s}')\cdot\mathit{Pr}_{\hat{s}'}^{\sigma,\tau}(\Diamond^{\leqslant k-1}\mathit{fail}) & \text{otherwise.} \end{array}\right. \end{aligned}$$
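The recursion above translates directly into a short dynamic program over the finite abstract state space. The following sketch assumes the probabilities chosen by \(\tau\) are given explicitly as a table; the data structures are illustrative, not those of the paper's implementation.

```python
# Sketch (hypothetical data structures): the recursion above as a memoised
# dynamic program. P_tau[(s_hat, j)] maps each successor abstract state to the
# probability chosen by tau; choice[s_hat] is the index j_s fixed by sigma;
# fail_states marks the abstract states satisfying "fail".

def bounded_fail_prob(P_tau, choice, fail_states, s_hat, k, _memo=None):
    """Pr^{sigma,tau}(reach fail within k steps from s_hat)."""
    if _memo is None:
        _memo = {}
    if s_hat in fail_states:
        return 1.0
    if k == 0:
        return 0.0
    if (s_hat, k) not in _memo:
        successors = P_tau[(s_hat, choice[s_hat])]
        _memo[(s_hat, k)] = sum(
            p * bounded_fail_prob(P_tau, choice, fail_states, s2, k - 1, _memo)
            for s2, p in successors.items())
    return _memo[(s_hat, k)]
```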

Since this is defined recursively, we prove (2) by induction over \(k\). For the case \(k=0\), the definitions of \(\mathit{Pr}_{s}(\Diamond^{\leqslant 0}\mathit{fail})\) and \(\mathit{Pr}_{\hat{s}}^{\sigma,\tau}(\Diamond^{\leqslant 0}\mathit{fail})\) coincide: they equal 1 if \(s\models\mathit{fail}\) (respectively, \(\hat{s}\models\mathit{fail}\)) and 0 otherwise. From Definition 7, \(s\models\mathit{fail}\) implies \(\hat{s}\models\mathit{fail}\). Therefore, \(\mathit{Pr}_{s}(\Diamond^{\leqslant 0}\mathit{fail}) \leqslant \mathit{Pr}_{\hat{s}}^{\sigma,\tau}(\Diamond^{\leqslant 0}\mathit{fail})\).

Next, for the inductive step, we assume as the inductive hypothesis that \(\mathit{Pr}_{s'}(\Diamond^{\leqslant k-1}\mathit{fail}) \leqslant \mathit{Pr}_{\hat{s}'}^{\sigma,\tau}(\Diamond^{\leqslant k-1}\mathit{fail})\) for all \(s'\in S\) and \(\hat{s}'\in \hat{S}\) with \(s'\in \hat{s}'\). If \(\hat{s}\models\mathit{fail}\), then \(\mathit{Pr}_{\hat{s}}^{\sigma,\tau}(\Diamond^{\leqslant k}\mathit{fail})=1 \geqslant \mathit{Pr}_{s}(\Diamond^{\leqslant k}\mathit{fail})\). Otherwise we have:

$$\begin{array}{rcll}
 & & \mathit{Pr}_{\hat{s}}^{\sigma,\tau}(\Diamond^{\leqslant k}\mathit{fail}) & \\
 & = & \sum\nolimits_{\hat{s}'\in\mathrm{supp}(\hat{\mathbf{P}}_{\tau}(\hat{s},j_s,\cdot))}\hat{\mathbf{P}}_{\tau}(\hat{s},j_s,\hat{s}')\cdot\mathit{Pr}_{\hat{s}'}(\Diamond^{\leqslant k-1}\mathit{fail}) & \text{by defn. of } \sigma \text{ and } \mathit{Pr}_{\hat{s}}^{\sigma,\tau}(\Diamond^{\leqslant k}\mathit{fail}) \\
 & = & \sum\nolimits_{\hat{s}'\in\mathrm{supp}(\hat{\mathbf{P}}_{U}(\hat{s},j_s,\cdot))}\hat{\mathbf{P}}_{U}(\hat{s},j_s,\hat{s}')\cdot\mathit{Pr}_{\hat{s}'}(\Diamond^{\leqslant k-1}\mathit{fail}) & \text{by defn. of } \tau \\
 & = & \sum\nolimits_{a\in A}\pi_U(\hat{s},a)\cdot\mathit{Pr}_{\hat{E}(\hat{s}_j,a)}(\Diamond^{\leqslant k-1}\mathit{fail}) & \text{by defn. of } \hat{\mathbf{P}}_{U}(\hat{s},j,\hat{s}') \\
 & \geqslant & \sum\nolimits_{a\in A}\pi(s,a)\cdot\mathit{Pr}_{\hat{E}(\hat{s}_j,a)}(\Diamond^{\leqslant k-1}\mathit{fail}) & \text{since } s\in\hat{s} \text{ and by Defn. 6} \\
 & \geqslant & \sum\nolimits_{a\in A}\pi(s,a)\cdot\mathit{Pr}_{E(s,a)}(\Diamond^{\leqslant k-1}\mathit{fail}) & \text{by induction and since, by Defn. 5, } E(s,w)\in\hat{E}(\hat{s}_j,w) \\
 & = & \sum\nolimits_{s'\in\mathrm{supp}(\mathbf{P}(s,\cdot))}\mathbf{P}(s,s')\cdot\mathit{Pr}_{s'}(\Diamond^{\leqslant k-1}\mathit{fail}) & \text{by defn. of } \mathbf{P}(s,s') \\
 & = & \mathit{Pr}_{s}(\Diamond^{\leqslant k}\mathit{fail}) & \text{by defn. of } \mathit{Pr}_{s}(\Diamond^{\leqslant k}\mathit{fail})
\end{array}$$

which completes the proof.
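As a quick numerical sanity check of this inequality (with purely illustrative numbers, not taken from the paper's benchmarks), compare a concrete state that fails with probability 0.25 per step against an abstraction whose interval \([0.2, 0.3]\) is resolved to its upper bound by \(\tau\):

```python
# Illustrative numbers only: a self-loop chain that fails with probability p per
# step has bounded failure probability 1 - (1 - p)^k within k steps.
def fail_within(k, p):
    return 1.0 - (1.0 - p) ** k

for k in range(1, 6):
    concrete, abstract = fail_within(k, 0.25), fail_within(k, 0.30)
    assert concrete <= abstract  # the inequality of Theorem 1 on this toy example
    print(f"k={k}: {concrete:.4f} <= {abstract:.4f}")
```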


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Bacci, E., Parker, D. (2022). Verified Probabilistic Policies for Deep Reinforcement Learning. In: Deshmukh, J.V., Havelund, K., Perez, I. (eds) NASA Formal Methods. NFM 2022. Lecture Notes in Computer Science, vol 13260. Springer, Cham. https://doi.org/10.1007/978-3-031-06773-0_10


  • DOI: https://doi.org/10.1007/978-3-031-06773-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06772-3

  • Online ISBN: 978-3-031-06773-0

  • eBook Packages: Computer Science, Computer Science (R0)
