Abstract
Deep reinforcement learning is an increasingly popular technique for synthesising policies to control an agent’s interaction with its environment. There is also growing interest in formally verifying that such policies are correct and execute safely. Progress has been made in this area by building on existing work for verification of deep neural networks and of continuous-state dynamical systems. In this paper, we tackle the problem of verifying probabilistic policies for deep reinforcement learning, which are used to, for example, tackle adversarial environments, break symmetries and manage trade-offs. We propose an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy’s execution, and present techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement and probabilistic model checking. We implement our approach and illustrate its effectiveness on a selection of reinforcement learning benchmarks.
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 834115, FUN2MODEL).
Appendix: Proof of Theorem 1
We provide here a proof of Theorem 1, from Sect. 3, which states that:
Given a state \(s\in S\) of an RL execution model DTMP, and an abstract state \(\hat{s}\in \hat{S}\) of the corresponding controller abstraction IMDP for which \(s\in \hat{s}\), we have, for any \(k\in \mathbb {N}\):
\[ { Pr _{s}}(\Diamond ^{\leqslant k} fail ) \ \leqslant \ { Pr _{\hat{s}}^{\max \max }}(\Diamond ^{\leqslant k} fail ). \]
By the definition of \({ Pr _{\hat{s}}^{\max \max }}(\cdot )\), it suffices to show that there is some policy \(\sigma \) and some environment policy \(\tau \) in the IMDP such that:
\[ { Pr _{s}}(\Diamond ^{\leqslant k} fail ) \ \leqslant \ { Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail ). \qquad (2) \]
Recall that, in the construction of the IMDP (see Definition 7), an abstract state \(\hat{s}\) is associated with a partition of \(\hat{s}\) into subsets \(\hat{s}_j\), each of which is used to define the \(j\)-labelled choice in state \(\hat{s}\). Let \(\sigma \) be the policy that picks in each state \(s\) (regardless of history) the unique index \(j_s\) such that \(s\in \hat{s}_{j_s}\), and let \(\tau \) be the environment policy that selects the upper bound of the interval for every transition probability. We write \(\hat{{\mathbf {P}}}_{\tau }\) for the chosen probabilities, i.e., \(\hat{{\mathbf {P}}}_{\tau }(\hat{s},j_s,\hat{s}') = \hat{{\mathbf {P}}}_{U}(\hat{s},j_s,\hat{s}')\) for all \(\hat{s},j_s,\hat{s}'\).
The probabilities \({ Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail )\) for these policies, starting in \(\hat{s}\), are defined similarly to those for discrete-time Markov processes (see Sect. 2):
\[ { Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail ) \ = \ \left\{ \begin{array}{ll} 1 & \text {if } \hat{s}\models fail \\ 0 & \text {if } \hat{s}\not \models fail \text { and } k=0 \\ \sum _{\hat{s}'\in \hat{S}} \hat{{\mathbf {P}}}_{\tau }(\hat{s},j,\hat{s}')\cdot { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail ) & \text {otherwise,} \end{array}\right. \]
where \(j\) is the index chosen by \(\sigma \) in \(\hat{s}\).
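As an aside, the following Python sketch evaluates this recursion on a small hand-written IMDP under a fixed policy \(\sigma \) and the upper-bound environment policy \(\tau \) used in the proof. It is an illustration added here, not part of the paper's implementation; the toy model and all identifiers are hypothetical.

```python
# Illustration only (not from the paper): bounded-reachability probabilities in an
# IMDP under a fixed policy sigma and the environment policy tau from the proof,
# which always selects the upper interval bound. The toy model is hypothetical.
from functools import lru_cache

FAIL = {"s_bad"}  # abstract states labelled fail

# P_HAT[(s_hat, j)][s_hat'] = (lower, upper) interval for the j-labelled choice in s_hat
P_HAT = {
    ("s0", 0): {"s1": (0.6, 0.8), "s_bad": (0.2, 0.4)},
    ("s1", 0): {"s0": (0.5, 0.7), "s_bad": (0.3, 0.5)},
    ("s_bad", 0): {"s_bad": (1.0, 1.0)},
}

sigma = {"s0": 0, "s1": 0, "s_bad": 0}  # policy: chosen choice index in each abstract state

# tau picks the upper bound of every interval; as in the proof, these values need not
# sum to 1, which only makes the resulting bound more conservative.
tau = {key: {t: ub for t, (_lb, ub) in succs.items()} for key, succs in P_HAT.items()}

@lru_cache(maxsize=None)
def pr_fail(s_hat: str, k: int) -> float:
    """Pr_{s_hat}^{sigma,tau}(reach a fail state within k steps), following the recursion above."""
    if s_hat in FAIL:
        return 1.0
    if k == 0:
        return 0.0
    succs = tau[(s_hat, sigma[s_hat])]
    return sum(p * pr_fail(t, k - 1) for t, p in succs.items())

print(pr_fail("s0", 3))  # conservative bound on the failure probability within 3 steps
```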
Since this is defined recursively, we prove (2) by induction over \(k\). For the case \(k=0\), the definitions of \({ Pr _{s}}(\Diamond ^{\leqslant 0} fail )\) and \({ Pr _{\hat{s}}}(\Diamond ^{\leqslant 0} fail )\) are equivalent: they equal 1 if \(s\models fail \) (or \(\hat{s}\models fail \)) and 0 otherwise. From Definition 7, \(s\models fail \) implies \(\hat{s}\models fail \). Therefore, \({ Pr _{s}}(\Diamond ^{\leqslant 0} fail ) \ \leqslant \ { Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant 0} fail )\).
Next, for the inductive step, we assume, as the inductive hypothesis, that \({ Pr _{s'}}(\Diamond ^{\leqslant k-1} fail ) \ \leqslant \ { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail )\) for all \(s'\in S\) and \(\hat{s}'\in \hat{S}\) with \(s'\in \hat{s}'\). If \(\hat{s}\models fail \), then \({ Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail )=1 \ \geqslant \ { Pr _{s}}(\Diamond ^{\leqslant k} fail )\). Otherwise we have:
\[ \begin{array}{rcl} { Pr _{s}}(\Diamond ^{\leqslant k} fail ) & = & \sum _{s'\in S} {\mathbf {P}}(s,s')\cdot { Pr _{s'}}(\Diamond ^{\leqslant k-1} fail ) \\[2pt] & \leqslant & \sum _{\hat{s}'\in \hat{S}} \Big (\sum _{s'\in \hat{s}'} {\mathbf {P}}(s,s')\Big )\cdot { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail ) \\[2pt] & \leqslant & \sum _{\hat{s}'\in \hat{S}} \hat{{\mathbf {P}}}_{U}(\hat{s},j_s,\hat{s}')\cdot { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail ) \\[2pt] & = & \sum _{\hat{s}'\in \hat{S}} \hat{{\mathbf {P}}}_{\tau }(\hat{s},j_s,\hat{s}')\cdot { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail ) \ = \ { Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail ), \end{array} \]
where the first inequality groups the successors \(s'\) of \(s\) by the abstract states \(\hat{s}'\) containing them and applies the inductive hypothesis, and the second follows from Definition 7, since the total probability of moving from \(s\) to states in \(\hat{s}'\) is at most \(\hat{{\mathbf {P}}}_{U}(\hat{s},j_s,\hat{s}')\), which completes the proof.
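For completeness, the quantity \({ Pr _{\hat{s}}^{\max \max }}(\Diamond ^{\leqslant k} fail )\) bounded by Theorem 1 can be computed by \(k\) steps of interval value iteration, maximising over both the policy choices and the environment's choice of a distribution consistent with the intervals. The sketch below is only an illustration of one standard way to do this on a hypothetical toy model (the paper's tool chain instead relies on probabilistic model checking): for each choice, probability mass is pushed greedily onto the successors with the highest current values, subject to the interval bounds.

```python
# Illustration only (not from the paper's tool chain): k steps of interval value
# iteration computing Pr^{maxmax} for a toy IMDP, maximising over choices and over
# distributions consistent with the probability intervals. All names are hypothetical.

FAIL = {"s_bad"}

# INTERVALS[s_hat] is a list of choices; each choice maps successors to (lower, upper).
INTERVALS = {
    "s0": [{"s1": (0.6, 0.8), "s_bad": (0.2, 0.4)}],
    "s1": [{"s0": (0.5, 0.7), "s_bad": (0.3, 0.5)}],
    "s_bad": [{"s_bad": (1.0, 1.0)}],
}

def max_expected_value(choice, value):
    """Maximise sum_t p(t)*value[t] over distributions p with p(t) inside the intervals."""
    p = {t: lo for t, (lo, _hi) in choice.items()}   # start from all lower bounds
    slack = 1.0 - sum(p.values())                     # probability mass still to distribute
    for t in sorted(choice, key=lambda t: value[t], reverse=True):
        lo, hi = choice[t]
        inc = min(hi - lo, slack)                     # greedily favour high-value successors
        p[t] += inc
        slack -= inc
    return sum(p[t] * value[t] for t in p)

def pr_maxmax(k):
    """Pr^{maxmax}(reach a fail state within k steps) for every abstract state."""
    value = {s: (1.0 if s in FAIL else 0.0) for s in INTERVALS}
    for _ in range(k):
        value = {
            s: 1.0 if s in FAIL else max(max_expected_value(c, value) for c in INTERVALS[s])
            for s in INTERVALS
        }
    return value

print(pr_maxmax(3)["s0"])  # upper bound on the failure probability within 3 steps from s0
```

By Theorem 1, the value returned for an abstract state upper-bounds the failure probability of every concrete state it contains, under the common IMDP semantics assumed in this sketch.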