Abstract
Deep reinforcement learning is an increasingly popular technique for synthesising policies to control an agent’s interaction with its environment. There is also growing interest in formally verifying that such policies are correct and execute safely. Progress has been made in this area by building on existing work for verification of deep neural networks and of continuous-state dynamical systems. In this paper, we tackle the problem of verifying probabilistic policies for deep reinforcement learning, which are used to, for example, tackle adversarial environments, break symmetries and manage trade-offs. We propose an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy’s execution, and present techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement and probabilistic model checking. We implement our approach and illustrate its effectiveness on a selection of reinforcement learning benchmarks.
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 834115, FUN2MODEL).
Appendix: Proof of Theorem 1
We provide here a proof of Theorem 1, from Sect. 3, which states that:
Given a state \(s\in S\) of an RL execution model DTMP, and an abstract state \(\hat{s}\in \hat{S}\) of the corresponding controller abstraction IMDP for which \(s\in \hat{s}\), we have, for any \(k\in \mathbb {N}\):
\[ { Pr _{s}}(\Diamond ^{\leqslant k} fail ) \ \leqslant \ { Pr _{\hat{s}}^{\max \max }}(\Diamond ^{\leqslant k} fail ). \]
By the definition of \({ Pr _{\hat{s}}^{\max \max }}(\cdot )\), it suffices to show that there is some policy \(\sigma \) and some environment policy \(\tau \) in the IMDP such that:
\[ { Pr _{s}}(\Diamond ^{\leqslant k} fail ) \ \leqslant \ { Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail ). \qquad (2) \]
Recall that, in the construction of the IMDP (see Definition 7), an abstract state \(\hat{s}\) is associated with a partition of \(\hat{s}\) into subsets \(\hat{s}_j\), each of which is used to define the \(j\)-labelled choice in state \(\hat{s}\). Let \(\sigma \) be the policy that picks in each state \(s\) (regardless of history) the unique index \(j_s\) such that \(s\in \hat{s}_{j_s}\), and let \(\tau \) be the environment policy that selects the upper bound of the interval for every transition probability. We write \(\hat{{\mathbf {P}}}_{\tau }\) for the chosen probabilities, i.e., \(\hat{{\mathbf {P}}}_{\tau }(\hat{s},j_s,\hat{s}') = \hat{{\mathbf {P}}}_{U}(\hat{s},j_s,\hat{s}')\) for all \(\hat{s},j_s,\hat{s}'\).
The probabilities \({ Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail )\) for these policies, starting in \(\hat{s}\), are defined similarly to those for discrete-time Markov processes (see Sect. 2):
\[ { Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail ) \ = \ \left\{ \begin{array}{ll} 1 & \text {if } \hat{s}\models fail \\ 0 & \text {if } \hat{s}\not \models fail \text { and } k=0 \\ \sum _{\hat{s}'\in \hat{S}} \hat{{\mathbf {P}}}_{\tau }(\hat{s},j,\hat{s}')\cdot { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail ) & \text {otherwise,} \end{array}\right. \]
where \(j\) is the index chosen by \(\sigma \) in \(\hat{s}\).
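As an aside, the following Python sketch evaluates this recursion on a small hand-written IMDP under a fixed policy \(\sigma \) and the upper-bound environment policy \(\tau \) used in the proof. It is an illustration added here, not part of the paper's implementation; the toy model and all identifiers are hypothetical.

```python
# Illustration only (not from the paper): bounded-reachability probabilities in an
# IMDP under a fixed policy sigma and the environment policy tau from the proof,
# which always selects the upper interval bound. The toy model is hypothetical.
from functools import lru_cache

FAIL = {"s_bad"}  # abstract states labelled fail

# P_HAT[(s_hat, j)][s_hat'] = (lower, upper) interval for the j-labelled choice in s_hat
P_HAT = {
    ("s0", 0): {"s1": (0.6, 0.8), "s_bad": (0.2, 0.4)},
    ("s1", 0): {"s0": (0.5, 0.7), "s_bad": (0.3, 0.5)},
    ("s_bad", 0): {"s_bad": (1.0, 1.0)},
}

sigma = {"s0": 0, "s1": 0, "s_bad": 0}  # policy: chosen choice index in each abstract state

# tau picks the upper bound of every interval; as in the proof, these values need not
# sum to 1, which only makes the resulting bound more conservative.
tau = {key: {t: ub for t, (_lb, ub) in succs.items()} for key, succs in P_HAT.items()}

@lru_cache(maxsize=None)
def pr_fail(s_hat: str, k: int) -> float:
    """Pr_{s_hat}^{sigma,tau}(reach a fail state within k steps), following the recursion above."""
    if s_hat in FAIL:
        return 1.0
    if k == 0:
        return 0.0
    succs = tau[(s_hat, sigma[s_hat])]
    return sum(p * pr_fail(t, k - 1) for t, p in succs.items())

print(pr_fail("s0", 3))  # conservative bound on the failure probability within 3 steps
```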
Since this is defined recursively, we prove (2) by induction over \(k\). For the case \(k=0\), the definitions of \({ Pr _{s}}(\Diamond ^{\leqslant 0} fail )\) and \({ Pr _{\hat{s}}}(\Diamond ^{\leqslant 0} fail )\) are equivalent: they equal 1 if \(s\models fail \) (or \(\hat{s}\models fail \)) and 0 otherwise. From Definition 7, \(s\models fail \) implies \(\hat{s}\models fail \). Therefore, \({ Pr _{s}}(\Diamond ^{\leqslant 0} fail ) \ \leqslant \ { Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant 0} fail )\).
Next, for the inductive step, we assume, as the inductive hypothesis, that \({ Pr _{s'}}(\Diamond ^{\leqslant k-1} fail ) \ \leqslant \ { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail )\) for all \(s'\in S\) and \(\hat{s}'\in \hat{S}\) with \(s'\in \hat{s}'\). If \(\hat{s}\models fail \), then \({ Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail )=1 \ \geqslant \ { Pr _{s}}(\Diamond ^{\leqslant k} fail )\). Otherwise we have:
\[ \begin{array}{rcl} { Pr _{s}}(\Diamond ^{\leqslant k} fail ) & = & \sum _{s'\in S} {\mathbf {P}}(s,s')\cdot { Pr _{s'}}(\Diamond ^{\leqslant k-1} fail ) \\[2pt] & \leqslant & \sum _{\hat{s}'\in \hat{S}} \Big (\sum _{s'\in \hat{s}'} {\mathbf {P}}(s,s')\Big )\cdot { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail ) \\[2pt] & \leqslant & \sum _{\hat{s}'\in \hat{S}} \hat{{\mathbf {P}}}_{U}(\hat{s},j_s,\hat{s}')\cdot { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail ) \\[2pt] & = & \sum _{\hat{s}'\in \hat{S}} \hat{{\mathbf {P}}}_{\tau }(\hat{s},j_s,\hat{s}')\cdot { Pr _{\hat{s}'}^{\sigma ,\tau }}(\Diamond ^{\leqslant k-1} fail ) \ = \ { Pr _{\hat{s}}^{\sigma ,\tau }}(\Diamond ^{\leqslant k} fail ), \end{array} \]
where the first inequality groups the successors \(s'\) of \(s\) by the abstract states \(\hat{s}'\) containing them and applies the inductive hypothesis, and the second follows from Definition 7, since the total probability of moving from \(s\) to states in \(\hat{s}'\) is at most \(\hat{{\mathbf {P}}}_{U}(\hat{s},j_s,\hat{s}')\), which completes the proof.
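For completeness, the quantity \({ Pr _{\hat{s}}^{\max \max }}(\Diamond ^{\leqslant k} fail )\) bounded by Theorem 1 can be computed by \(k\) steps of interval value iteration, maximising over both the policy choices and the environment's choice of a distribution consistent with the intervals. The sketch below is only an illustration of one standard way to do this on a hypothetical toy model (the paper's tool chain instead relies on probabilistic model checking): for each choice, probability mass is pushed greedily onto the successors with the highest current values, subject to the interval bounds.

```python
# Illustration only (not from the paper's tool chain): k steps of interval value
# iteration computing Pr^{maxmax} for a toy IMDP, maximising over choices and over
# distributions consistent with the probability intervals. All names are hypothetical.

FAIL = {"s_bad"}

# INTERVALS[s_hat] is a list of choices; each choice maps successors to (lower, upper).
INTERVALS = {
    "s0": [{"s1": (0.6, 0.8), "s_bad": (0.2, 0.4)}],
    "s1": [{"s0": (0.5, 0.7), "s_bad": (0.3, 0.5)}],
    "s_bad": [{"s_bad": (1.0, 1.0)}],
}

def max_expected_value(choice, value):
    """Maximise sum_t p(t)*value[t] over distributions p with p(t) inside the intervals."""
    p = {t: lo for t, (lo, _hi) in choice.items()}   # start from all lower bounds
    slack = 1.0 - sum(p.values())                     # probability mass still to distribute
    for t in sorted(choice, key=lambda t: value[t], reverse=True):
        lo, hi = choice[t]
        inc = min(hi - lo, slack)                     # greedily favour high-value successors
        p[t] += inc
        slack -= inc
    return sum(p[t] * value[t] for t in p)

def pr_maxmax(k):
    """Pr^{maxmax}(reach a fail state within k steps) for every abstract state."""
    value = {s: (1.0 if s in FAIL else 0.0) for s in INTERVALS}
    for _ in range(k):
        value = {
            s: 1.0 if s in FAIL else max(max_expected_value(c, value) for c in INTERVALS[s])
            for s in INTERVALS
        }
    return value

print(pr_maxmax(3)["s0"])  # upper bound on the failure probability within 3 steps from s0
```

By Theorem 1, the value returned for an abstract state upper-bounds the failure probability of every concrete state it contains, under the common IMDP semantics assumed in this sketch.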