Verification and repair of control policies for safe reinforcement learning


Abstract

Reinforcement Learning is a well-known AI paradigm whereby control policies of autonomous agents can be synthesized incrementally with little or no knowledge about the properties of the environment. We are concerned with the safety of agents whose policies are learned by reinforcement, i.e., we wish to bound the risk that, once learning is over, an agent damages either the environment or itself. We propose a general-purpose automated methodology to verify policies, i.e., establish risk bounds, and to repair them, i.e., fix policies so that they comply with stated risk bounds. Our approach is based on probabilistic model checking algorithms and tools, which provide theoretical and practical means to verify risk bounds and repair policies. Considering a taxonomy of potential repair approaches tested on an artificially generated parametric domain, we show that our methodology is also more effective than comparable ones.
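
To give a rough picture of the check involved: the learned policy, together with the (estimated) transition probabilities of the environment, induces a discrete-time Markov chain, and a risk bound is an upper limit on the probability of ever reaching an unsafe state in that chain. The sketch below is only an illustration of this reduction under simplifying assumptions (a small, fully enumerated state and action space and a known transition model); all names are hypothetical, and this is not the paper's implementation, which delegates verification to probabilistic model checkers such as comics, mrmc, and prism.

    import numpy as np

    def induced_chain(P, pi):
        # Collapse MDP transition probabilities P[s][a][s'] and a stochastic
        # policy pi[s][a] into the transition matrix of the induced Markov chain.
        n = len(P)
        M = np.zeros((n, n))
        for s in range(n):
            for a, p_a in enumerate(pi[s]):
                if p_a > 0.0:
                    M[s] += p_a * np.asarray(P[s][a])
        return M

    def reach_probabilities(M, unsafe, iters=100_000, tol=1e-12):
        # Per-state probability of eventually reaching a state in `unsafe`,
        # obtained as the least fixed point of x = M x with unsafe states pinned to 1.
        x = np.zeros(M.shape[0])
        x[list(unsafe)] = 1.0
        for _ in range(iters):
            x_new = M @ x
            x_new[list(unsafe)] = 1.0
            if np.max(np.abs(x_new - x)) < tol:
                break
            x = x_new
        return x_new

    # Toy 3-state example: state 1 is a safe sink, state 2 is the unsafe sink.
    P = [
        [[0.0, 0.98, 0.02], [0.0, 0.90, 0.10]],   # state 0, two actions
        [[0.0, 1.00, 0.00], [0.0, 1.00, 0.00]],   # state 1, absorbing
        [[0.0, 0.00, 1.00], [0.0, 0.00, 1.00]],   # state 2, absorbing
    ]
    pi = [[0.8, 0.2], [1.0, 0.0], [1.0, 0.0]]      # stochastic policy pi[s][a]
    P1 = np.array([1.0, 0.0, 0.0])                 # initial state distribution

    risk = P1 @ reach_probabilities(induced_chain(P, pi), unsafe={2})
    print(f"risk = {risk:.3f}, bound 0.05 met: {bool(risk <= 0.05)}")   # risk = 0.036

In the paper, such a bound would instead be stated as a property to be checked by the model checkers above, and repair then amounts to adjusting the policy until the bound is satisfied.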


Notes

  1. International Electrotechnical Vocabulary, ref. 351-57-05, accessible online at www.electropedia.org.

  2. Notice that, given some state \(s \in S\) and some action \(a \in A\), if \(\pi(s,a) = 0\) then \(a \notin A_s\), whereas if \(\pi(s,a) = 1\) then \(\pi\) is deterministic in state \(s\).

  3. In the literature, the acronym MDP often refers to Markov decision processes as well. In this paper, the acronym MDP always denotes the decision problem, and the associated decision process is always mentioned explicitly.

  4. In all the case studies that we consider in this paper, the initial distribution of states in the domain \(\mathcal{D}\) is known in advance. If this were not the case, \(P_1(s)\) could be estimated by Learn simply by logging, at the beginning of each episode, the state sensed by the agent, and then computing the sample probability of each state from that log (a minimal sketch of this estimate follows these notes).

  5. The use of P instead of P is a matter of notational convenience. All the results presented in this section can be recast in terms of P.

  6. Values for \(\sigma_{init}\) and \(\sigma_{final}\) in Table 1 are computed by the probabilistic model checker mrmc [26].

  7. The radius ρ is not needed because we keep it fixed throughout learning and simulation. In a pure defense play, this choice does not hamper the robot’s ability to defend the goal area.

  8. Simulation and learning are performed on an Intel Core i5-480M quad core at 2.67 GHz with 4GB RAM, equipped with Ubuntu 12.04 LTS 64 bit.

  9. Verification and repair are performed on an Intel Core i3-2330M quad core at 2.20 GHz with similar RAM and OS. Verification of policies is carried out with state-of-the-art probabilistic model checkers, namely comics [1] (version 1.0), mrmc [26] (version 1.4.1), and prism [24] (version 4.0.3). All tools are run in their default configuration, with the exception of comics, for which the option --concrete is selected instead of the default --abstract.

  10. mrmc does not implement counterexample generation, and in prism this is still a beta-stage feature.
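
The estimate of \(P_1(s)\) mentioned in note 4 boils down to a frequency count over the logged initial states. The following sketch (hypothetical names, not the paper's code) shows the idea:

    from collections import Counter

    def estimate_initial_distribution(initial_state_log):
        # Sample estimate of P_1(s): the fraction of logged episodes that started in state s.
        total = len(initial_state_log)
        return {s: c / total for s, c in Counter(initial_state_log).items()}

    # e.g., the states sensed at the start of 1000 learning episodes:
    log = ["s0"] * 700 + ["s1"] * 300
    print(estimate_initial_distribution(log))   # {'s0': 0.7, 's1': 0.3}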

References

  1. Ábrahám E, Jansen N, Wimmer R, Katoen J, Becker B (2010) DTMC model checking by SCC reduction. In: 2010 7th international conference on the quantitative evaluation of systems (QEST). IEEE, pp 37–46

  2. Aziz A, Singhal V, Balarin F, Brayton RK, Sangiovanni-Vincentelli AL (1995) It usually works: the temporal logic of stochastic systems. In: Computer aided verification. Springer, pp 155–165

  3. Avriel M (2003) Nonlinear programming: analysis and methods. Courier Corporation

  4. Bentivegna DC, Atkeson CG, Ude A, Cheng G (2004) Learning to act from observation and practice. Int J Humanoid Robot 1(4)

  5. Barto A, Crites RH (1996) Improving elevator performance using reinforcement learning. Adv Neural Inf Process Syst 8:1017–1023


  6. Boutilier C, Dean T, Hanks S (1999) Decision-theoretic planning: structural assumptions and computational leverage. J Artif Intell Res 11:1–94


  7. Buccafurri F, Eiter T, Gottlob G, Leone N et al (1999) Enhancing model checking in verification by AI techniques. Artif Intell 112(1):57–104


  8. Bartocci E, Grosu R, Katsaros P, Ramakrishnan C, Smolka S (2011) Model repair for probabilistic systems. In: Tools and algorithms for the construction and analysis of systems. Springer, pp 326–340

  9. Ben-Israel A, Greville TNE (2003) Generalized inverses: theory and applications, vol 15. Springer Science & Business Media

  10. Barrett L, Narayanan S (2008) Learning all optimal policies with multiple criteria. In: Proceedings of the 25th international conference on machine learning. ACM, pp 41–47

  11. Biegler LT, Zavala VM (2009) Large-scale nonlinear programming using ipopt: an integrating framework for enterprise-wide dynamic optimization. Comput Chem Eng 33(3):575–582


  12. Cicala G, Khalili A, Metta G, Natale L, Pathak S, Pulina L, Tacchella A (2014) Engineering approaches and methods to verify software in autonomous systems. In: 13th international conference on intelligent autonomous systems (IAS-13)

  13. Courcoubetis C, Yannakakis M (1995) The complexity of probabilistic verification. J ACM (JACM) 42(4):857–907


  14. Daws C (2005) Symbolic and parametric model checking of discrete-time Markov chains. In: Theoretical aspects of computing-ICTAC 2004. Springer, pp 280–294

  15. Filieri A, Ghezzi C, Tamburrelli G (2011) Run-time efficient probabilistic model checking. In: Proceedings of the 33rd international conference on software engineering. ACM, pp 341–350

  16. García J, Fernández F (2015) A comprehensive survey on safe reinforcement learning. J Mach Learn Res 16(1):1437–1480


  17. Ghallab M, Nau D, Traverso P (2004) Automated planning: theory & practice. Elsevier

  18. Gordon DF (2000) Asimovian adaptive agents. J Artif Intell Res 13(1):95–153


  19. Grinstead CM, Snell JL (1988) Introduction to probability. American Mathematical Soc. Chapter 11

  20. Gillula JH, Tomlin CJ (2012) Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. In: ICRA, pp 2723–2730

  21. Geibel P, Wysotzki F (2005) Risk-sensitive reinforcement learning applied to control under constraints. J Artif Intell Res 24:81–108


  22. Hahn EM, Hermanns H, Wachter B, Zhang L (2010) PARAM: a model checker for parametric Markov models. In: Computer aided verification. Springer, pp 660–664

  23. Jansen N, Ábrahám E, Volk M, Wimmer R, Katoen J-P, Becker B (2012) The comics tool: computing minimal counterexamples for DTMCs. In: Automated technology for verification and analysis. Springer, pp 349–353

  24. Kwiatkowska M, Norman G, Parker D (2002) Prism: probabilistic symbolic model checker. In: Computer performance evaluation: modelling techniques and tools, pp 113–140

  25. Kwiatkowska M, Norman G, Parker D (2007) Stochastic model checking. In: Formal methods for performance evaluation. Springer, pp 220–270

  26. Katoen JP, Zapreev IS, Hahn EM, Hermanns H, Jansen DN (2011) The ins and outs of the probabilistic model checker mrmc. Perform Eval 68(2):90–104


  27. Leofante F, Vuotto S, Ábrahám E, Tacchella A, Jansen N (2016) Combining static and runtime methods to achieve safe standing-up for humanoid robots. In: Leveraging applications of formal methods, verification and validation: foundational techniques - 7th international symposium, ISoLA 2016, Imperial, Corfu, Greece, October 10-14, 2016, Proceedings, Part I, pp 496–514

  28. Morimoto J, Doya K (1998) Reinforcement learning of dynamic motor sequence: learning to stand up. In: Proceedings of the 1998 IEEE/RSJ international conference on intelligent robots and systems, vol 3, pp 1721–1726

  29. Morimoto J, Doya K (2001) Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robot Auton Syst 36(1):37–51


  30. Metta G, Natale L, Nori F, Sandini G, Vernon D, Fadiga L, von Hofsten C, Rosander K, Lopes M, Santos-Victor J et al (2010) The iCub humanoid robot: an open-systems platform for research in cognitive development. Neural Netw

  31. Metta G, Natale L, Pathak S, Pulina L, Tacchella A (2010) Safe and effective learning: a case study. In: 2010 IEEE international conference on robotics and automation, pp 4809–4814

  32. Metta G, Pathak S, Pulina L, Tacchella A (2013) Ensuring safety of policies learned by reinforcement: reaching objects in the presence of obstacles with the iCub. In: IEEE/RSJ international conference on intelligent robots and systems, pp 170–175

  33. Ng A, Coates A, Diel M, Ganapathi V, Schulte J, Tse B, Berger E, Liang E (2006) Autonomous inverted helicopter flight via reinforcement learning. In: Experimental robotics IX. Springer, pp 363–372

  34. Natarajan S, Tadepalli P (2005) Dynamic preferences in multi-criteria reinforcement learning. In: Proceedings of the 22nd international conference on machine learning. ACM, pp 601–608

  35. Pathak S, Ábrahám E, Jansen N, Tacchella A, Katoen JP (2015) A greedy approach for the efficient repair of stochastic models. In: Proc. NFM'15, volume 9058 of LNCS, pp 295–309

  36. Perkins TJ, Barto AG (2003) Lyapunov design for safe reinforcement learning. J Mach Learn Res 3:803–832


  37. Pathak S, Metta G, Tacchella A (2014) Is verification a requisite for safe adaptive robots? In: 2014 IEEE international conference on systems, man and cybernetics

  38. Pathak S, Pulina L, Tacchella A (2015) Probabilistic model checking tools for verification of robot control policies. AI Commun. To appear

  39. Puterman ML (2009) Markov decision processes: discrete stochastic dynamic programming, vol 414. Wiley

  40. Rummery GA, Niranjan M (1994) On-line Q-learning using connectionist systems. Technical report, University of Cambridge Department of Engineering

  41. Russell S, Norvig P (2003) Artificial intelligence: a modern approach, 2nd edn. Prentice Hall

  42. Sutton RS, Barto AG (1998) Reinforcement learning – an introduction. MIT Press

  43. Singh S, Jaakkola T, Littman ML, Szepesvári C (2000) Convergence results for single-step on-policy reinforcement-learning algorithms. Mach Learn 38(3):287–308


  44. Smith DJ, Simpson KGL (2004) Functional safety – a straightforward guide to applying IEC 61508 and related standards, 2nd edn. Elsevier

  45. Tesauro G (1995) Temporal difference learning and TD-Gammon. Commun ACM 38(3):58–68


  46. Wächter A, Biegler LT (2006) On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math Program 106(1):25–57


  47. Watkins CJCH, Dayan P (1992) Q-learning. Mach Learn 8(3):279–292


  48. Weld D, Etzioni O (1994) The first law of robotics (a call to arms). In: Proceedings of the 12th national conference on artificial intelligence (AAAI-94), pp 1042–1047

  49. Zhang W, Dietterich TG (1995) A reinforcement learning approach to job-shop scheduling. In: IJCAI, vol 95, pp 1114–1120

Download references

Author information


Corresponding author

Correspondence to Armando Tacchella.


Cite this article

Pathak, S., Pulina, L. & Tacchella, A. Verification and repair of control policies for safe reinforcement learning. Appl Intell 48, 886–908 (2018). https://doi.org/10.1007/s10489-017-0999-8
