Abstract
It is challenging to use reinforcement learning (RL) in cyber-physical systems due to the lack of safety guarantees during learning. Although there have been various proposals to reduce undesired behaviors during learning, most of these techniques require prior system knowledge, and their applicability is limited. This paper aims to reduce undesired behaviors during learning without requiring any prior system knowledge. We propose dynamic shielding: an extension of the model-based safe RL technique known as shielding, based on automata learning. The dynamic shielding technique constructs an approximate system model in parallel with RL using a variant of the RPNI algorithm and suppresses undesired exploration using a shield constructed from the learned model. Through this combination, potentially unsafe actions can be foreseen before the agent experiences them. Experiments show that our dynamic shield significantly decreases the number of undesired events during training.
S. Pruekprasert and T. Takisaka: The work was done during the employment of S.P. and T.T. at NII, Tokyo.
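As a rough illustration of the workflow described in the abstract, the sketch below shows how a shield built from a model learned online can restrict exploration to actions the model currently deems safe. It is not the authors' artifact: the names LearnedModel, PreemptiveShield, and is_safe are hypothetical, and the toy trace-based model stands in for the RPNI-style automaton learning used in the paper.

```python
# Minimal illustrative sketch of dynamic shielding, assuming a toy trace-based
# model in place of the RPNI-style automaton described in the paper.
# All names (LearnedModel, PreemptiveShield, is_safe) are hypothetical.
import random


class LearnedModel:
    """Toy stand-in for the system model learned online from observed traces."""

    def __init__(self):
        self.unsafe_prefixes = set()  # action sequences that led to undesired events

    def update(self, trace, undesired):
        # Record a finished trace; a real implementation would merge states (RPNI-style).
        if undesired:
            self.unsafe_prefixes.add(tuple(trace))

    def is_safe(self, trace, action):
        # Block an action if extending the current trace matches a known unsafe prefix.
        return tuple(trace + [action]) not in self.unsafe_prefixes


class PreemptiveShield:
    """Restricts the agent's choice to actions the learned model deems safe."""

    def __init__(self, model, actions):
        self.model, self.actions = model, actions

    def allowed(self, trace):
        safe = [a for a in self.actions if self.model.is_safe(trace, a)]
        return safe or self.actions  # never block every action


# Usage: at each step the agent samples only from shield.allowed(trace);
# finished traces are fed back via model.update to refine the shield online.
model = LearnedModel()
shield = PreemptiveShield(model, actions=["left", "right", "forward"])
model.update(trace=["forward", "left"], undesired=True)   # an observed bad trace
action = random.choice(shield.allowed(trace=["forward"]))  # "left" is now filtered out
```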
Notes
1. The shield we use in this paper is the variant called a preemptive shield in [1]. It is straightforward to apply our framework to the classic variant, the post-posed shield (see the sketch after these notes).
2. The artifact is publicly available at https://doi.org/10.5281/zenodo.6906673.
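As a hedged sketch of the distinction in Note 1 (not the formulation in [1]; model.is_safe is an assumed helper querying the learned safety model): a preemptive shield restricts the available actions before the agent chooses, whereas a post-posed shield lets the agent choose and then overrides an unsafe choice.

```python
# Sketch of the two shield variants mentioned in Note 1; model.is_safe is an
# assumed helper that queries the (learned) safety model for a trace/action pair.
def preemptive_shield(model, trace, actions):
    """Restrict the set of available actions *before* the agent picks one."""
    safe = [a for a in actions if model.is_safe(trace, a)]
    return safe or actions  # if nothing is provably safe, leave the choice open


def post_posed_shield(model, trace, chosen, actions):
    """Let the agent pick first, then replace an unsafe pick with a safe one."""
    if model.is_safe(trace, chosen):
        return chosen
    safe = [a for a in actions if model.is_safe(trace, a)]
    return safe[0] if safe else chosen
```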
References
Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the AAAI 2018, pp. 2669–2678. AAAI Press (2018)
Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger, S.: Run-time optimization for learned controllers through quantitative games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36
Bharadwaj, S., Bloem, R., Dimitrova, R., Könighofer, B., Topcu, U.: Synthesis of minimum-cost shields for multi-agent systems. In: Proceedings of the ACC 2019, pp. 1048–1055. IEEE (2019)
Bloem, R., Jensen, P.G., Könighofer, B., Larsen, K.G., Lorber, F., Palmisano, A.: It's time to play safe: shield synthesis for timed systems. CoRR abs/2006.16688 (2020)
Bloem, R., Könighofer, B., Könighofer, R., Wang, C.: Shield synthesis: runtime enforcement for reactive systems. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 533–548. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46681-0_51
Bouton, M., Karlsson, J., Nakhaei, A., Fujimura, K., Kochenderfer, M.J., Tumova, J.: Reinforcement learning with probabilistic guarantees for autonomous driving. CoRR abs/1904.07189 (2019)
Brockman, G., et al.: OpenAI Gym. CoRR abs/1606.01540 (2016)
Cheng, R., Orosz, G., Murray, R.M., Burdick, J.W.: End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In: Proceedings of the AAAI 2019, pp. 3387–3395. AAAI Press (2019)
Chevalier-Boisvert, M.: Gym-MiniWorld Environment for OpenAI Gym (2018). https://github.com/maximecb/gym-miniworld
García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)
Hasanbeig, M., Abate, A., Kroening, D.: Cautious reinforcement learning with logical constraints. In: Seghrouchni, A.E.F., Sukthankar, G., An, B., Yorke-Smith, N. (eds.) Proceedings of the AAMAS 2020, pp. 483–491. IFAAMAS (2020)
Hunt, N., Fulton, N., Magliacane, S., Hoang, T.N., Das, S., Solar-Lezama, A.: Verifiably safe exploration for end-to-end reinforcement learning. In: Bogomolov, S., Jungers, R.M. (eds.) Proceedings of the HSCC 2021, pp. 14:1–14:11. ACM (2021)
Isberner, M., Howar, F., Steffen, B.: The open-source LearnLib - a framework for active automata learning. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 487–495. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21690-4_32
Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields (invited paper). In: Konnov, I., Kovács, L. (eds.) Proceedings of the CONCUR 2020. LIPIcs, vol. 171, pp. 3:1–3:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)
Kupferman, O., Lampert, R.: On the construction of fine automata for safety properties. In: Graf, S., Zhang, W. (eds.) ATVA 2006. LNCS, vol. 4218, pp. 110–124. Springer, Heidelberg (2006). https://doi.org/10.1007/11901914_11
Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V., Slutzki, G. (eds.) ICGI 1998. LNCS, vol. 1433, pp. 1–12. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0054059
López, D., García, P.: On the inference of finite state automata from positive and negative data. In: Heinz, J., Sempere, J.M. (eds.) Topics in Grammatical Inference, pp. 73–112. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-48395-4_4
Mao, H., Chen, Y., Jaeger, M., Nielsen, T.D., Larsen, K.G., Nielsen, B.: Learning Markov decision processes for model checking. In: Fahrenberg, U., Legay, A., Thrane, C.R. (eds.) Proceedings of the QFM 2012. EPTCS, vol. 103, pp. 49–63 (2012)
Mnih, V., et al.: Playing Atari with deep reinforcement learning. CoRR abs/1312.5602 (2013)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Oncina, J., García, P.: Identifying regular languages in polynomial time. Series in Machine Perception and Artificial Intelligence, pp. 99–108 (1993)
Plappert, M.: Keras-RL (2016). https://github.com/keras-rl/keras-rl
Pranger, S., Könighofer, B., Tappler, M., Deixelberger, M., Jansen, N., Bloem, R.: Adaptive shielding under uncertainty. In: Proceedings of the ACC 2021, pp. 3467–3474. IEEE (2021)
Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., Dormann, N.: Stable baselines3 (2019). https://github.com/DLR-RM/stable-baselines3
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017)
Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)
Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)
Wu, M., Wang, J., Deshmukh, J., Wang, C.: Shield synthesis for real: enforcing safety in cyber-physical systems. In: Barrett, C.W., Yang, J. (eds.) Proceedings of the FMCAD 2019, pp. 129–137. IEEE (2019)
Acknowledgements
This work is partially supported by JST ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603). Masaki Waga is also supported by JST ACT-X Grant No. JPMJAX200U. Stefan Klikovits is also supported by JSPS Grant-in-Aid No. 20K23334. Sasinee Pruekprasert is also supported by JSPS Grant-in-Aid No. 21K14191. Toru Takisaka is also supported by NSFC Research Fund for International Young Scientists No. 62150410437.
Copyright information
Ā© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Waga, M., Castellano, E., Pruekprasert, S., Klikovits, S., Takisaka, T., Hasuo, I. (2022). Dynamic Shielding for Reinforcement Learning in Black-Box Environments. In: Bouajjani, A., Holík, L., Wu, Z. (eds) Automated Technology for Verification and Analysis. ATVA 2022. Lecture Notes in Computer Science, vol 13505. Springer, Cham. https://doi.org/10.1007/978-3-031-19992-9_2
DOI: https://doi.org/10.1007/978-3-031-19992-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19991-2
Online ISBN: 978-3-031-19992-9