Abstract
It is challenging to use reinforcement learning (RL) in cyber-physical systems due to the lack of safety guarantees during learning. Although there have been various proposals to reduce undesired behaviors during learning, most of these techniques require prior system knowledge, and their applicability is limited. This paper aims to reduce undesired behaviors during learning without requiring any prior system knowledge. We propose dynamic shielding: an extension of the model-based safe RL technique known as shielding, based on automata learning. The dynamic shielding technique constructs an approximate system model in parallel with RL using a variant of the RPNI algorithm and suppresses undesired exploration using a shield constructed from the learned model. Through this combination, potentially unsafe actions can be foreseen before the agent experiences them. Experiments show that our dynamic shield significantly decreases the number of undesired events during training.
S. Pruekprasert and T. Takisaka: The work was done during the employment of S.P. and T.T. at NII, Tokyo.
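As a rough illustration of the workflow described in the abstract, the sketch below shows how a shield built from a model learned online can restrict exploration to actions the model currently deems safe. It is not the authors' artifact: the names LearnedModel, PreemptiveShield, and is_safe are hypothetical, and the toy trace-based model stands in for the RPNI-style automaton learning used in the paper.

```python
# Minimal illustrative sketch of dynamic shielding, assuming a toy trace-based
# model in place of the RPNI-style automaton described in the paper.
# All names (LearnedModel, PreemptiveShield, is_safe) are hypothetical.
import random


class LearnedModel:
    """Toy stand-in for the system model learned online from observed traces."""

    def __init__(self):
        self.unsafe_prefixes = set()  # action sequences that led to undesired events

    def update(self, trace, undesired):
        # Record a finished trace; a real implementation would merge states (RPNI-style).
        if undesired:
            self.unsafe_prefixes.add(tuple(trace))

    def is_safe(self, trace, action):
        # Block an action if extending the current trace matches a known unsafe prefix.
        return tuple(trace + [action]) not in self.unsafe_prefixes


class PreemptiveShield:
    """Restricts the agent's choice to actions the learned model deems safe."""

    def __init__(self, model, actions):
        self.model, self.actions = model, actions

    def allowed(self, trace):
        safe = [a for a in self.actions if self.model.is_safe(trace, a)]
        return safe or self.actions  # never block every action


# Usage: at each step the agent samples only from shield.allowed(trace);
# finished traces are fed back via model.update to refine the shield online.
model = LearnedModel()
shield = PreemptiveShield(model, actions=["left", "right", "forward"])
model.update(trace=["forward", "left"], undesired=True)   # an observed bad trace
action = random.choice(shield.allowed(trace=["forward"]))  # "left" is now filtered out
```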
Notes
1. The shield we use in this paper is the variant called a preemptive shield in [1]. It is straightforward to apply our framework to the classic variant, the post-posed shield (see the sketch after these notes).
2. The artifact is publicly available at https://doi.org/10.5281/zenodo.6906673.
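As a hedged sketch of the distinction in Note 1 (not the formulation in [1]; model.is_safe is an assumed helper querying the learned safety model): a preemptive shield restricts the available actions before the agent chooses, whereas a post-posed shield lets the agent choose and then overrides an unsafe choice.

```python
# Sketch of the two shield variants mentioned in Note 1; model.is_safe is an
# assumed helper that queries the (learned) safety model for a trace/action pair.
def preemptive_shield(model, trace, actions):
    """Restrict the set of available actions *before* the agent picks one."""
    safe = [a for a in actions if model.is_safe(trace, a)]
    return safe or actions  # if nothing is provably safe, leave the choice open


def post_posed_shield(model, trace, chosen, actions):
    """Let the agent pick first, then replace an unsafe pick with a safe one."""
    if model.is_safe(trace, chosen):
        return chosen
    safe = [a for a in actions if model.is_safe(trace, a)]
    return safe[0] if safe else chosen
```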
References
Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the AAAI 2018, pp. 2669–2678. AAAI Press (2018)
Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger, S.: Run-time optimization for learned controllers through quantitative games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36
Bharadwaj, S., Bloem, R., Dimitrova, R., Könighofer, B., Topcu, U.: Synthesis of minimum-cost shields for multi-agent systems. In: Proceedings of the ACC 2019, pp. 1048–1055. IEEE (2019)
Bloem, R., Jensen, P.G., Könighofer, B., Larsen, K.G., Lorber, F., Palmisano, A.: It's time to play safe: shield synthesis for timed systems. CoRR abs/2006.16688 (2020)
Bloem, R., Könighofer, B., Könighofer, R., Wang, C.: Shield synthesis: runtime enforcement for reactive systems. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 533–548. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46681-0_51
Bouton, M., Karlsson, J., Nakhaei, A., Fujimura, K., Kochenderfer, M.J., Tumova, J.: Reinforcement learning with probabilistic guarantees for autonomous driving. CoRR abs/1904.07189 (2019)
Brockman, G., et al.: OpenAI Gym. CoRR abs/1606.01540 (2016)
Cheng, R., Orosz, G., Murray, R.M., Burdick, J.W.: End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In: Proceedings of the AAAI 2019, pp. 3387–3395. AAAI Press (2019)
Chevalier-Boisvert, M.: Gym-MiniWorld Environment for OpenAI Gym (2018). https://github.com/maximecb/gym-miniworld
García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)
Hasanbeig, M., Abate, A., Kroening, D.: Cautious reinforcement learning with logical constraints. In: Seghrouchni, A.E.F., Sukthankar, G., An, B., Yorke-Smith, N. (eds.) Proceedings of the AAMAS 2020, pp. 483–491. IFAAMAS (2020)
Hunt, N., Fulton, N., Magliacane, S., Hoang, T.N., Das, S., Solar-Lezama, A.: Verifiably safe exploration for end-to-end reinforcement learning. In: Bogomolov, S., Jungers, R.M. (eds.) Proceedings of the HSCC 2021, pp. 14:1–14:11. ACM (2021)
Isberner, M., Howar, F., Steffen, B.: The open-source LearnLib - a framework for active automata learning. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 487–495. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21690-4_32
Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields (invited paper). In: Konnov, I., Kovács, L. (eds.) Proceedings of the CONCUR 2020. LIPIcs, vol. 171, pp. 3:1–3:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)
Kupferman, O., Lampert, R.: On the construction of fine automata for safety properties. In: Graf, S., Zhang, W. (eds.) ATVA 2006. LNCS, vol. 4218, pp. 110–124. Springer, Heidelberg (2006). https://doi.org/10.1007/11901914_11
Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V., Slutzki, G. (eds.) ICGI 1998. LNCS, vol. 1433, pp. 1–12. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0054059
López, D., García, P.: On the inference of finite state automata from positive and negative data. In: Heinz, J., Sempere, J.M. (eds.) Topics in Grammatical Inference, pp. 73–112. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-48395-4_4
Mao, H., Chen, Y., Jaeger, M., Nielsen, T.D., Larsen, K.G., Nielsen, B.: Learning Markov decision processes for model checking. In: Fahrenberg, U., Legay, A., Thrane, C.R. (eds.) Proceedings of the QFM 2012. EPTCS, vol. 103, pp. 49–63 (2012)
Mnih, V., et al.: Playing Atari with deep reinforcement learning. CoRR abs/1312.5602 (2013)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Oncina, J., García, P.: Identifying regular languages in polynomial time. Series in Machine Perception and Artificial Intelligence, pp. 99–108 (1993)
Plappert, M.: Keras-RL (2016). https://github.com/keras-rl/keras-rl
Pranger, S., Könighofer, B., Tappler, M., Deixelberger, M., Jansen, N., Bloem, R.: Adaptive shielding under uncertainty. In: Proceedings of the ACC 2021, pp. 3467–3474. IEEE (2021)
Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., Dormann, N.: Stable baselines3 (2019). https://github.com/DLR-RM/stable-baselines3
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017)
Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)
Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)
Wu, M., Wang, J., Deshmukh, J., Wang, C.: Shield synthesis for real: enforcing safety in cyber-physical systems. In: Barrett, C.W., Yang, J. (eds.) Proceedings of the FMCAD 2019, pp. 129–137. IEEE (2019)
Acknowledgements
This work is partially supported by JST ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603). Masaki Waga is also supported by JST ACT-X Grant No. JPMJAX200U. Stefan Klikovits is also supported by JSPS Grant-in-Aid No. 20K23334. Sasinee Pruekprasert is also supported by JSPS Grant-in-Aid No. 21K14191. Toru Takisaka is also supported by NSFC Research Fund for International Young Scientists No. 62150410437.
Copyright information
Ā© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Waga, M., Castellano, E., Pruekprasert, S., Klikovits, S., Takisaka, T., Hasuo, I. (2022). Dynamic Shielding for Reinforcement Learning in Black-Box Environments. In: Bouajjani, A., Holík, L., Wu, Z. (eds) Automated Technology for Verification and Analysis. ATVA 2022. Lecture Notes in Computer Science, vol 13505. Springer, Cham. https://doi.org/10.1007/978-3-031-19992-9_2
DOI: https://doi.org/10.1007/978-3-031-19992-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19991-2
Online ISBN: 978-3-031-19992-9