
On Normative Reinforcement Learning via Safe Reinforcement Learning

  • Conference paper
PRIMA 2022: Principles and Practice of Multi-Agent Systems (PRIMA 2022)

Abstract

Reinforcement learning (RL) has proven a successful technique for teaching autonomous agents goal-directed behaviour. As RL agents further integrate with our society, they must learn to comply with ethical, social, or legal norms. Defeasible deontic logics are natural formal frameworks to specify and reason about such norms in a transparent way. However, their effective and efficient integration into RL agents remains an open problem. On the other hand, linear temporal logic (LTL) has been successfully employed to synthesize RL policies satisfying, e.g., safety requirements. In this paper, we investigate the extent to which the established machinery for safe reinforcement learning can be leveraged to direct normative behaviour in RL agents. We analyze some of the difficulties that arise from attempting to represent norms with LTL, provide an algorithm for synthesizing LTL specifications from certain normative systems, and analyze its power and limits with a case study.
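
As a concrete illustration of the safe-RL machinery the paper builds on, the following is a minimal sketch of shielding a Q-learning agent with a safety invariant of the form \(G(\lnot unsafe)\): before each step, actions whose successor state would violate the invariant are filtered out, and the learner chooses only among the remaining ones. This is an assumption-laden toy, not the paper's implementation; the environment interface (reset, actions, successor, is_unsafe, step) is hypothetical and assumes deterministic successors.

```python
# Hypothetical sketch of LTL-style shielding in a Q-learning loop.
# All names and the environment interface are assumptions for
# illustration; they are not the paper's implementation.

import random
from collections import defaultdict

def shielded_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # (state, action) -> estimated value
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Shield: keep only actions whose (deterministic) successor
            # satisfies the invariant G(!unsafe).
            allowed = [a for a in env.actions(state)
                       if not env.is_unsafe(env.successor(state, a))]
            if not allowed:                   # no compliant action exists
                allowed = env.actions(state)  # fall back to all actions
            # epsilon-greedy choice restricted to the allowed actions
            if random.random() < eps:
                action = random.choice(allowed)
            else:
                action = max(allowed, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(state, action)
            best_next = max((Q[(next_state, a)]
                             for a in env.actions(next_state)), default=0.0)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```

In a stochastic environment a shield must additionally reason about the probability of reaching unsafe states, which is what tools such as TEMPEST [25] address.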

This work was supported by the DC-RES, run by TU Wien's Faculty of Informatics and the FH Technikum Wien, and by the WWTF project MA16–028.


Notes

1. Any defeasible deontic logic equipped with a theorem prover could in theory be used.

2. An implementation of Algorithms 1 and 2 can be found here: https://github.com/lexeree/normative-player-characters.

3. The LTL specifications have not been implemented as shields, since the shielding tool TEMPEST [25] is still under development. We instead manually chose optimal paths from among those obeying the compliance specifications; a minimal trace-checking sketch is given after these notes.

4. Note that Spec. (6) is semantically equivalent to \(G(\lnot at\_danger)\).

5. Note that Spec. (7) is semantically equivalent to \(G(at\_danger\rightarrow (\lnot empty \wedge X(empty)))\).
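
To make the manual compliance checking from Note 3 concrete, here is a small sketch that evaluates a finite trace against the two formulas from Notes 4 and 5. The trace encoding (a list of sets of atomic propositions) and the LTLf-style reading of X (a next step must exist) are assumptions for illustration, not the paper's formal setup.

```python
# Hypothetical compliance check for finite traces against the formulas
# in Notes 4 and 5. A trace is a list of sets of atomic propositions;
# X is read LTLf-style: X(phi) requires that a next step exists.

def holds_spec6(trace):
    """G(!at_danger): at_danger never holds along the trace."""
    return all('at_danger' not in labels for labels in trace)

def holds_spec7(trace):
    """G(at_danger -> (!empty & X(empty)))."""
    for i, labels in enumerate(trace):
        if 'at_danger' in labels:
            if 'empty' in labels:                 # !empty violated now
                return False
            if i + 1 == len(trace) or 'empty' not in trace[i + 1]:
                return False                      # X(empty) violated
    return True

# Example: a trace that visits at_danger once and is emptied right after
# satisfies Spec. (7) but not Spec. (6).
trace = [set(), {'at_danger'}, {'empty'}, set()]
assert holds_spec7(trace) and not holds_spec6(trace)
```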

References

1. Alechina, N., Dastani, M., Logan, B.: Norm specification and verification in multi-agent systems. J. Appl. Logics 5(2), 457 (2018)

  2. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: Proceedings of AAAI, pp. 2669–2678 (2018)

  3. Boella, G., van der Torre, L.: Permissions and obligations in hierarchical normative systems. In: Proceedings of ICAIL, pp. 81–82 (2003)

  4. Boella, G., van der Torre, L.: Regulative and constitutive norms in normative multiagent systems. In: Proceedings of KR 2004, pp. 255–266. AAAI Press (2004)

  5. De Giacomo, G., De Masellis, R., Grasso, M., Maggi, F.M., Montali, M.: Monitoring business metaconstraints based on LTL and LDL for finite traces. In: Sadiq, S., Soffer, P., Völzer, H. (eds.) Business Process Management, pp. 1–17 (2014)

  6. De Giacomo, G., Iocchi, L., Favorito, M., Patrizi, F.: Foundations for restraining bolts: reinforcement learning with LTLf/LDLf restraining specifications. In: Proceedings of ICAPS, vol. 29, pp. 128–136 (2019)

  7. Esparza, J., Křetínský, J.: From LTL to deterministic automata: a Safraless compositional approach. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559, pp. 192–208. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08867-9_13

  8. Forrester, J.W.: Gentle murder, or the adverbial Samaritan. J. Philos. 81(4), 193–197 (1984)

  9. Fu, J., Topcu, U.: Probably approximately correct MDP learning and control with temporal logic constraints. In: Proceedings of RSS (2014)

  10. Governatori, G.: Thou shalt is not you will. In: Proceedings of ICAIL, pp. 63–68 (2015)

  11. Governatori, G.: Practical normative reasoning with defeasible deontic logic. In: d’Amato, C., Theobald, M. (eds.) Reasoning Web 2018. LNCS, vol. 11078, pp. 1–25. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00338-8_1

  12. Governatori, G., Hashmi, M.: No time for compliance. In: Proceedings of EDOC, pp. 9–18. IEEE (2015)

  13. Governatori, G., Hulstijn, J., Riveret, R., Rotolo, A.: Characterising deadlines in temporal modal defeasible logic. In: Orgun, M.A., Thornton, J. (eds.) AI 2007. LNCS (LNAI), vol. 4830, pp. 486–496. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76928-6_50

  14. Governatori, G., Olivieri, F., Rotolo, A., Scannapieco, S.: Computing strong and weak permissions in defeasible logic. J. Philos. Logic 42(6), 799–829 (2013)

  15. Governatori, G., Rotolo, A.: BIO logical agents: norms, beliefs, intentions in defeasible logic. J. Auton. Agents Multi Agent Syst. 17(1), 36–69 (2008)

  16. Hasanbeig, M., Abate, A., Kroening, D.: Cautious reinforcement learning with logical constraints. In: Proceedings of AAMAS, pp. 483–491 (2020)

  17. Hodkinson, I., Reynolds, M.: Temporal logic. In: Blackburn, P., van Benthem, J., Wolter, F. (eds.) Handbook of Modal Logic, vol. 3, pp. 655–720. Elsevier (2007)

  18. Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields. In: Proceedings of CONCUR. LIPIcs, vol. 171, pp. 1–16 (2020)

  19. Lam, H.P., Governatori, G.: The making of SPINdle. In: Proceedings of RuleML. LNCS, vol. 5858, pp. 315–322 (2009)

  20. Neufeld, E., Bartocci, E., Ciabattoni, A., Governatori, G.: A normative supervisor for reinforcement learning agents. In: Proceedings of CADE, pp. 565–576 (2021)

  21. Neufeld, E.A., Bartocci, E., Ciabattoni, A., Governatori, G.: Enforcing ethical goals over reinforcement-learning policies. J. Ethics Inf. Technol. 24, 43 (2022). https://doi.org/10.1007/s10676-022-09665-8

  22. Noothigattu, R., et al.: Teaching AI agents ethical values using reinforcement learning and policy orchestration. In: Proceedings of IJCAI, LNCS, vol. 12158, pp. 217–234 (2019)

  23. Panagiotidi, S., Alvarez-Napagao, S., Vázquez-Salceda, J.: Towards the norm-aware agent: bridging the gap between deontic specifications and practical mechanisms for norm monitoring and norm-aware planning. In: Balke, T., Dignum, F., van Riemsdijk, M.B., Chopra, A.K. (eds.) COIN 2013. LNCS (LNAI), vol. 8386, pp. 346–363. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07314-9_19

  24. Pnueli, A.: The temporal logic of programs. In: Proceedings of FOCS, pp. 46–57 (1977)

  25. Pranger, S., Könighofer, B., Posch, L., Bloem, R.: TEMPEST - synthesis tool for reactive systems and shields in probabilistic environments. In: Hou, Z., Ganesh, V. (eds.) ATVA 2021. LNCS, vol. 12971, pp. 222–228. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-88885-5_15

  26. Rodriguez-Soto, M., Lopez-Sanchez, M., Rodriguez Aguilar, J.A.: Multi-objective reinforcement learning for designing ethical environments. In: Proceedings of IJCAI, pp. 545–551 (2021)

  27. Sadigh, D., Kim, E.S., Coogan, S., Sastry, S.S., Seshia, S.A.: A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. In: Proceedings of CDC, pp. 1091–1096 (2014)

  28. Searle, J.R.: Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge (1969)

  29. Sickert, S., Esparza, J., Jaax, S., Křetínský, J.: Limit-deterministic Büchi automata for linear temporal logic. In: Proceedings of CAV. LNCS, vol. 9780, pp. 312–332 (2016)

  30. Watkins, C.J.C.H.: Learning from Delayed Rewards. Ph.D. thesis, King’s College, Cambridge, UK (1989). https://www.cs.rhul.ac.uk/~chrisw/thesis.pdf

  31. Wen, M., Ehlers, R., Topcu, U.: Correct-by-synthesis reinforcement learning with temporal logic constraints. In: Proceedings of IROS, pp. 4983–4990. IEEE (2015)

  32. Wu, Y.H., Lin, S.D.: A low-cost ethics shaping approach for designing reinforcement learning agents. In: Proceedings of AAAI, pp. 1687–1694 (2018)


Author information

Correspondence to Emery A. Neufeld.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Neufeld, E.A., Bartocci, E., Ciabattoni, A. (2023). On Normative Reinforcement Learning via Safe Reinforcement Learning. In: Aydoğan, R., Criado, N., Lang, J., Sanchez-Anguix, V., Serramia, M. (eds) PRIMA 2022: Principles and Practice of Multi-Agent Systems. PRIMA 2022. Lecture Notes in Computer Science, vol 13753. Springer, Cham. https://doi.org/10.1007/978-3-031-21203-1_5


  • DOI: https://doi.org/10.1007/978-3-031-21203-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21202-4

  • Online ISBN: 978-3-031-21203-1

