
No Free Lunch: Overcoming Reward Gaming in AI Safety Gridworlds

  • Conference paper
Computer Safety, Reliability, and Security. SAFECOMP 2021 Workshops (SAFECOMP 2021)

Abstract

We present two heuristics for tackling the problem of reward gaming by self-modification in reinforcement learning agents. Reward gaming occurs when the agent’s reward function is mis-specified and the agent can achieve a high reward by altering or otherwise fooling its sensors rather than by performing the desired actions. Our first heuristic tracks the rewards encountered in the environment and converts high rewards that fall outside the normal distribution into penalties. Our second heuristic relies on the existence of some validation action that the agent can take to check a reward. In this heuristic, on encountering an abnormally high reward, the agent performs a validation step before either accepting the reward as it is or converting it into a penalty. We evaluate the performance of these heuristics on variants of the tomato-watering problem from the AI Safety Gridworlds suite.

Work supported by EPSRC Grant EP/V026801/1 Trustworthy Autonomous Systems Verifiability Node.
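
The abstract describes the two heuristics only at a high level. The following is a minimal Python sketch of how they could be combined, assuming a simple z-score outlier test over the reward history and a caller-supplied validate() callback standing in for the validation action; the class name, threshold, and penalty value are hypothetical choices for illustration, not taken from the paper.

    import numpy as np

    class RewardGamingGuard:
        """Filter raw environment rewards. Illustrative sketch only: the
        z-score test, threshold, and penalty value are assumptions, not
        the paper's exact formulation."""

        def __init__(self, z_threshold=3.0, penalty=-1.0, min_samples=30):
            self.history = []              # rewards accepted so far
            self.z_threshold = z_threshold
            self.penalty = penalty
            self.min_samples = min_samples

        def _is_abnormal(self, reward):
            # Heuristic 1: flag rewards far above the empirical distribution.
            if len(self.history) < self.min_samples:
                return False
            mean = float(np.mean(self.history))
            std = float(np.std(self.history)) + 1e-8  # avoid dividing by zero
            return (reward - mean) / std > self.z_threshold

        def filter(self, reward, validate=None):
            # Heuristic 2: if a validation action is available, accept an
            # abnormal reward only when validation confirms it; otherwise
            # (heuristic 1 alone) convert abnormal rewards into penalties.
            if self._is_abnormal(reward):
                if validate is not None and validate():
                    self.history.append(reward)
                    return reward
                return self.penalty
            self.history.append(reward)
            return reward

For example, after observing ordinary rewards near 1.0, a sudden reward of 50.0 would be converted into the penalty unless the validation callback confirms it:

    rng = np.random.default_rng(0)
    guard = RewardGamingGuard()
    for r in rng.normal(1.0, 0.1, size=100):          # ordinary rewards
        guard.filter(float(r))
    print(guard.filter(50.0))                          # suspicious spike -> -1.0
    print(guard.filter(50.0, validate=lambda: True))   # validated -> 50.0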


Notes

  1. The Kronecker delta \(\delta _{ij}\) is a function of two variables i, j that returns 1 if the variables are equal, and 0 otherwise.
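
Written out as a display equation (a standard restatement of the same definition):

\[
\delta_{ij} =
\begin{cases}
  1 & \text{if } i = j,\\
  0 & \text{otherwise.}
\end{cases}
\]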


Author information


Corresponding author

Correspondence to Louise A. Dennis.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Tsvarkaleva, M., Dennis, L.A. (2021). No Free Lunch: Overcoming Reward Gaming in AI Safety Gridworlds. In: Habli, I., Sujan, M., Gerasimou, S., Schoitsch, E., Bitsch, F. (eds.) Computer Safety, Reliability, and Security. SAFECOMP 2021 Workshops. SAFECOMP 2021. Lecture Notes in Computer Science, vol. 12853. Springer, Cham. https://doi.org/10.1007/978-3-030-83906-2_18


  • DOI: https://doi.org/10.1007/978-3-030-83906-2_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83905-5

  • Online ISBN: 978-3-030-83906-2

  • eBook Packages: Computer Science (R0)
