
No Free Lunch: Overcoming Reward Gaming in AI Safety Gridworlds

  • Conference paper
Computer Safety, Reliability, and Security. SAFECOMP 2021 Workshops (SAFECOMP 2021)

Abstract

We present two heuristics for tackling the problem of reward gaming by self-modification in reinforcement learning agents. Reward gaming occurs when the agent’s reward function is mis-specified and the agent can achieve a high reward by altering or otherwise fooling its sensors rather than by performing the desired actions. Our first heuristic tracks the rewards encountered in the environment and converts high rewards that fall outside the normal distribution into penalties. Our second heuristic relies on the existence of some validation action that the agent can take to check a reward. In this heuristic, on encountering an abnormally high reward, the agent performs a validation step before either accepting the reward as it is or converting it into a penalty. We evaluate the performance of these heuristics on variants of the tomato-watering problem from the AI Safety Gridworlds suite.

Work supported by EPSRC Grant EP/V026801/1 Trustworthy Autonomous Systems Verifiability Node.
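
The abstract describes the two heuristics only at a high level. The following is a minimal Python sketch of how they could be combined, assuming a simple z-score outlier test over the reward history and a caller-supplied validate() callback standing in for the validation action; the class name, threshold, and penalty value are hypothetical choices for illustration, not taken from the paper.

    import numpy as np

    class RewardGamingGuard:
        """Filter raw environment rewards. Illustrative sketch only: the
        z-score test, threshold, and penalty value are assumptions, not
        the paper's exact formulation."""

        def __init__(self, z_threshold=3.0, penalty=-1.0, min_samples=30):
            self.history = []              # rewards accepted so far
            self.z_threshold = z_threshold
            self.penalty = penalty
            self.min_samples = min_samples

        def _is_abnormal(self, reward):
            # Heuristic 1: flag rewards far above the empirical distribution.
            if len(self.history) < self.min_samples:
                return False
            mean = float(np.mean(self.history))
            std = float(np.std(self.history)) + 1e-8  # avoid dividing by zero
            return (reward - mean) / std > self.z_threshold

        def filter(self, reward, validate=None):
            # Heuristic 2: if a validation action is available, accept an
            # abnormal reward only when validation confirms it; otherwise
            # (heuristic 1 alone) convert abnormal rewards into penalties.
            if self._is_abnormal(reward):
                if validate is not None and validate():
                    self.history.append(reward)
                    return reward
                return self.penalty
            self.history.append(reward)
            return reward

For example, after observing ordinary rewards near 1.0, a sudden reward of 50.0 would be converted into the penalty unless the validation callback confirms it:

    rng = np.random.default_rng(0)
    guard = RewardGamingGuard()
    for r in rng.normal(1.0, 0.1, size=100):          # ordinary rewards
        guard.filter(float(r))
    print(guard.filter(50.0))                          # suspicious spike -> -1.0
    print(guard.filter(50.0, validate=lambda: True))   # validated -> 50.0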


Notes

  1. The Kronecker delta \(\delta _{ij}\) is a function of two variables i, j that returns 1 if the variables are equal, and 0 otherwise.
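
Written out as a display equation (a standard restatement of the same definition):

\[
\delta_{ij} =
\begin{cases}
  1 & \text{if } i = j,\\
  0 & \text{otherwise.}
\end{cases}
\]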


Author information


Corresponding author

Correspondence to Louise A. Dennis.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Tsvarkaleva, M., Dennis, L.A. (2021). No Free Lunch: Overcoming Reward Gaming in AI Safety Gridworlds. In: Habli, I., Sujan, M., Gerasimou, S., Schoitsch, E., Bitsch, F. (eds.) Computer Safety, Reliability, and Security. SAFECOMP 2021 Workshops. SAFECOMP 2021. Lecture Notes in Computer Science, vol. 12853. Springer, Cham. https://doi.org/10.1007/978-3-030-83906-2_18


  • DOI: https://doi.org/10.1007/978-3-030-83906-2_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83905-5

  • Online ISBN: 978-3-030-83906-2

  • eBook Packages: Computer Science (R0)
