An Evaluation Study of Intrinsic Motivation Techniques Applied to Reinforcement Learning over Hard Exploration Environments

  • Conference paper
Machine Learning and Knowledge Extraction (CD-MAKE 2022)

Abstract

In the last few years, research on reinforcement learning tasks formulated over environments with sparse rewards has been especially active. Among the numerous approaches proposed to deal with these hard exploration problems, intrinsic motivation mechanisms are arguably among the most studied alternatives to date. Advances reported in this area have tackled the exploration issue by proposing new algorithmic ideas for measuring novelty. However, most efforts in this direction have overlooked the influence of the design choices and parameter settings that were introduced alongside them to improve the effect of the generated intrinsic bonus, and have neglected to apply those choices to other intrinsic motivation techniques that may also benefit from them. Furthermore, some of these intrinsic methods are applied with different base reinforcement learning algorithms (e.g. PPO, IMPALA) and neural network architectures, which makes it hard to compare the reported results fairly and to assess the actual progress provided by each solution. The goal of this work is to stress this crucial matter in reinforcement learning over hard exploration environments, exposing the variability and susceptibility of avant-garde intrinsic motivation techniques to diverse design factors. Ultimately, the experiments reported herein underscore the importance of carefully selecting these design aspects in accordance with the exploration requirements of the environment and the task in question, under the same setup, so that fair comparisons can be guaranteed.

Notes

  1. Depending on the task under consideration, the novelty can be associated with the very last performed action and/or the next state visited by the agent in the trajectory.

  2. A rollout is denoted as \(\tau\), whereas the i-th rollout is denoted as \(\tau_i\).

  3. We note that the choice of the neural network architecture applies not just to the actor-critic modules, but also to IM approaches that hinge on neural computation.

  4. In this case, we take advantage of the 2D grid (discrete state space) and map each state directly to a dictionary entry when using COUNTS. Nevertheless, when facing more complex state spaces, pseudo-counts [15] can be applied as an alternative, as in [22].

  5. Even with different neural architectures and base RL algorithms, they successfully solve the same tasks in MiniGrid.

  6. We note that the number of parameters is slightly increased, but the networks also differ in the type of layers used (the two-headed network uses CNNs, while the independent actor-critic only uses dense layers).
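The dictionary-based counting mentioned in note 4 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the state is assumed to be encoded as a hashable tuple (here a hypothetical `(x, y, direction)` triple), and the bonus is assumed to be `scale / sqrt(N(s))`, a common choice in count-based exploration that the notes do not specify. The class name `CountBonus` and the `scale` parameter are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable


class CountBonus:
    """Visit counter keyed by a hashable state (assumed encoding)."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale  # hypothetical coefficient on the bonus
        self.counts: Dict[Hashable, int] = defaultdict(int)

    def __call__(self, state: Hashable) -> float:
        # Increment the visit count N(s), then return scale / sqrt(N(s)).
        # The 1/sqrt(N) form is an assumption, not taken from the paper.
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])


bonus = CountBonus()
first = bonus((1, 1, 0))   # first visit to this cell and orientation
second = bonus((1, 1, 0))  # second visit: the bonus shrinks
other = bonus((2, 1, 0))   # a different state is counted separately
```

Because the lookup key is the state itself, this only works when states are discrete and repeat exactly, as in a 2D grid; this is why the note points to pseudo-counts for more complex state spaces.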

References

  1. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  2. Baker, B., et al.: Emergent tool use from multi-agent autocurricula. arXiv:1909.07528 (2019)

  3. Holzinger, A.: Introduction to machine learning & knowledge extraction (MAKE). Mach. Learn. Knowl. Extr. 1(1), 1–20 (2019)

  4. Aubret, A., Matignon, L., Hassas, S.: A survey on intrinsic motivation in reinforcement learning. arXiv:1908.06976 (2019)

  5. Ho, J., Ermon, S.: Generative adversarial imitation learning. In: Advances in Neural Information Processing Systems, vol. 29 (2016)

  6. Finn, C., Levine, S., Abbeel, P.: Guided cost learning: deep inverse optimal control via policy optimization (2016)

  7. Grigorescu, D.: Curiosity, intrinsic motivation and the pleasure of knowledge. J. Educ. Sci. Psychol. 10(1) (2020)

  8. Raileanu, R., Rocktäschel, T.: RIDE: rewarding impact-driven exploration for procedurally-generated environments. arXiv:2002.12292 (2020)

  9. Badia, A.P., et al.: Never give up: learning directed exploration strategies. arXiv:2002.06038 (2020)

  10. Flet-Berliac, Y., Ferret, J., Pietquin, O., Preux, P., Geist, M.: Adversarially guided actor-critic. arXiv:2102.04376 (2021)

  11. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: International Conference on Machine Learning, pp. 2778–2787 (2017)

  12. Burda, Y., Edwards, H., Storkey, A., Klimov, O.: Exploration by random network distillation. arXiv:1810.12894 (2018)

  13. Andrychowicz, M., et al.: What matters in on-policy reinforcement learning? A large-scale empirical study. arXiv:2006.05990 (2020)

  14. Andrychowicz, M., et al.: What matters for on-policy deep actor-critic methods? A large-scale study. In: International Conference on Learning Representations (2020)

  15. Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems, vol. 29 (2016)

  16. Tang, H., et al.: #Exploration: a study of count-based exploration for deep reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 2753–2762 (2017)

  17. Machado, M.C., Bellemare, M.G., Bowling, M.: Count-based exploration with the successor representation. In: AAAI Conference on Artificial Intelligence, vol. 34, no. 4, pp. 5125–5133 (2020)

  18. Pîslar, M., Szepesvari, D., Ostrovski, G., Borsa, D., Schaul, T.: When should agents explore? arXiv:2108.11811 (2021)

  19. Zhang, T., et al.: NovelD: a simple yet effective exploration criterion. In: Advances in Neural Information Processing Systems, vol. 34 (2021)

  20. Bougie, N., Ichise, R.: Fast and slow curiosity for high-level exploration in reinforcement learning. Appl. Intell. 51(2), 1086–1107 (2020). https://doi.org/10.1007/s10489-020-01849-3

  21. Campero, A., Raileanu, R., Küttler, H., Tenenbaum, J.B., Rocktäschel, T., Grefenstette, E.: Learning with AMIGo: adversarially motivated intrinsic goals. arXiv:2006.12122 (2020)

  22. Taiga, A.A., Fedus, W., Machado, M.C., Courville, A., Bellemare, M.G.: On bonus-based exploration methods in the arcade learning environment. arXiv:2109.11052 (2021)

  23. Hessel, M., et al.: Rainbow: combining improvements in deep reinforcement learning. In: AAAI Conference on Artificial Intelligence (2018)

  24. Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., Efros, A.A.: Large-scale study of curiosity-driven learning. In: ICLR (2019)

  25. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2017)

  26. Orsini, M., et al.: What matters for adversarial imitation learning? In: Advances in Neural Information Processing Systems, vol. 34 (2021)

  27. Jing, X., et al.: Divide and explore: multi-agent separate exploration with shared intrinsic motivations (2022)

  28. Seurin, M., Strub, F., Preux, P., Pietquin, O.: Don’t do what doesn’t matter: intrinsic motivation with action usefulness. arXiv:2105.09992 (2021)

  29. Zha, D., Ma, W., Yuan, L., Hu, X., Liu, J.: Rank the episodes: a simple approach for exploration in procedurally-generated environments. arXiv:2101.08152 (2021)

  30. Espeholt, L., et al.: IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: International Conference on Machine Learning, pp. 1407–1416 (2018)

  31. Chevalier-Boisvert, M., Willems, L., Pal, S.: Minimalistic gridworld environment for OpenAI gym. http://github.com/maximecb/gym-minigrid (2018)

  32. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438 (2015)

Acknowledgments

A. Andres and J. Del Ser would like to thank the Basque Government for its funding support through the research group MATHMODE (T1294-19) and the BIKAINTEK PhD support program.

Corresponding author

Correspondence to Alain Andres.

Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Andres, A., Villar-Rodriguez, E., Del Ser, J. (2022). An Evaluation Study of Intrinsic Motivation Techniques Applied to Reinforcement Learning over Hard Exploration Environments. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2022. Lecture Notes in Computer Science, vol 13480. Springer, Cham. https://doi.org/10.1007/978-3-031-14463-9_13

  • DOI: https://doi.org/10.1007/978-3-031-14463-9_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14462-2

  • Online ISBN: 978-3-031-14463-9

  • eBook Packages: Computer Science, Computer Science (R0)