PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Authors Info & Claims
Published:03 June 2021Publication History
Skip Abstract Section

Abstract

Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) in high-dimensional action and state spaces. Recently, a promising line of work has combined exploration in the action space with exploration in the parameter space to get the best of both approaches. In this article, we propose a new iterative, closed-loop framework that combines an evolutionary algorithm (EA), which explores directly in the parameter space in a gradient-free manner, with an actor-critic reinforcement learning algorithm, the deep deterministic policy gradient (DDPG), which explores the action space in a gradient-based manner, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parameter-perturbation part) evolve in a guided manner by exploiting the gradient information provided by the DDPG, while the policy-gradient part (DDPG) serves only as a fine-tuning tool for the best individual in the EA population, improving sample efficiency. In particular, we propose a criterion for determining the number of training steps the DDPG requires, ensuring that useful gradient information can be extracted from the EA-generated samples and that the DDPG and EA parts work together in a balanced way within each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the previous RL-Actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
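To make the alternating structure described above concrete, the following is a minimal, self-contained sketch of one way such a hybrid loop can be organized. It is an illustration only, not the authors' implementation: the environment is replaced by a toy quadratic return, the DDPG fine-tuning step is replaced by an analytic-gradient surrogate (gradient_fine_tune), and all names, population sizes, and hyperparameters are made up for the example.

    # Toy sketch of a hybrid loop: an EA population explores in parameter space,
    # while a gradient-based step (standing in for DDPG fine-tuning of the best
    # individual) refines the elite. Everything here is illustrative, not PP-PG.
    import numpy as np

    rng = np.random.default_rng(0)
    DIM, POP, N_ELITE, GENERATIONS = 8, 16, 4, 50
    TARGET = rng.normal(size=DIM)  # hidden optimum of the toy task

    def episode_return(theta):
        # Stand-in for an environment rollout: higher is better.
        return -float(np.sum((theta - TARGET) ** 2))

    def gradient_fine_tune(theta, steps=10, lr=0.1):
        # Surrogate for the DDPG fine-tuning of the best individual: follow the
        # analytic gradient of the toy return (a real implementation would use
        # critic-provided gradients learned from the EA-generated samples).
        for _ in range(steps):
            theta = theta + lr * 2.0 * (TARGET - theta)
        return theta

    population = [rng.normal(size=DIM) for _ in range(POP)]
    for gen in range(GENERATIONS):
        # Parameter-perturbation part: rank the population by episodic return.
        fitness = np.array([episode_return(p) for p in population])
        elite_idx = np.argsort(fitness)[-N_ELITE:]
        elites = [population[i] for i in elite_idx]

        # Policy-gradient part: fine-tune only the best individual.
        best = gradient_fine_tune(elites[-1].copy())

        # Next generation: keep the remaining elites, add perturbed copies,
        # and inject the fine-tuned policy back into the population.
        children = [elites[rng.integers(N_ELITE)] + 0.1 * rng.normal(size=DIM)
                    for _ in range(POP - N_ELITE)]
        population = elites[:-1] + children + [best]

    print("best return after training:", max(episode_return(p) for p in population))

In the full method described in the abstract, the fine-tuned actor would instead be trained off-policy from samples generated by the EA population's rollouts, with the number of DDPG training steps chosen by the proposed criterion so that the two parts stay balanced within each generation.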

          Published in

          ACM Transactions on Intelligent Systems and Technology, Volume 12, Issue 3
          June 2021, 218 pages
          ISSN: 2157-6904
          EISSN: 2157-6912
          DOI: 10.1145/3460499

          Copyright © 2021 held by the owner/author(s). Publication rights licensed to ACM.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 3 June 2021
          • Accepted: 1 February 2021
          • Revised: 1 January 2021
          • Received: 1 July 2020
          Published in TIST, Volume 12, Issue 3

          Qualifiers

          • research-article
          • Refereed
