Abstract
Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) in high-dimensional action and state spaces. Recently, approaches that combine exploration in the action space with exploration in the parameter space have been proposed to obtain the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines an evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with the deep deterministic policy gradient (DDPG) actor-critic reinforcement learning algorithm, which explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parameter-perturbation part) evolve in a guided manner by exploiting the gradient information provided by DDPG, while the policy-gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population, which improves sample efficiency. In particular, we propose a criterion for determining the number of training steps DDPG requires, ensuring that useful gradient information can be extracted from the EA-generated samples and that the DDPG and EA parts work together in a balanced way in each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the previous RL actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related methods and offers a satisfactory trade-off between stability and sample efficiency.
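The closed-loop interplay described above can be sketched in toy form. Everything here is an illustrative stand-in, not the paper's implementation: the quadratic `fitness` replaces episodic return, `ddpg_finetune` replaces DDPG with analytic gradient ascent, the fixed `rl_steps` replaces the paper's criterion for choosing DDPG training steps, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy optimum standing in for the parameters of a high-return policy.
TARGET = np.array([1.0, -2.0, 0.5])

def fitness(theta):
    """Stand-in for episodic return: higher is better near TARGET."""
    return -float(np.sum((theta - TARGET) ** 2))

def ddpg_finetune(theta, steps, lr=0.1):
    """Stand-in for DDPG fine-tuning: gradient ascent on the toy fitness."""
    for _ in range(steps):
        grad = -2.0 * (theta - TARGET)  # analytic gradient of the toy fitness
        theta = theta + lr * grad
    return theta

def evolve(num_offspring, elite, sigma=0.1):
    """Guided evolution: mutate around the fine-tuned elite, so gradient
    information from the RL part flows back into the parameter-space search."""
    return [elite + sigma * rng.standard_normal(elite.shape)
            for _ in range(num_offspring)]

def pp_pg_loop(pop_size=10, generations=20, rl_steps=5, dim=3):
    population = [rng.standard_normal(dim) for _ in range(pop_size)]
    for _ in range(generations):
        best = max(population, key=fitness)   # EA selection of the elite
        best = ddpg_finetune(best, rl_steps)  # RL fine-tunes only the elite
        # Keep the fine-tuned elite and refill the population around it.
        population = [best] + evolve(pop_size - 1, best)
    return max(population, key=fitness)
```

The key structural point the sketch preserves is that gradient updates touch only the best individual per generation, while the rest of the population is regenerated by parameter-space perturbation around that fine-tuned elite, closing the loop between the two exploration modes.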
Index Terms
- PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning