Abstract
Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) in high-dimensional action and state spaces. Recently, approaches that combine exploration in the action space with exploration in the parameter space have been proposed to obtain the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines an evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with the deep deterministic policy gradient (DDPG) actor-critic reinforcement learning algorithm, which explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parameter-perturbation part) evolve in a guided manner by exploiting the gradient information provided by DDPG, while the policy-gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population, which improves sample efficiency. In particular, we propose a criterion for determining the number of training steps DDPG requires, ensuring that useful gradient information can be extracted from the EA-generated samples and that the DDPG and EA parts work together in a balanced way in each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the previous RL actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related methods and offers a satisfactory trade-off between stability and sample efficiency.
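The closed-loop interplay described above can be sketched in toy form. Everything here is an illustrative stand-in, not the paper's implementation: the quadratic `fitness` replaces episodic return, `ddpg_finetune` replaces DDPG with analytic gradient ascent, the fixed `rl_steps` replaces the paper's criterion for choosing DDPG training steps, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy optimum standing in for the parameters of a high-return policy.
TARGET = np.array([1.0, -2.0, 0.5])

def fitness(theta):
    """Stand-in for episodic return: higher is better near TARGET."""
    return -float(np.sum((theta - TARGET) ** 2))

def ddpg_finetune(theta, steps, lr=0.1):
    """Stand-in for DDPG fine-tuning: gradient ascent on the toy fitness."""
    for _ in range(steps):
        grad = -2.0 * (theta - TARGET)  # analytic gradient of the toy fitness
        theta = theta + lr * grad
    return theta

def evolve(num_offspring, elite, sigma=0.1):
    """Guided evolution: mutate around the fine-tuned elite, so gradient
    information from the RL part flows back into the parameter-space search."""
    return [elite + sigma * rng.standard_normal(elite.shape)
            for _ in range(num_offspring)]

def pp_pg_loop(pop_size=10, generations=20, rl_steps=5, dim=3):
    population = [rng.standard_normal(dim) for _ in range(pop_size)]
    for _ in range(generations):
        best = max(population, key=fitness)   # EA selection of the elite
        best = ddpg_finetune(best, rl_steps)  # RL fine-tunes only the elite
        # Keep the fine-tuned elite and refill the population around it.
        population = [best] + evolve(pop_size - 1, best)
    return max(population, key=fitness)
```

The key structural point the sketch preserves is that gradient updates touch only the best individual per generation, while the rest of the population is regenerated by parameter-space perturbation around that fine-tuned elite, closing the loop between the two exploration modes.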
Index Terms
- PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning