Exploration in policy optimization through multiple paths

Published in Autonomous Agents and Multi-Agent Systems (2021).

Abstract

Recent years have witnessed tremendous improvements in deep reinforcement learning. However, agents can still suffer from inefficient exploration, particularly with on-policy methods. Previous exploration methods either rely on complex structures to estimate the novelty of states, or introduce sensitive hyper-parameters that cause instability. We propose an efficient exploration method, Multi-Path Policy Optimization (MP-PO), which avoids high computation cost and ensures stability. MP-PO maintains an efficient mechanism that exploits a population of diverse policies to enable better exploration, especially in sparse-reward environments. We also provide a theoretical guarantee of stable performance. We build our scheme upon two widely adopted on-policy methods, the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms. We conduct extensive experiments on several MuJoCo tasks and their sparsified variants to fairly evaluate the proposed method. The results show that MP-PO significantly outperforms state-of-the-art exploration methods in both sample efficiency and final performance.
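The abstract describes the mechanism only at a high level: maintain a population of diverse policies and, per Note 1 below, optimize only the single picked policy with an on-policy update such as PPO. The following is a minimal, hypothetical Python sketch of such a multi-path loop, not the authors' implementation; the network architecture, the greedy pick-by-recent-return selection rule, and the return-to-go advantage estimate are assumptions made purely for illustration.

```python
import numpy as np
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Small diagonal-Gaussian policy for continuous control."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())


def collect_rollouts(env, policy, steps=2048, gamma=0.99):
    """Gather one on-policy batch with the picked policy (gymnasium-style env).
    Uses discounted return-to-go as a crude advantage for brevity; the paper
    itself builds on TRPO/PPO with their standard estimators."""
    obs_b, act_b, logp_b, rew_b, done_b = [], [], [], [], []
    obs, _ = env.reset()
    ep_returns, ep_ret = [], 0.0
    for _ in range(steps):
        o = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            d = policy.dist(o)
            a = d.sample()
            logp = d.log_prob(a).sum(-1)
        obs, r, term, trunc, _ = env.step(a.numpy())
        obs_b.append(o); act_b.append(a); logp_b.append(logp)
        rew_b.append(float(r)); done_b.append(term or trunc)
        ep_ret += float(r)
        if term or trunc:
            ep_returns.append(ep_ret)
            ep_ret = 0.0
            obs, _ = env.reset()
    # Discounted return-to-go, reset at episode boundaries.
    rtg, running = np.zeros(steps, dtype=np.float32), 0.0
    for t in reversed(range(steps)):
        running = 0.0 if done_b[t] else running
        running = rew_b[t] + gamma * running
        rtg[t] = running
    adv = torch.as_tensor((rtg - rtg.mean()) / (rtg.std() + 1e-8))
    mean_ret = float(np.mean(ep_returns)) if ep_returns else ep_ret
    return torch.stack(obs_b), torch.stack(act_b), adv, torch.stack(logp_b), mean_ret


def ppo_update(policy, optimizer, obs, act, adv, old_logp, clip=0.2, epochs=10):
    """PPO-style clipped surrogate update, applied to the picked policy only."""
    for _ in range(epochs):
        logp = policy.dist(obs).log_prob(act).sum(-1)
        ratio = torch.exp(logp - old_logp)
        clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv
        loss = -torch.min(ratio * adv, clipped).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def train(env, obs_dim, act_dim, pop_size=4, iters=100):
    """Outer loop: keep a population of policies, pick one per iteration
    (here, greedily by recent return -- an assumed selection rule), collect
    on-policy data with it, and update only that policy."""
    policies = [GaussianPolicy(obs_dim, act_dim) for _ in range(pop_size)]
    optims = [torch.optim.Adam(p.parameters(), lr=3e-4) for p in policies]
    recent = np.zeros(pop_size)  # in practice, warm-start every member first
    for _ in range(iters):
        k = int(np.argmax(recent))
        obs, act, adv, old_logp, ret = collect_rollouts(env, policies[k])
        ppo_update(policies[k], optims[k], obs, act, adv, old_logp)
        recent[k] = ret
    return policies
```

The actual MP-PO selection and update rules differ from the placeholders above, and the paper also instantiates the scheme on TRPO; this sketch only illustrates the overall population, pick, and optimize structure described in the abstract.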

Notes

  1. Note that MP-PO optimizes only the picked policy, rather than all policies in the population.


Author information

Corresponding author

Correspondence to Ling Pan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pan, L., Cai, Q. & Huang, L. Exploration in policy optimization through multiple paths. Auton Agent Multi-Agent Syst 35, 33 (2021). https://doi.org/10.1007/s10458-021-09518-6
