Exploration in policy optimization through multiple paths

Published in Autonomous Agents and Multi-Agent Systems (2021).

Abstract

Recent years have witnessed tremendous improvements in deep reinforcement learning. However, agents can still suffer from inefficient exploration, particularly with on-policy methods. Previous exploration methods either rely on complex structures to estimate the novelty of states, or introduce sensitive hyper-parameters that cause instability. We propose an efficient exploration method, Multi-Path Policy Optimization (MP-PO), which avoids high computation cost and ensures stability. MP-PO maintains an efficient mechanism that exploits a population of diverse policies to enable better exploration, especially in sparse-reward environments. We also provide a theoretical guarantee of stable performance. We build our scheme upon two widely adopted on-policy methods, the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms. We conduct extensive experiments on several MuJoCo tasks and their sparsified variants to fairly evaluate the proposed method. The results show that MP-PO significantly outperforms state-of-the-art exploration methods in both sample efficiency and final performance.
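The abstract describes the mechanism only at a high level: maintain a population of diverse policies and, per Note 1 below, optimize only the single picked policy with an on-policy update such as PPO. The following is a minimal, hypothetical Python sketch of such a multi-path loop, not the authors' implementation; the network architecture, the greedy pick-by-recent-return selection rule, and the return-to-go advantage estimate are assumptions made purely for illustration.

```python
import numpy as np
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Small diagonal-Gaussian policy for continuous control."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())


def collect_rollouts(env, policy, steps=2048, gamma=0.99):
    """Gather one on-policy batch with the picked policy (gymnasium-style env).
    Uses discounted return-to-go as a crude advantage for brevity; the paper
    itself builds on TRPO/PPO with their standard estimators."""
    obs_b, act_b, logp_b, rew_b, done_b = [], [], [], [], []
    obs, _ = env.reset()
    ep_returns, ep_ret = [], 0.0
    for _ in range(steps):
        o = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            d = policy.dist(o)
            a = d.sample()
            logp = d.log_prob(a).sum(-1)
        obs, r, term, trunc, _ = env.step(a.numpy())
        obs_b.append(o); act_b.append(a); logp_b.append(logp)
        rew_b.append(float(r)); done_b.append(term or trunc)
        ep_ret += float(r)
        if term or trunc:
            ep_returns.append(ep_ret)
            ep_ret = 0.0
            obs, _ = env.reset()
    # Discounted return-to-go, reset at episode boundaries.
    rtg, running = np.zeros(steps, dtype=np.float32), 0.0
    for t in reversed(range(steps)):
        running = 0.0 if done_b[t] else running
        running = rew_b[t] + gamma * running
        rtg[t] = running
    adv = torch.as_tensor((rtg - rtg.mean()) / (rtg.std() + 1e-8))
    mean_ret = float(np.mean(ep_returns)) if ep_returns else ep_ret
    return torch.stack(obs_b), torch.stack(act_b), adv, torch.stack(logp_b), mean_ret


def ppo_update(policy, optimizer, obs, act, adv, old_logp, clip=0.2, epochs=10):
    """PPO-style clipped surrogate update, applied to the picked policy only."""
    for _ in range(epochs):
        logp = policy.dist(obs).log_prob(act).sum(-1)
        ratio = torch.exp(logp - old_logp)
        clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv
        loss = -torch.min(ratio * adv, clipped).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def train(env, obs_dim, act_dim, pop_size=4, iters=100):
    """Outer loop: keep a population of policies, pick one per iteration
    (here, greedily by recent return -- an assumed selection rule), collect
    on-policy data with it, and update only that policy."""
    policies = [GaussianPolicy(obs_dim, act_dim) for _ in range(pop_size)]
    optims = [torch.optim.Adam(p.parameters(), lr=3e-4) for p in policies]
    recent = np.zeros(pop_size)  # in practice, warm-start every member first
    for _ in range(iters):
        k = int(np.argmax(recent))
        obs, act, adv, old_logp, ret = collect_rollouts(env, policies[k])
        ppo_update(policies[k], optims[k], obs, act, adv, old_logp)
        recent[k] = ret
    return policies
```

The actual MP-PO selection and update rules differ from the placeholders above, and the paper also instantiates the scheme on TRPO; this sketch only illustrates the overall population, pick, and optimize structure described in the abstract.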

Notes

  1. Note that MP-PO optimizes only the picked policy, rather than all policies in the population.


Author information

Corresponding author

Correspondence to Ling Pan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pan, L., Cai, Q. & Huang, L. Exploration in policy optimization through multiple paths. Auton Agent Multi-Agent Syst 35, 33 (2021). https://doi.org/10.1007/s10458-021-09518-6
