Abstract
Gradient-based reinforcement learning has attracted increasing attention. As one of the most important methods, Deep Deterministic Policy Gradient (DDPG) has achieved remarkable success and has been applied to many challenging continuous-control scenarios. However, it still suffers from unstable training on off-policy data and premature convergence to a local optimum. To address these problems, in this paper we combine Boltzmann exploration with the deterministic policy gradient. The candidate policy is represented by a Boltzmann distribution and updated by Kullback-Leibler (KL) projection. By introducing the Boltzmann policy, exploration is encouraged, which effectively prevents the policy from collapsing prematurely. Experimental results show that the proposed algorithm outperforms DDPG on most tasks in the MuJoCo continuous-control benchmark.
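The Boltzmann policy referred to in the abstract assigns action probabilities via a softmax over value estimates, with a temperature parameter controlling the exploration-exploitation trade-off. Below is a minimal illustrative sketch of this idea for a discrete action set; the function names and interface are ours, not from the paper, which applies the construction to continuous control.

```python
import numpy as np

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values: higher temperature -> closer to uniform (more exploration),
    lower temperature -> concentrates on the greedy action (more exploitation)."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                     # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def boltzmann_sample(q_values, temperature=1.0, rng=None):
    """Sample an action index from the Boltzmann distribution over Q-values."""
    rng = np.random.default_rng() if rng is None else rng
    p = boltzmann_probs(q_values, temperature)
    return int(rng.choice(len(p), p=p))
```

Annealing the temperature toward zero recovers a near-deterministic (greedy) policy, which is the sense in which a Boltzmann policy can interpolate toward a deterministic one.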
Acknowledgement
This work is partially supported by the National Natural Science Foundation of China under grant Nos. U19B2044 and 61836011.
© 2020 Springer Nature Switzerland AG
Cite this paper
Wang, S., Pu, Y., Yang, S., Yao, X., Li, B. (2020). Boltzmann Exploration for Deterministic Policy Optimization. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science(), vol 12533. Springer, Cham. https://doi.org/10.1007/978-3-030-63833-7_18
Print ISBN: 978-3-030-63832-0
Online ISBN: 978-3-030-63833-7