Abstract
Gradient-based reinforcement learning has attracted increasing attention. As one of the most important methods, Deep Deterministic Policy Gradient (DDPG) has achieved remarkable success and has been applied to many challenging continuous-control scenarios. However, it still suffers from unstable training on off-policy data and premature convergence to a local optimum. To address these problems, in this paper we combine Boltzmann exploration with the deterministic policy gradient. The candidate policy is represented by a Boltzmann distribution and updated by Kullback-Leibler (KL) projection. By introducing the Boltzmann policy, exploration is encouraged, which effectively prevents the policy from collapsing prematurely. Experimental results show that the proposed algorithm outperforms DDPG on most tasks in the MuJoCo continuous-control benchmark.
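The Boltzmann policy referred to in the abstract assigns action probabilities via a softmax over value estimates, with a temperature parameter controlling the exploration-exploitation trade-off. Below is a minimal illustrative sketch of this idea for a discrete action set; the function names and interface are ours, not from the paper, which applies the construction to continuous control.

```python
import numpy as np

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values: higher temperature -> closer to uniform (more exploration),
    lower temperature -> concentrates on the greedy action (more exploitation)."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                     # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def boltzmann_sample(q_values, temperature=1.0, rng=None):
    """Sample an action index from the Boltzmann distribution over Q-values."""
    rng = np.random.default_rng() if rng is None else rng
    p = boltzmann_probs(q_values, temperature)
    return int(rng.choice(len(p), p=p))
```

Annealing the temperature toward zero recovers a near-deterministic (greedy) policy, which is the sense in which a Boltzmann policy can interpolate toward a deterministic one.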
Acknowledgement
This work is partially supported by the National Natural Science Foundation of China under grant Nos. U19B2044 and 61836011.
© 2020 Springer Nature Switzerland AG
Cite this paper
Wang, S., Pu, Y., Yang, S., Yao, X., Li, B. (2020). Boltzmann Exploration for Deterministic Policy Optimization. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science(), vol 12533. Springer, Cham. https://doi.org/10.1007/978-3-030-63833-7_18
Print ISBN: 978-3-030-63832-0
Online ISBN: 978-3-030-63833-7