
Induced Exploration on Policy Gradients by Increasing Actor Entropy Using Advantage Target Regions

Conference paper in Neural Information Processing (ICONIP 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11302)


Abstract

We propose a policy gradient actor-critic algorithm with a built-in exploration mechanism. Unlike existing policy gradient methods that run several actors asynchronously for exploration, our algorithm uses only a single actor that can robustly search for the optimal path. Our algorithm uses modified advantage targets that increase entropy in the actor's predicted advantage probability distribution. We do this with a two-step process. The first step modifies advantage targets from points to regions by sampling particles in neighborhoods along the direction of the critic value function; this increases entropy in the actor's estimates and explicitly induces the actor to perform actions outside of past policies for exploration. The second step controls the variance increase due to sampling: shortest-path dynamic programming selects, from the regions, the particles with minimum inter-state movement. We present an analysis of our method, compare it with another exploration-based policy gradient algorithm, A3C, and report faster convergence on some VizDoom and Atari benchmarks given the same number of backpropagation steps on a deep network function approximator.
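The two-step process described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: the uniform neighborhood sampling, the scalar critic direction, and the absolute-difference movement cost are assumptions made only for illustration, and every function and variable name (sample_advantage_region, select_min_movement_path, and so on) is hypothetical.

# Minimal sketch (not the authors' code) of the two-step idea in the abstract.
# All names, shapes, and cost definitions below are illustrative assumptions.
import numpy as np

def sample_advantage_region(point_advantage, critic_direction, num_particles=8,
                            radius=0.1, rng=None):
    """Step 1: widen a point advantage target into a region by sampling
    particles in a neighborhood along the (assumed scalar) critic direction."""
    rng = np.random.default_rng() if rng is None else rng
    offsets = rng.uniform(-radius, radius, size=num_particles)
    return point_advantage + offsets * np.sign(critic_direction)

def select_min_movement_path(regions):
    """Step 2: shortest-path dynamic programming over consecutive regions,
    choosing one particle per step so that total inter-step movement
    (here: absolute change between chosen particles) is minimized."""
    T = len(regions)
    K = len(regions[0])
    cost = np.zeros(K)                    # best cost ending at each particle of the current step
    back = np.zeros((T, K), dtype=int)    # backpointers for path recovery
    for t in range(1, T):
        # movement cost between every particle at step t-1 and every particle at step t
        move = np.abs(regions[t][None, :] - regions[t - 1][:, None])
        total = cost[:, None] + move
        back[t] = np.argmin(total, axis=0)
        cost = np.min(total, axis=0)
    # Recover the minimum-movement particle sequence.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return np.array([regions[t][k] for t, k in enumerate(path)])

# Usage: turn point advantage targets into regions, then pick a low-movement selection.
point_targets = np.array([0.5, -0.2, 0.8, 0.1])
regions = [sample_advantage_region(a, critic_direction=1.0) for a in point_targets]
smoothed_targets = select_min_movement_path(regions)

In this sketch, sample_advantage_region widens each point target into a small region, which raises the entropy of the targets the actor is trained toward, and select_min_movement_path runs a shortest-path dynamic program over consecutive regions so that the selected particles change as little as possible from step to step, limiting the extra variance introduced by the sampling.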


References

  1. Anschel, O., Baram, N., Shimkin, N.: Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In: International Conference on Machine Learning, pp. 176–185 (2017)


  2. Gu, S., Lillicrap, T., Turner, R.E., Ghahramani, Z., Schölkopf, B., Levine, S.: Interpolated policy gradient: merging on-policy and off-policy gradient estimation for deep reinforcement learning. In: Advances in Neural Information Processing Systems 30


  3. Hessel, M., et al.: Rainbow: combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298 (2017)

  4. Kempka, M., Wydmuch, M., Runc, G., Toczek, J., Jaśkowski, W.: ViZDoom: a doom-based AI research platform for visual reinforcement learning. In: 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. IEEE (2016)


  5. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)


  6. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)


  7. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  8. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)


  9. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: Proceedings of the International Conference on Learning Representations (ICLR) (2016)


  10. Schulze, C., Schulze, M.: ViZDoom: DRQN with prioritized experience replay, double-Q learning, & snapshot ensembling. arXiv preprint arXiv:1801.01000 (2018)

  11. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)


  12. Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., Freitas, N.: Dueling network architectures for deep reinforcement learning. In: Proceedings of The 33rd International Conference on Machine Learning, pp. 1995–2003 (2016)



Author information

Correspondence to Prospero C. Naval Jr.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Labao, A.B., Raquel, C.R., Naval, P.C. (2018). Induced Exploration on Policy Gradients by Increasing Actor Entropy Using Advantage Target Regions. In: Cheng, L., Leung, A., Ozawa, S. (eds) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol. 11302. Springer, Cham. https://doi.org/10.1007/978-3-030-04179-3_58


  • DOI: https://doi.org/10.1007/978-3-030-04179-3_58


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04178-6

  • Online ISBN: 978-3-030-04179-3

  • eBook Packages: Computer Science (R0)
