
Induced Exploration on Policy Gradients by Increasing Actor Entropy Using Advantage Target Regions

Conference paper in Neural Information Processing (ICONIP 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11302)


Abstract

We propose a policy gradient actor-critic algorithm with a built-in exploration mechanism. Unlike existing policy gradient methods that run several actors asynchronously for exploration, our algorithm uses only a single actor that can robustly search for the optimal path. Our algorithm uses modified advantage targets that increase entropy in the actor's predicted advantage probability distribution. We do this with a two-step process. The first step modifies advantage targets from points to regions by sampling particles in neighborhoods along the direction of the critic value function; this increases entropy in the actor's estimates and explicitly induces the actor to perform actions outside of past policies for exploration. The second step controls the variance increase due to sampling: shortest-path dynamic programming selects, from the regions, the particles with minimum inter-state movement. We present an analysis of our method, compare it with another exploration-based policy gradient algorithm, A3C, and report faster convergence on some VizDoom and Atari benchmarks given the same number of backpropagation steps on a deep network function approximator.
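The two-step process described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: the uniform neighborhood sampling, the scalar critic direction, and the absolute-difference movement cost are assumptions made only for illustration, and every function and variable name (sample_advantage_region, select_min_movement_path, and so on) is hypothetical.

# Minimal sketch (not the authors' code) of the two-step idea in the abstract.
# All names, shapes, and cost definitions below are illustrative assumptions.
import numpy as np

def sample_advantage_region(point_advantage, critic_direction, num_particles=8,
                            radius=0.1, rng=None):
    """Step 1: widen a point advantage target into a region by sampling
    particles in a neighborhood along the (assumed scalar) critic direction."""
    rng = np.random.default_rng() if rng is None else rng
    offsets = rng.uniform(-radius, radius, size=num_particles)
    return point_advantage + offsets * np.sign(critic_direction)

def select_min_movement_path(regions):
    """Step 2: shortest-path dynamic programming over consecutive regions,
    choosing one particle per step so that total inter-step movement
    (here: absolute change between chosen particles) is minimized."""
    T = len(regions)
    K = len(regions[0])
    cost = np.zeros(K)                    # best cost ending at each particle of the current step
    back = np.zeros((T, K), dtype=int)    # backpointers for path recovery
    for t in range(1, T):
        # movement cost between every particle at step t-1 and every particle at step t
        move = np.abs(regions[t][None, :] - regions[t - 1][:, None])
        total = cost[:, None] + move
        back[t] = np.argmin(total, axis=0)
        cost = np.min(total, axis=0)
    # Recover the minimum-movement particle sequence.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return np.array([regions[t][k] for t, k in enumerate(path)])

# Usage: turn point advantage targets into regions, then pick a low-movement selection.
point_targets = np.array([0.5, -0.2, 0.8, 0.1])
regions = [sample_advantage_region(a, critic_direction=1.0) for a in point_targets]
smoothed_targets = select_min_movement_path(regions)

In this sketch, sample_advantage_region widens each point target into a small region, which raises the entropy of the targets the actor is trained toward, and select_min_movement_path runs a shortest-path dynamic program over consecutive regions so that the selected particles change as little as possible from step to step, limiting the extra variance introduced by the sampling.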


References

  1. Anschel, O., Baram, N., Shimkin, N.: Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In: International Conference on Machine Learning, pp. 176–185 (2017)


  2. Gu, S., Lillicrap, T., Turner, R.E., Ghahramani, Z., Schölkopf, B., Levine, S.: Interpolated policy gradient: merging on-policy and off-policy gradient estimation for deep reinforcement learning. In: Advances in Neural Information Processing Systems 30


  3. Hessel, M., et al.: Rainbow: combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298 (2017)

  4. Kempka, M., Wydmuch, M., Runc, G., Toczek, J., Jaśkowski, W.: ViZDoom: a doom-based AI research platform for visual reinforcement learning. In: 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. IEEE (2016)


  5. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)


  6. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)


  7. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  8. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)


  9. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: Proceedings of the International Conference on Learning Representations (ICLR) (2016)


  10. Schulze, C., Schulze, M.: ViZDoom: DRQN with prioritized experience replay, double-Q learning, & snapshot ensembling. arXiv preprint arXiv:1801.01000 (2018)

  11. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)


  12. Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., Freitas, N.: Dueling network architectures for deep reinforcement learning. In: Proceedings of The 33rd International Conference on Machine Learning, pp. 1995–2003 (2016)



Author information

Correspondence to Prospero C. Naval Jr.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Labao, A.B., Raquel, C.R., Naval, P.C. (2018). Induced Exploration on Policy Gradients by Increasing Actor Entropy Using Advantage Target Regions. In: Cheng, L., Leung, A., Ozawa, S. (eds) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol. 11302. Springer, Cham. https://doi.org/10.1007/978-3-030-04179-3_58


  • DOI: https://doi.org/10.1007/978-3-030-04179-3_58


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04178-6

  • Online ISBN: 978-3-030-04179-3

  • eBook Packages: Computer Science (R0)
