
Actor-Critic Algorithms for Variance Minimization

Abstract

We consider the framework of a set of recently proposed two-timescale actor-critic algorithms for reinforcement learning (RL) under the long-run average-reward criterion with linear, feature-based value-function approximation. The actor and critic updates are based on stochastic policy-gradient ascent and temporal-difference algorithms, respectively. Unlike conventional RL algorithms, policy-gradient-based algorithms guarantee convergence even with value-function approximation, but they suffer from high variance of the policy-gradient estimator. To minimize this variance for an existing algorithm, we derive a novel stochastic-gradient-based critic update. We propose a novel baseline structure for variance minimization of an estimator and derive an optimal baseline that makes the covariance matrix the zero matrix, the best achievable. We derive a novel actor update based on the optimal baseline deduced for an existing algorithm. We derive another novel actor update using the optimal baseline for an unbiased policy-gradient estimator, which we deduce from the Policy-Gradient Theorem with Function Approximation. We also obtain a novel variance-minimization-based interpretation of an existing algorithm. Computational results demonstrate that the proposed algorithms outperform the state of the art on Garnet problems.
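For orientation, the following is a minimal illustrative sketch of the general setting the abstract describes: a two-timescale average-reward actor-critic loop with a linear temporal-difference critic and a value-function baseline. It is not the paper's algorithms (in particular, it does not implement the optimal baselines or the stochastic-gradient critic update derived there); the environment interface `env.reset()`/`env.step()`, the feature map `phi`, and the step sizes are hypothetical placeholders.

```python
# Illustrative two-timescale actor-critic sketch (not the paper's derivations).
# Assumptions: a Garnet-like MDP exposed via env.reset() -> state and
# env.step(action) -> (next_state, reward); a feature map phi(state) -> vector;
# a softmax policy with per-(state, action) parameters; a linear critic
# V(s) = w . phi(s); and the state-value estimate used as the baseline.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic(env, phi, n_states, n_actions, n_steps=100_000,
                 alpha=0.01,   # slow actor step size
                 beta=0.1,     # fast critic step size
                 xi=0.01,      # step size for the average-reward estimate
                 seed=0):
    rng = np.random.default_rng(seed)
    d = phi(0).shape[0]
    theta = np.zeros((n_states, n_actions))  # policy parameters
    w = np.zeros(d)                          # critic weights
    rho = 0.0                                # average-reward estimate
    s = env.reset()
    for _ in range(n_steps):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        s_next, r = env.step(a)
        # Average-reward TD error; since V(s) is subtracted, delta is
        # effectively a baseline-subtracted (advantage-like) signal.
        delta = r - rho + w @ phi(s_next) - w @ phi(s)
        rho += xi * (r - rho)
        # Critic: TD(0) update of the linear value function (fast timescale).
        w += beta * delta * phi(s)
        # Actor: stochastic policy-gradient ascent (slow timescale);
        # grad of log softmax w.r.t. theta[s] is (one-hot(a) - probs).
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha * delta * grad_log_pi
        s = s_next
    return theta, w, rho
```

The two-timescale structure is reflected in the critic step size beta being larger (faster) than the actor step size alpha, so the critic effectively tracks the value function of the slowly changing policy; the paper's contributions concern sharper choices of the critic update and the baseline than the plain V(s) used in this sketch.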


References

1. D.P. Bertsekas and J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, MA, 1996.
2. L. Baird, "Residual algorithms: reinforcement learning with function approximation", Proc. 12th International Conf. on Machine Learning, 1995, pp. 30-37.
3. R. Sutton, D. McAllester, S. Singh and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation", Adv. in Neural Info. Proc. Systems, 2000, 12:1057-1063.
4. P. Marbach and J.N. Tsitsiklis, "Simulation-based optimization of Markov reward processes", IEEE Trans. on Automatic Control, 2001, 46:191-209.
5. J. Baxter and P.L. Bartlett, "Infinite-horizon policy-gradient estimation", Journal of Artificial Intelligence Research, 2001, 15:319-350.
6. E. Greensmith, P.L. Bartlett and J. Baxter, "Variance reduction techniques for gradient estimates in reinforcement learning", Journal of Machine Learning Research, 2004, 5:1471-1530.
7. S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh and M. Lee, "Natural-gradient actor-critic algorithms", Automatica, 2007 (to appear, http://drona.csa.iisc.ernet.in/~shalabh/pubs/ac_bhatnagar.pdf).
8. S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh and M. Lee, "Incremental natural actor-critic algorithms", Proc. 21st Annual Conference on Neural Information Processing Systems, 2007.
9. S. Kakade, "A natural policy gradient", Adv. in Neural Info. Proc. Systems, 2002, 14.
10. J. Peters, S. Vijayakumar and S. Schaal, "Natural actor-critic", Proc. 16th European Conference on Machine Learning, 2005, pp. 280-291.
11. S. Amari, K. Kurata and H. Nagaoka, "Information geometry of Boltzmann machines", IEEE Trans. on Neural Networks, 1992, 3(2):260-271.
12. S. Amari, "Natural gradient works efficiently in learning", Neural Computation, 1998, 10(2):251-276.
13. V.S. Borkar, "Stochastic approximation with two timescales", Systems and Control Letters, 1997, 29:291-294.
14. V.R. Konda and J.N. Tsitsiklis, "On actor-critic algorithms", SIAM Journal on Control and Optimization, 2003, 42(4):1143-1166.

Author information

Correspondence to Yogesh P. Awate.

Copyright information

© 2010 Springer Science+Business Media B.V.

About this paper

Cite this paper

Awate, Y.P. (2010). Actor-Critic Algorithms for Variance Minimization. In: Iskander, M., Kapila, V., Karim, M. (eds) Technological Developments in Education and Automation. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3656-8_82
