
Actor-Critic Algorithms for Variance Minimization

Abstract

We consider the framework of a set of recently proposed two-timescale actor-critic algorithms for reinforcement learning (RL) under the long-run average-reward criterion with linear, feature-based value-function approximation. The actor and critic updates are based on stochastic policy-gradient ascent and temporal-difference algorithms, respectively. Unlike conventional RL algorithms, policy-gradient-based algorithms guarantee convergence even with value-function approximation, but they suffer from high variance of the policy-gradient estimator. To minimize this variance for an existing algorithm, we derive a novel stochastic-gradient-based critic update. We propose a novel baseline structure for variance minimization of an estimator and derive an optimal baseline that makes the covariance matrix the zero matrix, the best achievable. We derive a novel actor update based on the optimal baseline deduced for an existing algorithm. We derive another novel actor update using the optimal baseline for an unbiased policy-gradient estimator, which we deduce from the Policy-Gradient Theorem with Function Approximation. We also obtain a novel variance-minimization-based interpretation of an existing algorithm. Computational results demonstrate that the proposed algorithms outperform the state of the art on Garnet problems.
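For orientation, the following is a minimal illustrative sketch of the general setting the abstract describes: a two-timescale average-reward actor-critic loop with a linear temporal-difference critic and a value-function baseline. It is not the paper's algorithms (in particular, it does not implement the optimal baselines or the stochastic-gradient critic update derived there); the environment interface `env.reset()`/`env.step()`, the feature map `phi`, and the step sizes are hypothetical placeholders.

```python
# Illustrative two-timescale actor-critic sketch (not the paper's derivations).
# Assumptions: a Garnet-like MDP exposed via env.reset() -> state and
# env.step(action) -> (next_state, reward); a feature map phi(state) -> vector;
# a softmax policy with per-(state, action) parameters; a linear critic
# V(s) = w . phi(s); and the state-value estimate used as the baseline.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic(env, phi, n_states, n_actions, n_steps=100_000,
                 alpha=0.01,   # slow actor step size
                 beta=0.1,     # fast critic step size
                 xi=0.01,      # step size for the average-reward estimate
                 seed=0):
    rng = np.random.default_rng(seed)
    d = phi(0).shape[0]
    theta = np.zeros((n_states, n_actions))  # policy parameters
    w = np.zeros(d)                          # critic weights
    rho = 0.0                                # average-reward estimate
    s = env.reset()
    for _ in range(n_steps):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        s_next, r = env.step(a)
        # Average-reward TD error; since V(s) is subtracted, delta is
        # effectively a baseline-subtracted (advantage-like) signal.
        delta = r - rho + w @ phi(s_next) - w @ phi(s)
        rho += xi * (r - rho)
        # Critic: TD(0) update of the linear value function (fast timescale).
        w += beta * delta * phi(s)
        # Actor: stochastic policy-gradient ascent (slow timescale);
        # grad of log softmax w.r.t. theta[s] is (one-hot(a) - probs).
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha * delta * grad_log_pi
        s = s_next
    return theta, w, rho
```

The two-timescale structure is reflected in the critic step size beta being larger (faster) than the actor step size alpha, so the critic effectively tracks the value function of the slowly changing policy; the paper's contributions concern sharper choices of the critic update and the baseline than the plain V(s) used in this sketch.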


References

1. D.P. Bertsekas and J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, MA, 1996.
2. L. Baird, "Residual algorithms: reinforcement learning with function approximation", Proc. 12th International Conf. on Machine Learning, 1995, pp. 30-37.
3. R. Sutton, D. McAllester, S. Singh and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation", Adv. in Neural Info. Proc. Systems, 2000, 12:1057-1063.
4. P. Marbach and J.N. Tsitsiklis, "Simulation-based optimization of Markov reward processes", IEEE Trans. on Automatic Control, 2001, 46:191-209.
5. J. Baxter and P.L. Bartlett, "Infinite-horizon policy-gradient estimation", Journal of Artificial Intelligence Research, 2001, 15:319-350.
6. E. Greensmith, P.L. Bartlett and J. Baxter, "Variance reduction techniques for gradient estimates in reinforcement learning", Journal of Machine Learning Research, 2004, 5:1471-1530.
7. S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh and M. Lee, "Natural-gradient actor-critic algorithms", Automatica, 2007 (to appear, http://drona.csa.iisc.ernet.in/~shalabh/pubs/ac_bhatnagar.pdf).
8. S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh and M. Lee, "Incremental natural actor-critic algorithms", Proc. 21st Annual Conference on Neural Information Processing Systems, 2007.
9. S. Kakade, "A natural policy gradient", Adv. in Neural Info. Proc. Systems, 2002, 14.
10. J. Peters, S. Vijayakumar and S. Schaal, "Natural actor-critic", Proc. 16th European Conference on Machine Learning, 2005, pp. 280-291.
11. S. Amari, K. Kurata and H. Nagaoka, "Information geometry of Boltzmann machines", IEEE Trans. on Neural Networks, 1992, 3(2):260-271.
12. S. Amari, "Natural gradient works efficiently in learning", Neural Computation, 1998, 10(2):251-276.
13. V.S. Borkar, "Stochastic approximation with two timescales", Systems and Control Letters, 1997, 29:291-294.
14. V.R. Konda and J.N. Tsitsiklis, "On actor-critic algorithms", SIAM Journal on Control and Optimization, 2003, 42(4):1143-1166.

Author information

Correspondence to Yogesh P. Awate.

Copyright information

© 2010 Springer Science+Business Media B.V.

About this paper

Cite this paper

Awate, Y.P. (2010). Actor-Critic Algorithms for Variance Minimization. In: Iskander, M., Kapila, V., Karim, M. (eds) Technological Developments in Education and Automation. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3656-8_82
