Abstract
We develop an online actor–critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework, in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample-path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multi-stage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance in this setting and converges to a feasible point.
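To make the constrained formulation referenced in the abstract concrete, the following is a schematic sketch of the standard Lagrangian relaxation for such problems. The notation J(θ), G_i(θ), α_i, λ_i and the step sizes a(n), b(n) are illustrative assumptions and not necessarily the paper's own symbols; the sketch shows the general technique, not the authors' exact update rules.

```latex
% Schematic Lagrangian relaxation of a constrained average-cost MDP.
% Notation is illustrative and may differ from the paper's:
%   J(\theta)    -- long-run average cost under the parameterized policy \pi_\theta
%   G_i(\theta)  -- long-run average of the i-th constraint sample-path function
%   \alpha_i     -- prescribed upper bound for the i-th constraint
%   \lambda_i    -- Lagrange multiplier for the i-th constraint
\[
  L(\theta,\lambda) \;=\; J(\theta) \;+\; \sum_{i=1}^{N} \lambda_i \bigl( G_i(\theta) - \alpha_i \bigr),
  \qquad \lambda_i \ge 0 .
\]
% A locally optimal feasible point corresponds to a local saddle point of L:
% gradient descent in \theta (the actor, driven by critic estimates) coupled with
% projected gradient ascent in \lambda, with diminishing step sizes a(n), b(n)
% and [\,\cdot\,]_+ denoting projection onto [0,\infty):
\[
  \theta_{n+1} \;=\; \theta_n \;-\; a(n)\,\widehat{\nabla_\theta L}(\theta_n,\lambda_n),
  \qquad
  \lambda_{i,n+1} \;=\; \Bigl[\,\lambda_{i,n} \;+\; b(n)\,\bigl(\widehat{G}_i(\theta_n) - \alpha_i\bigr)\Bigr]_{+}.
\]
```

In multi-timescale schemes of this kind the multiplier is usually updated on the slowest timescale, so that the policy update effectively sees a fixed λ; this separation is what typically enables an almost sure convergence analysis via coupled ODEs.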
Communicated by Mark J. Balas.
Cite this article
Bhatnagar, S., Lakshmanan, K. An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes. J Optim Theory Appl 153, 688–708 (2012). https://doi.org/10.1007/s10957-012-9989-5