
An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes

Published in: Journal of Optimization Theory and Applications

Abstract

We develop an online actor–critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework, in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multi-stage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance in this setting and converges to a feasible point.
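As a rough sketch of the standard constrained long-run average cost formulation that the abstract refers to (the notation below is illustrative and not necessarily the authors' own: θ denotes policy parameters, c the single-stage cost, g_i the constraint sample path functions, and λ the Lagrange multipliers), the problem can be written as

\[
  \min_{\theta}\; J(\theta) \;=\; \lim_{T\to\infty} \frac{1}{T}\,
  \mathbb{E}_{\theta}\!\left[\sum_{t=0}^{T-1} c(s_t,a_t)\right]
  \quad\text{s.t.}\quad
  G_i(\theta) \;=\; \lim_{T\to\infty} \frac{1}{T}\,
  \mathbb{E}_{\theta}\!\left[\sum_{t=0}^{T-1} g_i(s_t,a_t)\right] \le 0,
  \qquad i = 1,\dots,N,
\]

and the Lagrange multiplier method folds the inequality constraints into the single relaxed objective

\[
  L(\theta,\lambda) \;=\; J(\theta) \;+\; \sum_{i=1}^{N} \lambda_i\, G_i(\theta),
  \qquad \lambda_i \ge 0,
\]

which an online actor–critic scheme can then optimize on separate timescales: a fast critic estimate of the average cost, a slower actor (policy gradient) update in θ, and a slowest ascent step in the multipliers λ.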



Author information


Corresponding author

Correspondence to Shalabh Bhatnagar.

Additional information

Communicated by Mark J. Balas.


About this article

Cite this article

Bhatnagar, S., Lakshmanan, K. An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes. J Optim Theory Appl 153, 688–708 (2012). https://doi.org/10.1007/s10957-012-9989-5
