Abstract
We develop an online actor–critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework, in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample-path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multi-stage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance in this setting and converges to a feasible point.
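To make the constrained formulation referenced in the abstract concrete, the following is a schematic sketch of the standard Lagrangian relaxation for such problems. The notation J(θ), G_i(θ), α_i, λ_i and the step sizes a(n), b(n) are illustrative assumptions and not necessarily the paper's own symbols; the sketch shows the general technique, not the authors' exact update rules.

```latex
% Schematic Lagrangian relaxation of a constrained average-cost MDP.
% Notation is illustrative and may differ from the paper's:
%   J(\theta)    -- long-run average cost under the parameterized policy \pi_\theta
%   G_i(\theta)  -- long-run average of the i-th constraint sample-path function
%   \alpha_i     -- prescribed upper bound for the i-th constraint
%   \lambda_i    -- Lagrange multiplier for the i-th constraint
\[
  L(\theta,\lambda) \;=\; J(\theta) \;+\; \sum_{i=1}^{N} \lambda_i \bigl( G_i(\theta) - \alpha_i \bigr),
  \qquad \lambda_i \ge 0 .
\]
% A locally optimal feasible point corresponds to a local saddle point of L:
% gradient descent in \theta (the actor, driven by critic estimates) coupled with
% projected gradient ascent in \lambda, with diminishing step sizes a(n), b(n)
% and [\,\cdot\,]_+ denoting projection onto [0,\infty):
\[
  \theta_{n+1} \;=\; \theta_n \;-\; a(n)\,\widehat{\nabla_\theta L}(\theta_n,\lambda_n),
  \qquad
  \lambda_{i,n+1} \;=\; \Bigl[\,\lambda_{i,n} \;+\; b(n)\,\bigl(\widehat{G}_i(\theta_n) - \alpha_i\bigr)\Bigr]_{+}.
\]
```

In multi-timescale schemes of this kind the multiplier is usually updated on the slowest timescale, so that the policy update effectively sees a fixed λ; this separation is what typically enables an almost sure convergence analysis via coupled ODEs.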
Communicated by Mark J. Balas.
Cite this article
Bhatnagar, S., Lakshmanan, K. An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes. J Optim Theory Appl 153, 688–708 (2012). https://doi.org/10.1007/s10957-012-9989-5