Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes

Abstract

This article proposes several two-timescale simulation-based actor-critic algorithms for the solution of infinite-horizon Markov decision processes with finite state space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting, while the rest are for finite-action spaces. On the slower timescale, all the algorithms perform a gradient search over their corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated, and an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed to do well in our setting. Numerical experiments on a problem of rate-based flow control are presented using the proposed algorithms. We consider here the model of a single bottleneck node in a continuous-time queueing framework. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms exhibit more than an order of magnitude better performance than those of Konda and Borkar (1999).
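
To make the two-timescale structure concrete, the following is a minimal illustrative sketch, not the authors' exact algorithm: on the faster step-size schedule the critic performs average-cost TD(0) with one-hot state features, while on the slower schedule the actor perturbs a softmax policy parameterization with a random ±1 direction and takes a one-measurement SPSA step in the spirit of Spall (1997). The toy two-state MDP, the softmax parameterization, the step-size exponents, and the inner-loop length are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_S, N_A = 2, 2
# P[s, a, s'] : transition probabilities of a toy 2-state MDP (assumed).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.4, 0.6]]])
# c[s, a] : single-stage costs (assumed).
c = np.array([[1.0, 0.5],
              [2.0, 0.3]])
PHI = np.eye(N_S)                 # one-hot state features for the critic

def softmax_policy(theta):
    """Randomized stationary policy pi(a|s) from per-state preferences."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def spsa_actor_critic(n_actor_iter=2000, inner_steps=100, delta=0.1):
    theta = np.zeros((N_S, N_A))  # actor (policy) parameters
    v = np.zeros(N_S)             # critic weights: h(s) ~ PHI[s] @ v
    rho = 0.0                     # running average-cost estimate
    s, t = 0, 0
    for m in range(1, n_actor_iter + 1):
        a_slow = 1.0 / m                                   # slow step-size
        Delta = rng.choice([-1.0, 1.0], size=theta.shape)  # SPSA direction
        pi = softmax_policy(theta + delta * Delta)         # perturbed policy
        # Fast timescale: TD(0) tracks the perturbed policy's differential
        # cost and average cost over an inner batch of simulated transitions.
        for _ in range(inner_steps):
            t += 1
            b_fast = 1.0 / t ** 0.6                        # fast step-size
            a = rng.choice(N_A, p=pi[s])
            s_next = rng.choice(N_S, p=P[s, a])
            cost = c[s, a]
            rho += b_fast * (cost - rho)
            td_err = cost - rho + PHI[s_next] @ v - PHI[s] @ v
            v += b_fast * td_err * PHI[s]
            s = s_next
        # Slow timescale: one-measurement SPSA step on the policy parameters,
        # using the critic's average-cost estimate under the perturbed policy.
        theta -= a_slow * rho / (delta * Delta)
    return softmax_policy(theta), rho

if __name__ == "__main__":
    pi, rho = spsa_actor_critic()
    print("estimated average cost:", round(rho, 3))
    print("learned policy pi(a|s):\n", pi.round(3))
```

In the article itself, the faster-timescale recursion also performs an additional averaging step, and the memory-efficient variant uses a general feature-based representation of the state space; the one-hot features above simply reduce to the tabular case.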

References

  • Altman E (2001) Applications of Markov decision processes in communication networks: a survey. In: Feinberg E, Shwartz A (eds) Handbook of Markov Decision Processes: Methods and Applications. Kluwer, Dordrecht

  • Bertsekas DP (1976) Dynamic programming and stochastic control. Academic Press, New York

  • Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont, MA

  • Bhatnagar S (2005) Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation 15(1):74–107

  • Bhatnagar S, Abdulla MS (2005) Reinforcement learning based algorithms for finite horizon Markov decision processes (submitted)

  • Bhatnagar S, Fu MC, Marcus SI, Fard PJ (2001) Optimal structured feedback policies for ABR flow control using two-timescale SPSA. IEEE/ACM Transactions on Networking 9(4):479–491

  • Bhatnagar S, Fu MC, Marcus SI, Bhatnagar S (2001) Two timescale algorithms for simulation optimization of hidden Markov models. IIE Transactions (Pritsker special issue on simulation) 33(3):245–258

  • Bhatnagar S, Fu MC, Marcus SI, Wang I-J (2003) Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation 13(4):180–209

  • Bhatnagar S, Kumar S (2004) A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes. IEEE Transactions on Automatic Control 49(4):592–598

  • Bhatnagar S, Panigrahi JR (2006) Actor-critic algorithms for hierarchical Markov decision processes. Automatica 42(4):637–644

  • Borkar VS (1998) Asynchronous stochastic approximation. SIAM Journal on Control and Optimization 36:840–851

  • Borkar VS, Konda VR (1997) Actor-critic algorithm as multi-time scale stochastic approximation. Sadhana 22:525–543

  • Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2):447–469

  • Gerencser L, Hill SD, Vago Z (1999) Optimization over discrete sets via SPSA. In: Proceedings of the 38th IEEE Conference on Decision and Control (CDC'99), Phoenix, Arizona, pp. 1791–1794

  • He Y, Fu MC, Marcus SI (2000) A simulation-based policy iteration algorithm for average cost unichain Markov decision processes. In: Laguna M, Gonzalez-Velarde JL (eds) Computing tools for modeling, optimization and simulation, Kluwer, pp. 161–182

  • Konda VR, Borkar VS (1999) Actor-critic type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization 38(1):94–123

  • Konda VR, Tsitsiklis JN (2003) Actor-critic algorithms. SIAM Journal on Control and Optimization 42(4):1143–1166

  • Kushner HJ, Clark DS (1978) Stochastic approximation methods for constrained and unconstrained systems. Springer, Berlin Heidelberg New York

  • Marbach P, Mihatsch O, Tsitsiklis JN (2000) Call admission control and routing in integrated services networks using neuro-dynamic programming. IEEE Journal on Selected Areas in Communications 18(2):197–208

  • Perko L (1998) Differential equations and dynamical systems, 2nd ed. Texts in Applied Mathematics, vol. 7. Springer, Berlin Heidelberg New York

  • Puterman ML (1994) Markov decision processes: Discrete stochastic dynamic programming. Wiley, New York

  • Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5):674–690

  • Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica 35(11):1799–1808

  • Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37:332–341

  • Spall JC (1997) A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33:109–112

  • Van Roy B (2001) Neuro-dynamic programming: Overview and recent trends. In: Feinberg E, Shwartz A (eds) Handbook of Markov Decision Processes: Methods and Applications. Kluwer, Dordrecht

Author information

Corresponding author

Correspondence to Shalabh Bhatnagar.

Additional information

This work was supported in part by Grant no. SR/S3/EE/43/2002-SERC-Engg from the Department of Science and Technology, Government of India.

About this article

Cite this article

Abdulla, M.S., Bhatnagar, S. Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes. Discrete Event Dyn Syst 17, 23–52 (2007). https://doi.org/10.1007/s10626-006-0003-y
