Abstract
This article proposes several two-timescale simulation-based actor-critic algorithms for solving infinite-horizon Markov decision processes with a finite state space under the average cost criterion. Two of the algorithms are designed for the compact (non-discrete) action setting, while the rest are for finite action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different simultaneous perturbation stochastic approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated, and an additional averaging step is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not rely on on-line sampling of states, yet is observed to perform well in our setting. Numerical experiments with the proposed algorithms are presented for a problem of rate-based flow control, where we consider the model of a single bottleneck node in a continuous-time queueing framework. We compare the performance of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004); our algorithms exhibit more than an order of magnitude better performance than those of Konda and Borkar (1999).
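To make the two-timescale structure described above concrete, the following is a minimal sketch in Python/NumPy of a generic SPSA-based actor-critic loop for an average-cost MDP. It is not the algorithm analysed in the paper: the single-stage cost simulator `step`, the per-state policy parameterization `theta`, the step-size schedules, and the projection bounds are all illustrative assumptions. The sketch only shows the separation of timescales: a faster (critic) recursion that tracks the average cost and a differential-cost estimate, and a slower (actor) recursion that moves the policy parameter along a one-measurement SPSA-style gradient estimate obtained under a randomly perturbed policy.

```python
import numpy as np

# Minimal two-timescale SPSA actor-critic sketch for an average-cost MDP
# with states {0, ..., n_states-1}.  Everything below (simulator, policy
# parameterization, step sizes, bounds) is an illustrative assumption,
# not the algorithm from the paper.

rng = np.random.default_rng(0)
n_states, n_iters = 10, 20000
theta = np.full(n_states, 5.0)   # actor: one (compact-valued) action per state
h = np.zeros(n_states)           # critic: differential-cost estimates
rho = 0.0                        # running estimate of the average cost
delta = 0.1                      # SPSA perturbation magnitude

def step(state, action):
    """Hypothetical single-stage simulator: returns (cost, next_state)."""
    cost = (state - action) ** 2 + rng.normal(scale=0.1)
    next_state = rng.integers(n_states)
    return cost, next_state

s = 0
for k in range(1, n_iters + 1):
    b = k ** -0.6                # faster (critic) step size
    a = 1.0 / k                  # slower (actor) step size; a/b -> 0

    # Faster timescale: TD(0)-style update of average and differential cost.
    cost, s_next = step(s, theta[s])
    rho += b * (cost - rho)
    h[s] += b * (cost - rho + h[s_next] - h[s])

    # Slower timescale: one-measurement SPSA-style gradient estimate from a
    # single cost sample under the randomly perturbed policy.
    perturb = rng.choice([-1.0, 1.0], size=n_states)
    cost_plus, _ = step(s, (theta + delta * perturb)[s])
    theta[s] -= a * cost_plus * perturb[s] / delta
    theta = np.clip(theta, 0.0, float(n_states - 1))   # projection to action set

    s = s_next
```

In the algorithms of the paper, the faster-timescale recursion estimates the differential cost of the (perturbed) stationary policy, and it is that estimate, rather than a raw one-stage cost sample, that drives the SPSA gradient search; the sketch above is only intended to convey the timescale separation.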
References
Altman E (2001) Applications of Markov decision processes in communication networks: a survey. In: Feinberg E, Shwartz A (eds) Handbook of Markov Decision Processes: Methods and Applications. Kluwer, Dordrecht
Bertsekas DP (1976) Dynamic programming and stochastic control. Academic Press, New York
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont, MA
Bhatnagar S (2005) Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation 15(1):74–107
Bhatnagar S, Abdulla MS (2005) Reinforcement learning based algorithms for finite horizon Markov decision processes (submitted)
Bhatnagar S, Fu MC, Marcus SI, Fard PJ (2001) Optimal structured feedback policies for ABR flow control using two-timescale SPSA. IEEE/ACM Transactions on Networking 9(4):479–491
Bhatnagar S, Fu MC, Marcus SI, Bhatnagar S (2001) Two timescale algorithms for simulation optimization of hidden Markov models. IIE Transactions (Pritsker special issue on simulation) 33:245–258
Bhatnagar S, Fu MC, Marcus SI, Wang I-J (2003) Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation 13(4):180–209
Bhatnagar S, Kumar S (2004) A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes. IEEE Transactions on Automatic Control 49(4):592–598
Bhatnagar S, Panigrahi JR (2006) Actor-critic algorithms for hierarchical Markov decision processes. Automatica 42(4):637–644
Borkar VS (1998) Asynchronous stochastic approximation. SIAM Journal on Control and Optimization 36:840–851
Borkar VS, Konda VR (1997) Actor-critic algorithm as multi-time scale stochastic approximation. Sadhana 22:525–543
Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2):447–469
Gerencser L, Hill SD, Vago Z (1999) Optimization over discrete sets via SPSA. In: Proceedings of the 38th IEEE Conference on Decision and Control (CDC99), Phoenix, Arizona, pp. 1791–1794
He Y, Fu MC, Marcus SI (2000) A simulation-based policy iteration algorithm for average cost unichain Markov decision processes. In: Laguna M, Gonzalez-Velarde JL (eds) Computing tools for modeling, optimization and simulation, Kluwer, pp. 161–182
Konda VR, Borkar VS (1999) Actor-critic type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization 38(1):94–123
Konda VR, Tsitsiklis JN (2003) Actor-critic algorithms. SIAM Journal on Control and Optimization 42(4):1143–1166
Kushner HJ, Clark DS (1978) Stochastic approximation methods for constrained and unconstrained systems. Springer, Berlin Heidelberg New York
Marbach P, Mihatsch O, Tsitsiklis JN (2000) Call admission control and routing in integrated services networks using neuro-dynamic programming. IEEE Journal on Selected Areas in Communications 18(2):197–208
Perko L (1998) Differential equations and dynamical systems, 2nd edn. Texts in Applied Mathematics, vol. 7. Springer, Berlin Heidelberg New York
Puterman ML (1994) Markov decision processes: Discrete stochastic dynamic programming. Wiley, New York
Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37:332–341
Spall JC (1997) A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33:109–112
Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5):674–690
Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica 35(11):1799–1808
Van Roy B (2001) Neuro-dynamic programming: Overview and recent trends. In: Feinberg E, Shwartz A (eds) Handbook of Markov Decision Processes: Methods and Applications. Kluwer, Dordrecht
Additional information
This work was supported in part by Grant no. SR/S3/EE/43/2002-SERC-Engg from the Department of Science and Technology, Government of India.
Cite this article
Abdulla, M.S., Bhatnagar, S. Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes. Discrete Event Dyn Syst 17, 23–52 (2007). https://doi.org/10.1007/s10626-006-0003-y