
Approximate Gradient Methods in Policy-Space Optimization of Markov Reward Processes


Abstract

We consider a discrete-time, finite-state Markov reward process that depends on a set of parameters. We start with a brief review of (stochastic) gradient descent methods that tune the parameters to optimize the average reward, using a single (possibly simulated) sample path of the process of interest. The resulting algorithms can be implemented online and have the property that the gradient of the average reward converges to zero with probability 1. On the other hand, the updates can have high variance, resulting in slow convergence. We address this issue and propose two approaches to reduce the variance. These approaches rely on approximate gradient formulas, which introduce an additional bias into the update direction. We derive bounds for these bias terms and characterize the asymptotic behavior of the resulting algorithms. For one of the approaches considered, the magnitude of the bias term exhibits an interesting dependence on the time it takes for the rewards to reach steady state. We also apply the methodology to Markov reward processes with a reward-free termination state and an expected total reward criterion. We use a call admission control problem to illustrate the performance of the proposed algorithms.
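To make the idea concrete, the sketch below is a minimal, hedged illustration of the kind of single-sample-path gradient method the abstract describes: an online likelihood-ratio gradient update for the average reward of a toy parameterized two-state chain, where a forgetting factor on the eligibility trace stands in for the variance-reducing (but bias-introducing) approximation. This is not the authors' exact algorithm; the chain, the function names (step_chain, online_gradient_ascent), and the parameters (alpha, step, n_steps) are assumptions introduced here for illustration.

```python
# A minimal sketch, not the paper's exact algorithm: online likelihood-ratio
# policy-gradient ascent on the average reward of a toy parameterized Markov
# reward process, using a single simulated sample path.  The forgetting factor
# `alpha` < 1 truncates the eligibility trace, lowering the variance of the
# updates at the cost of an extra bias term, in the spirit of the abstract.
# All names below (step_chain, alpha, step, ...) are illustrative assumptions.
import numpy as np

def step_chain(theta, state, rng):
    """One transition of a 2-state chain whose transition probabilities depend
    on a scalar parameter theta through a logistic function."""
    p = 1.0 / (1.0 + np.exp(-theta))          # P(next state = 1)
    next_state = 1 if rng.random() < p else 0
    reward = 1.0 if next_state == 1 else 0.2  # state 1 pays a higher reward
    # score = d/dtheta log P(next_state | state, theta) for the logistic rule
    score = (1.0 - p) if next_state == 1 else -p
    return next_state, reward, score

def online_gradient_ascent(theta0=0.0, alpha=0.95, step=0.01,
                           n_steps=50_000, seed=0):
    rng = np.random.default_rng(seed)
    theta, state = theta0, 0
    avg_reward, trace = 0.0, 0.0
    for t in range(1, n_steps + 1):
        state, r, score = step_chain(theta, state, rng)
        trace = alpha * trace + score              # discounted eligibility trace
        avg_reward += (r - avg_reward) / t         # running average-reward estimate
        theta += step * (r - avg_reward) * trace   # approximate gradient step
    return theta, avg_reward

if __name__ == "__main__":
    theta, avg_r = online_gradient_ascent()
    print(f"theta = {theta:.3f}, estimated average reward = {avg_r:.3f}")
```

In this toy setting, increasing theta raises the probability of visiting the higher-reward state, so the parameter drifts upward and the running average-reward estimate improves. Shrinking alpha shortens the effective memory of the trace, which mirrors, at a very informal level, the bias-variance trade-off the abstract refers to.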





Cite this article

Marbach, P., Tsitsiklis, J.N. Approximate Gradient Methods in Policy-Space Optimization of Markov Reward Processes. Discrete Event Dynamic Systems 13, 111–148 (2003). https://doi.org/10.1023/A:1022145020786
