Abstract
In this paper, we study the variance minimization problem of Markov decision processes (MDPs) in which the policy is parameterized by action selection probabilities or other general parameters. Unlike the average or discounted criteria commonly used in traditional MDP theory, the variance criterion is difficult to handle because of the non-Markovian property caused by the nonlinear (quadratic) structure of the variance function. Using the basic idea of sensitivity-based optimization, we derive a difference formula for the reward variance under any two parametric policies, as well as a variance derivative formula. With these sensitivity formulas, we obtain a necessary condition for the optimal policy with the minimal variance. We also prove that the optimal policy with the minimal variance can be found in the deterministic policy space. An iterative algorithm is further developed to efficiently reduce the reward variance, and this algorithm converges to a local optimal policy. Finally, we conduct numerical experiments to demonstrate the main results of this paper.
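The abstract's iterative scheme can be illustrated with a minimal sketch: a toy two-state MDP whose policy is parameterized by action selection probabilities, with the steady-state reward variance reduced by gradient descent on those parameters. All numbers, the finite-difference gradient, and the backtracking step size below are illustrative assumptions for exposition, not the paper's actual derivative formula or algorithm.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a] is the transition matrix under action a; r[a] is the reward vector.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.3, 0.7], [0.6, 0.4]])]
r = [np.array([1.0, 3.0]), np.array([2.0, 0.5])]

def steady_state_variance(theta):
    """Long-run variance of the reward under a randomized policy.

    theta[s] = probability of choosing action 0 in state s.
    Returns sum_s pi(s) * (r_theta(s) - eta)^2, where pi is the
    stationary distribution and eta the long-run average reward.
    """
    theta = np.clip(theta, 0.0, 1.0)
    # Policy-averaged transition matrix and reward vector.
    P_th = theta[:, None] * P[0] + (1 - theta)[:, None] * P[1]
    r_th = theta * r[0] + (1 - theta) * r[1]
    # Stationary distribution: solve pi P = pi with sum(pi) = 1.
    A = np.vstack([P_th.T - np.eye(2), np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    eta = pi @ r_th                       # average reward
    return pi @ (r_th - eta) ** 2         # steady-state variance

def minimize_variance(theta, lr=0.5, eps=1e-4, iters=100):
    """Finite-difference gradient descent on the variance, with a
    simple backtracking step size so the variance never increases."""
    v = steady_state_variance(theta)
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for s in range(len(theta)):
            e = np.zeros_like(theta)
            e[s] = eps
            grad[s] = (steady_state_variance(theta + e)
                       - steady_state_variance(theta - e)) / (2 * eps)
        cand = np.clip(theta - lr * grad, 0.0, 1.0)
        v_cand = steady_state_variance(cand)
        if v_cand < v:
            theta, v = cand, v_cand      # accept the improving step
        else:
            lr *= 0.5                    # otherwise shrink the step
    return theta

theta0 = np.array([0.5, 0.5])
theta_star = minimize_variance(theta0)
```

Consistent with the paper's result that an optimal policy exists in the deterministic policy space, such descent iterates typically drift toward action probabilities at or near 0 or 1.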
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China (2016YFB0901900), the National Natural Science Foundation of China (61573206, 61203039, U1301254), and the Suzhou-Tsinghua Innovation Leading Action Project.
Additional information
This article belongs to the Topical Collection: Special Issue on Performance Analysis and Optimization of Discrete Event Systems
Guest Editors: Christos G. Cassandras and Alessandro Giua
Xia, L. Variance minimization of parameterized Markov decision processes. Discrete Event Dyn Syst 28, 63–81 (2018). https://doi.org/10.1007/s10626-017-0258-5