
Automatica

Volume 67, May 2016, Pages 77-84

Brief paper
A unified approach to time-aggregated Markov decision processes

https://doi.org/10.1016/j.automatica.2015.12.022

Abstract

This paper presents a unified approach to time-aggregated Markov decision processes (MDPs) with an average cost criterion. The approach is based on a framework in which a time-aggregated MDP constitutes a semi-Markov decision process (SMDP). By analyzing the performance sensitivity formulas of this SMDP, a number of optimization algorithms for time aggregated MDPs, including those previously reported in the literature, can be developed in a simple and intuitive way.

Introduction

Markov decision processes (MDPs) are widely used models in a variety of fields, including control, artificial intelligence, and operations research (Puterman, 1994, Sutton and Barto, 1998). The major difficulty in solving MDPs is the so-called "curse of dimensionality" (Puterman, 1994). Reducing the dimensionality of the state space can substantially improve the computational efficiency of MDP solutions. The time aggregation approach (Cao, Ren, Bhatnagar, Fu, & Marcus, 2002) reduces the state space of an MDP by dividing the process into time segments according to certain state subsets. Zhang and Ho (1991) presented performance gradient estimation for Markov processes with time aggregation using a stochastic recursive method and likelihood ratios. In addition, a number of optimization algorithms, including policy iteration (Cao et al., 2002, Ren and Krogh, 2005) and value iteration algorithms (Arruda and Fragoso, 2011, Ren and Krogh, 2005, Sun et al., 2007), have been developed for time-aggregated MDPs. However, these algorithms were proposed independently of one another, and the relationship between them remains unclear. The objectives of this paper are to provide a unified formulation, from the performance sensitivity point of view proposed in Cao (2007), that relates the previously reported algorithms systematically, and to propose new optimization algorithms for time-aggregated MDPs.

We first show that a time-aggregated MDP essentially constitutes a semi-Markov decision process (SMDP). We then present a unified approach to time-aggregated MDPs using the performance sensitivity of this SMDP. Our approach is motivated by the sensitivity-based approach (Cao, 2007, Cao and Chen, 1997), in which performance sensitivity formulas provide a unified framework for MDPs. An infinitesimal generator-based performance sensitivity formula was proposed for SMDPs in Cao (2003), which we call a continuous time-type formula in this paper. We then present a discrete time-type performance sensitivity formula. By analyzing these performance sensitivity formulas, we propose a unified approach to time-aggregated MDPs from both the continuous-time and discrete-time perspectives. The proposed approach unifies and develops a number of optimization algorithms for time-aggregated MDPs, including those previously reported in the literature, in an intuitive and simple way. It extends the sensitivity-based approach (Cao, 2007) and provides new insights into time-aggregated MDPs. Its significance can be described as follows. (1) A unified formulation for policy iteration algorithms is obtained by directly comparing two types of performance difference formulas, an approach that is simpler and more intuitive than those in the previous literature. (2) Different value iteration algorithms are investigated in a unified way. This unification reveals how value iteration algorithms differ when developed from the two types of Bellman optimality equations. On this basis, we present a stochastic shortest path (SSP) value iteration and a generalized standard value iteration. The SSP value iteration preserves the weighted sup-norm contraction property (Bertsekas, 1998), which is helpful for developing asynchronous iterations. The generalized standard value iteration can be understood more intuitively as a traditional value iteration than the data-transformation method in Arruda and Fragoso (2011) or Puterman (1994), and it obviates the need to solve several average-cost MDPs or SSPs during the value iteration process. (3) Finally, the proposed approach provides a performance gradient-based optimization algorithm that can be applied to cases in which the transition probabilities are unknown.


Time-aggregated Markov decision processes

We briefly describe the standard MDP and the time-aggregated MDP, following the notation in Cao et al. (2002). Consider a time-homogeneous discrete-time MDP $X = \{X_t, t = 0, 1, \ldots\}$ on a finite state space $S = \{1, 2, \ldots, M\}$. At any transition time $t$ with $X_t = i \in S$, an action $a$ is taken from a feasible action space $A$. That action determines the transition probabilities $p^a(i,j)$ from state $i$ to state $j$, and a cost $f(i,a)$ is incurred. In this paper, we consider the set of stationary policies $\Pi_s$, which means that a policy $L$ maps each state $i \in S$ to an action $L(i) \in A$.
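Since the excerpt above is cut off, the following is only a minimal Python sketch of this setup as we read it from the standard time-aggregation formulation in Cao et al. (2002): a stationary policy fixes an action at every state, the aggregated (embedded) chain records $X$ only at its visits to a designated subset $I \subseteq S$, and $H_1(i, L(i))$ and $H_f(i, L(i))$ denote the expected length and expected accumulated cost of a segment started at embedded state $i$. All function and variable names are illustrative, not the paper's.

```python
import numpy as np

def estimate_segment_stats(P, f, policy, I, n_segments, rng):
    """Monte Carlo estimates of the segment length H_1(i) and segment cost
    H_f(i) of a time-aggregated MDP, for each embedded state i in I.

    P[a][i, j] : transition probability p^a(i, j) under action a
    f[i, a]    : one-step cost of taking action a in state i
    policy[i]  : action prescribed by the stationary policy L at state i
    I          : states at which the aggregated (embedded) chain is observed
    """
    I = sorted(I)
    M = next(iter(P.values())).shape[0]
    H1 = {i: [] for i in I}
    Hf = {i: [] for i in I}
    for i0 in I:
        for _ in range(n_segments):
            i, length, cost = i0, 0, 0.0
            while True:
                a = policy[i]
                cost += f[i, a]
                length += 1
                i = rng.choice(M, p=P[a][i])
                if i in I:          # the segment ends at the next visit to I
                    break
            H1[i0].append(length)
            Hf[i0].append(cost)
    return ({i: float(np.mean(H1[i])) for i in I},
            {i: float(np.mean(Hf[i])) for i in I})
```

For example, `estimate_segment_stats(P, f, policy, I={0, 1}, n_segments=10_000, rng=np.random.default_rng(0))` returns sample averages that approximate $H_1(i, L(i))$ and $H_f(i, L(i))$ for the two embedded states.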

Performance sensitivity

In this section, we analyze the structure of the performance sensitivity of the time-aggregated MDP via the associated SMDP.

An infinitesimal generator $A^L = \Lambda^L(\tilde{P}^L - I)$ is defined in Cao (2003), where $\Lambda^L = \mathrm{diag}\{1/H_1(1,L(1)), \ldots, 1/H_1(M_1,L(M_1))\}$. Let $p^L$ be the steady-state probability row vector of the SMDP; then $p^L$ satisfies $p^L A^L = 0$, $p^L e = 1$. From Ross (1996), we have $p^L = \tilde{\pi}^L (\Lambda^L)^{-1} / (\tilde{\pi}^L H_1^L)$. Define the cost-rate function vector under policy $L$ as $\Lambda^L H_f^L$. Then we have $\eta^L = \frac{\tilde{\pi}^L (\Lambda^L)^{-1}}{\tilde{\pi}^L H_1^L}\, \Lambda^L H_f^L = p^L \Lambda^L H_f^L$. Thus, performance (4) is
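To make these formulas concrete, here is a small numerical check with illustrative numbers only (in practice $\tilde{\pi}^L$, $H_1^L$ and $H_f^L$ come from the embedded chain of the time-aggregated MDP). It verifies that $p^L \Lambda^L H_f^L$ coincides with the algebraically equivalent ratio form $\tilde{\pi}^L H_f^L / \tilde{\pi}^L H_1^L$.

```python
import numpy as np

# Illustrative numbers only: tilde_pi is the steady-state distribution of the
# embedded chain, H1 the expected segment lengths, Hf the expected segment costs.
tilde_pi = np.array([0.5, 0.3, 0.2])     # \tilde{\pi}^L, row vector
H1 = np.array([2.0, 4.0, 3.0])           # H_1(i, L(i)) for each embedded state
Hf = np.array([3.0, 10.0, 6.0])          # H_f(i, L(i)) for each embedded state

Lam = np.diag(1.0 / H1)                  # \Lambda^L = diag{1 / H_1(i, L(i))}

# Steady-state probabilities of the SMDP:
#   p^L = tilde_pi (Lambda^L)^{-1} / (tilde_pi H_1^L)
p = tilde_pi @ np.linalg.inv(Lam) / (tilde_pi @ H1)

# Average cost eta^L = p^L Lambda^L H_f^L  (the cost-rate vector is Lambda^L H_f^L)
eta = p @ (Lam @ Hf)

# Equivalent ratio form: eta^L = (tilde_pi H_f^L) / (tilde_pi H_1^L)
eta_check = (tilde_pi @ Hf) / (tilde_pi @ H1)
assert np.isclose(eta, eta_check)
print(p, eta)
```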

Unified approach to time-aggregated MDP

In this section, on the basis of the performance difference formulas (6) and (10), we show that many of the optimization algorithms for time-aggregated MDPs, including policy iteration, value iteration, linear programming, and performance gradient-based algorithms, can be developed in a simple and intuitive way.
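The difference formulas (6) and (10) themselves are not reproduced in this excerpt, so the sketch below only illustrates the kind of algorithm they justify: generic policy iteration on the embedded SMDP. It assumes integer-labelled actions $0, \ldots, |A|-1$, a unichain embedded chain under every policy, and that the embedded transition matrices $\tilde{P}^a$ and the segment statistics $H_1$, $H_f$ are available (e.g., estimated as sketched earlier); it is not the paper's exact algorithm.

```python
import numpy as np

def smdp_policy_iteration(P_tilde, H1, Hf, max_iter=100):
    """Policy iteration on the embedded SMDP of a time-aggregated MDP.

    P_tilde[a] : (M1, M1) embedded transition matrix under action a
    H1[i, a]   : expected segment length from embedded state i under action a
    Hf[i, a]   : expected segment cost from embedded state i under action a
    """
    actions = list(P_tilde.keys())
    M1 = H1.shape[0]
    policy = np.zeros(M1, dtype=int)
    for _ in range(max_iter):
        # --- policy evaluation: solve (I - P_L) g + eta * h1 = hf with g(M1-1) = 0 ---
        P_L = np.array([P_tilde[policy[i]][i] for i in range(M1)])
        h1 = np.array([H1[i, policy[i]] for i in range(M1)])
        hf = np.array([Hf[i, policy[i]] for i in range(M1)])
        A = np.hstack([(np.eye(M1) - P_L)[:, :-1], h1[:, None]])
        sol = np.linalg.solve(A, hf)        # unknowns: g(0..M1-2) and eta
        g, eta = np.append(sol[:-1], 0.0), sol[-1]
        # --- policy improvement ---
        new_policy = policy.copy()
        for i in range(M1):
            q = [Hf[i, a] - eta * H1[i, a] + P_tilde[a][i] @ g for a in actions]
            new_policy[i] = actions[int(np.argmin(q))]
        if np.array_equal(new_policy, policy):
            break                           # the policy is stable, hence optimal
        policy = new_policy
    return policy, eta, g
```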

Simulation example

In this section, we replicate the multi-component replacement problem in Sun et al. (2007). An asset consists of several components that must be replaced if they reach their predefined lifetime or fail. The replacement of any component incurs a common setup cost and a new-component cost. We consider an asset comprising three components. The lifetime of each is 10 periods, and both the setup cost and the cost of each new component are 10 units. Each component may fail with probability 0.01 in each period.
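As a rough reconstruction of this example, the sketch below simulates a simple replacement rule for the three-component asset; the per-period failure probability of 0.01 and the threshold-type group-replacement policy are our assumptions for illustration and are not taken from the paper's optimization results.

```python
import numpy as np

# Illustrative parameters of the multi-component replacement example
# (our reading of the setup; see the hedges in the text above).
LIFETIME, SETUP, UNIT, P_FAIL = 10, 10.0, 10.0, 0.01

def average_cost(replace_threshold, n_periods=100_000, seed=0):
    """Average cost per period of a simple threshold policy: whenever any
    component fails or reaches its lifetime, also replace every component
    whose age is at least `replace_threshold` (sharing one setup cost)."""
    rng = np.random.default_rng(seed)
    ages = np.zeros(3, dtype=int)
    total = 0.0
    for _ in range(n_periods):
        ages += 1
        due = (rng.random(3) < P_FAIL) | (ages >= LIFETIME)
        if due.any():
            replaced = due | (ages >= replace_threshold)
            total += SETUP + UNIT * replaced.sum()
            ages[replaced] = 0              # components not replaced keep aging
    return total / n_periods

for threshold in (6, 8, 10):
    print(threshold, round(average_cost(threshold), 3))
```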

Conclusion

This paper proposes a unified approach to time-aggregated MDPs. Using two types of performance sensitivity formulas, we unify and develop a number of optimization algorithms from the sensitivity-based point of view, including policy iteration algorithms, value iteration algorithms, linear programming algorithms, and performance gradient-based algorithms. The algorithms developed in this paper can be directly applied to the SMDP and to MDPs with fractional costs. The approach fits the recently

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments, which have helped us to greatly improve the content of this paper. The authors were partially supported by the National Natural Science Foundation of China (61004036), Shenzhen-Hongkong innovation cycle project (SGLH20120925143844293), Shenzhen basic research program (JCYJ20140417172417158, JCYJ20140901003938996, JCYJ20150731105106111), Guangdong province—CAS strategic collaboration project (2013B091000009), the

Yanjie Li received his B.Sc. degree from Qingdao University (QDU), Qingdao, China, in 2001 and Ph.D. degree from the University of Science and Technology of China (USTC), Hefei, China, in 2006. From August 2006 to August 2008, he was a research associate in the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology (HKUST). In September 2008, he joined Harbin Institute of Technology Shenzhen Graduate School (HITSGS), Shenzhen, China. Now he is an associate professor at HITSGS. He is the recipient of Ho-Pan-Ching-Yi best paper award in 2014. His research interests include stochastic learning and optimization, Markov decision process (MDP), partially observable MDP and reinforcement learning.

Xinyu Wu is now a professor at Shenzhen Institutes of Advanced Technology, and associate director of the Center for Intelligent and Biomimetic Systems. He received his B.E. and M.E. degrees from the Department of Automation, University of Science and Technology of China (USTC) in 2001 and 2004, respectively, and his Ph.D. degree from the Chinese University of Hong Kong in 2008. He has published over 100 papers and two monographs. His research interests include computer vision, robotics, intelligent systems, and optimization.

The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Bart De Schutter under the direction of Editor Ian R. Petersen.
