Automatica
Volume 42, Issue 3, March 2006, Pages 393-403

The control of a two-level Markov decision process by time aggregation

https://doi.org/10.1016/j.automatica.2005.11.006

Abstract

The solution of Markov Decision Processes (MDPs) often relies on special properties of the processes. For two-level MDPs, the difference in the rates of state changes of the upper and lower levels has led to limiting or approximate solutions of such problems. In this paper, we solve a two-level MDP without making any assumption on the rates of state changes of the two levels. We first show that such a two-level MDP is a non-standard one where the optimal actions of different states can be related to each other. Then we give assumptions (conditions) under which such a specially constrained MDP can be solved by policy iteration. We further show that the computational effort can be reduced by decomposing the MDP. A two-level MDP with M upper-level states can be decomposed into one MDP for the upper level and M to M(M-1) MDPs for the lower level, depending on the structure of the two-level MDP. The upper-level MDP is solved by time aggregation, a technique introduced in a recent paper [Cao, X.-R., Ren, Z. Y., Bhatnagar, S., Fu, M., & Marcus, S. (2002). A time aggregation approach to Markov decision processes. Automatica, 38(6), 929–943.], and the lower-level MDPs are solved by embedded Markov chains.

Introduction

In Markov Decision Processes (MDPs) with a two-level hierarchical structure, the states are formed by the status of both the upper and the lower levels, and state changes can be caused by status changes at either level. Decisions are also made at each level. Such decisions affect both the state transitions of the two levels and the reward of the MDP. The existing solutions of such MDPs often rely on the difference in the time scales of the two levels, i.e., the rate of state changes in the lower level is faster than that of the upper level by multiple orders of magnitude. Between two upper-level state changes, the lower level has on average gone through so many state changes that one can make use of the long-run sum or average from the lower level to make a decision at an upper-level state change. See Chang, Fard, Marcus, and Shayman (2003) and its references for examples of MDPs with multiple time scales.

In a different context and for a different purpose, singularly perturbed MDPs also make use of two time scales (Abbad et al., 1992; Bielecki & Filar, 1991). Reducible transition probability matrices of policies, whose states form disjoint, closed communicating classes, are made positive by a policy-dependent perturbation ɛD, where ɛ>0 is a constant. By controlling ɛ, the inter-class state transitions can be made less frequent than the intra-class ones. One main result for singularly perturbed MDPs is that the optimal control policy for the “limit control MDPs”, the cases as ɛ→0, is a good control policy for singularly perturbed MDPs with a sufficiently small ɛ.

The idea of two time scales also occurs in hybrid stochastic systems (Filar et al., 2001; Filar & Haurie, 2001). The operation modes in the upper level are modeled by Markov jump processes and the characteristics of the lower level are modeled either by deterministic functions (Filar et al., 2001) or by diffusion processes (Filar & Haurie, 2001), with both the upper-level operation modes and the lower-level characteristics controllable.

In this paper, we study a class of two-level MDPs under the long-run average reward criterion. Our two-level MDPs have a structure similar to MDPs of two time scales, except that our model is an atypical MDP: an upper-level decision at a state can affect the state transitions of a group of states sharing the same upper-level state. The various decisions, within the lower level and across both levels, are coupled, i.e., we cannot single out and solve the decisions level by level, purpose by purpose. Such a coupling effect increases the computational burden of solving the two-level MDP, making it practically infeasible for real-life problems. We then show that the coupling effect disappears if (i) the sojourn times of each upper-level state—the duration between entering and leaving an upper-level state—are uncontrollable, and (ii) the set of the initial lower-level state distributions after an upper-level state change is independent of the lower-level states before the upper-level state change. Under both assumptions, the decisions are decoupled, the computational effort for the optimal policy becomes manageable, and it is possible to implement the centralized control scheme in a decentralized fashion.

To further reduce the computational effort, we show that the whole problem can be decomposed into smaller MDPs: one for the upper level and a number of MDPs for the lower level, where the number depends on the problem structure. Our solution of the upper-level MDP uses time aggregation (Cao, Ren, Bhatnagar, Fu, & Marcus, 2002), and our solution of the lower-level MDPs uses embedded Markov chains. Combining the algorithms of the two levels solves the two-level MDP.

Our model is related to that in Chang et al. (2003), though the two models have two critical differences. First, the number of lower-level state changes between two upper-level state changes is fixed in Chang et al. (2003) but random here. With a constant time between two upper-level state changes, it is conceptually straightforward to embed the process at upper-level state changes and use the total rewards over upper-level sojourn times to look for optimal decisions. Nonetheless, the computational effort for all possible nonstationary policies is too large to be practical. Consequently, Chang et al. (2003) looks for approximations and bounds their performance in the same spirit as those in single-level MDPs. In our model, the calculation of the total rewards over sojourn times is complicated by randomness. We nevertheless find computationally simple closed-form expressions for such quantities, on which we later base an exact analysis of the two-level MDP. Second, the upper- and lower-level decisions in Chang et al. (2003) are not coupled as ours are. There, any consideration for a state, e.g., its upper-level decision, can be made purely on the cost and benefit of that state. In our model, however, as shown in (1) below, states with the same upper level must take the same upper-level decision, a constraint that makes our MDP unconventional.

Our contributions are as follows. First, we tackle a two-level MDP from a perspective that does not rely on multiple time scales. Our analysis is exact, and we allow a random number of lower-level state changes between two upper-level state changes. Second, our two-level control problem is not a standard MDP because each action at the upper level applies to a group of states. We solve such a specially constrained MDP under different assumptions. Third, rather than settling for an algorithm that solves the two-level MDP, we go further to find algorithms that require less computational effort, continuing one of the authors’ previous work on time aggregation (Cao et al., 2002). Such algorithms can be implemented as optimal decentralized control. Finally, our approach sheds light on hybrid systems where the lower level is modeled as a continuous system.

Section snippets

The two-level MDP

Consider a two-level MDP with M upper-level states and N_i lower-level states for the ith upper-level state, i = 1, …, M. In general, N_i can differ from N_m for 1 ≤ i ≠ m ≤ M, and in specific applications, lower-level states of different upper levels can bear different physical meanings. For ease of reference, we call an upper-level state a mode, and its lower-level states settings of the mode. The state of the MDP at period t is denoted by (X_t, Y_t), where X_t is the mode and Y_t is the setting at period t.
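As a quick illustration of this state space, the following minimal Python sketch enumerates the composite (mode, setting) states; the values of M and the N_i are made-up placeholders, not taken from the paper.

```python
# Minimal sketch of the two-level state space described above.
# M (number of modes) and N_i (settings per mode) are illustrative only.
M = 3
N = {1: 2, 2: 4, 3: 3}          # N_i may differ across modes

# The MDP state is the pair (mode, setting); enumerate the composite space.
states = [(i, y) for i in range(1, M + 1) for y in range(1, N[i] + 1)]
print(len(states))              # total states = N_1 + ... + N_M = 9
print(states[:4])               # (1, 1), (1, 2), (2, 1), (2, 2)
```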

The coupling and decoupling of the two-level MDP

We will use policy iteration (cf. Cao, 1998, Cao, 1999; Cao & Chen, 1997; Cao, Yuan, & Qiu, 1996) to explain the coupling effect of actions and the way to decouple them. Other methods face the same difficulty as policy iteration for the coupled two-level MDP, and they are not as efficient at decoupling actions.

Policy iteration for two-level MDPs

In this section, we provide a policy iteration algorithm for a two-level MDP with decoupled actions. In the next section, we show that the computational effort can be reduced by decomposing the problem into the upper-level problem and a number of lower-level problems, all of which are of smaller size.
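For background, the sketch below is ordinary policy iteration for a single-level, unichain, average-reward MDP; it is not the paper's two-level algorithm, and the transition tensor P and reward matrix r are assumed inputs supplied by the modeler.

```python
import numpy as np

def policy_iteration_avg(P, r, max_iter=100):
    """Plain policy iteration for a unichain average-reward MDP (background only).

    P[a] is the S x S transition matrix under action a; r[s, a] is the
    one-step reward for taking action a in state s.
    """
    n_actions, S, _ = P.shape
    d = np.zeros(S, dtype=int)                   # start from an arbitrary policy
    for _ in range(max_iter):
        # Policy evaluation: solve g*1 + (I - P_d) h = r_d with h[0] fixed to 0.
        P_d = P[d, np.arange(S), :]
        r_d = r[np.arange(S), d]
        A = np.eye(S) - P_d
        A[:, 0] = 1.0                            # unknowns are (g, h[1], ..., h[S-1])
        sol = np.linalg.solve(A, r_d)
        g, h = sol[0], np.concatenate(([0.0], sol[1:]))
        # Policy improvement: greedy with respect to the bias h.
        Q = r + np.einsum('aij,j->ia', P, h)     # Q[s, a] = r(s,a) + sum_j P(j|s,a) h(j)
        d_new = Q.argmax(axis=1)
        if np.array_equal(d_new, d):
            return d, g, h                       # policy stable: stop
        d = d_new
    return d, g, h
```

The two-level setting of this paper differs from the above in that all states sharing the same mode must take the same upper-level action, which is the constraint the decoupling conditions are designed to handle.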

As preparation for the subsequent discussion, we introduce phase-type distributions, as discussed in Chapter 2 of Neuts (1981). Let N be a positive integer; B be an N×N non-negative matrix; B0 be an N
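A small numerical example of a discrete phase-type sojourn time, in the spirit of Neuts (1981): the matrix B and the initial distribution θ below are arbitrary placeholders, with B substochastic so that the sojourn eventually ends.

```python
import numpy as np

B = np.array([[0.6, 0.3],
              [0.2, 0.5]])                 # within-sojourn phase transitions (substochastic)
theta = np.array([0.7, 0.3])               # initial distribution over the phases
N = B.shape[0]

B0 = (np.eye(N) - B) @ np.ones(N)          # per-phase probability that the sojourn ends
fundamental = np.linalg.inv(np.eye(N) - B) # expected number of visits to each phase

mean_length = theta @ fundamental @ np.ones(N)           # E[sojourn length]
prob_len_3 = theta @ np.linalg.matrix_power(B, 2) @ B0   # P(sojourn length = 3)
print(mean_length, prob_len_3)
```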

The lower-level problem

Similar to ξ(m), we define an (i,m)-sojourn time ξ(i,m) as an m-sojourn time that is changed from mode i. Let h_f(ξ(m)) and h_f(ξ(i,m)) be, respectively, the total reward from an m-sojourn time ξ(m) and from an (i,m)-sojourn time ξ(i,m) for a performance function f. Setting t=1 as the beginning of the sojourn time ξ, we have
$$h_f(\xi)=\sum_{t=1}^{\lambda(\xi)} f(X_t,Y_t).$$
The expected total rewards are
$$H_f(m)=E[h_f(\xi(m))]=\theta(m)(I-\zeta_m S(m))^{-1}f_m \quad\text{and}\quad H_f(i,m)=E[h_f(\xi(i,m))]=\theta(i,m)(I-\zeta_m S(m))^{-1}f_m,$$
for the m- and (i,m)-sojourn times,
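Numerically, H_f(m) is a single linear solve. A sketch, with θ(m), ζ_m, S(m), and f_m as placeholder values (their construction from the lower-level chain of mode m is given in the full text):

```python
import numpy as np

S_m = np.array([[0.5, 0.4],
                [0.3, 0.6]])               # placeholder setting-to-setting transitions in mode m
theta_m = np.array([0.8, 0.2])             # placeholder initial setting distribution of the sojourn
zeta_m = 0.9                               # placeholder scalar multiplying S(m) in the formula
f_m = np.array([1.0, 2.5])                 # placeholder one-step rewards of the settings

N = S_m.shape[0]
# H_f(m) = theta(m) (I - zeta_m S(m))^{-1} f_m, evaluated via a linear solve.
H_f_m = theta_m @ np.linalg.solve(np.eye(N) - zeta_m * S_m, f_m)
print(H_f_m)
```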

Conclusion

In this paper, we show that for a two-level MDP, if the sojourn time of each mode is uncontrollable and the sets of the initial setting distributions after a mode change are independent of the settings before the mode change, the effect of the actions at different states can be decoupled and the problem can be solved with policy iteration accordingly. Furthermore, the upper-level MDP is solved by the time-aggregated approach, and the lower-level MDP for each mode is solved as a total-cost MDP

Acknowledgements

We thank the three anonymous referees and the Associate Editor for their constructive comments, which helped improve the content of the paper.


References (14)

  • X.-R. Cao et al., A time aggregation approach to Markov decision processes, Automatica (2002).
  • J.A. Filar et al., A two-factor stochastic production model with two time scales, Automatica (2001).
  • M. Abbad et al., Algorithms for singularly perturbed limiting average control problem, IEEE Transactions on Automatic Control (1992).
  • T.R. Bielecki et al., Singularly perturbed Markov control problems, Annals of Operations Research (1991).
  • X.-R. Cao, The relation among potentials, perturbation analysis, and Markov decision processes, Journal of Discrete Event Dynamic Systems (1998).
  • X.-R. Cao, Single sample path based optimization of Markov chains, Journal of Optimization Theory and Applications (1999).
  • X.-R. Cao et al., Perturbation realization, potentials, and sensitivity analysis of Markov processes, IEEE Transactions on Automatic Control (1997).


Yat-wah Wan received the B.S. degree in Mechanical Engineering from the University of Hong Kong, the M.S. degree in Industrial Engineering from Texas A&M University, and the Ph.D. degree in Operations Research from the University of California, Berkeley. From August 1991 to December 1993, he served in the Department of Manufacturing Engineering, City Polytechnic of Hong Kong, and from December 1993 to July 2004 in the Department of Industrial Engineering and Engineering Management, Hong Kong University of Science and Technology. He is currently an Associate Professor in the Institute of Global Operations Strategy and Logistics Management, National Dong Hwa University, which he joined in August 2004. His research interests include the control and optimization of stochastic systems, transportation, and logistics.

Xi-Ren Cao received the M.S. and Ph.D. degrees from Harvard University in 1981 and 1984, respectively, where he was a research fellow from 1984 to 1986. He then worked as a principal and consultant engineer/engineering manager at Digital Equipment Corporation, U.S.A., until October 1993. Since then, he has been a Professor at the Hong Kong University of Science and Technology (HKUST), Hong Kong, China. He is the director of the Center for Networking at HKUST. He has held visiting positions at Harvard University, University of Massachusetts at Amherst, AT&T Labs, University of Maryland at College Park, University of Notre Dame, Tsinghua University, University of Science and Technology of China, and other universities.

Dr. Cao owns three patents in data- and tele-communications and has published two books in the area of discrete event dynamic systems. He received the Outstanding Transactions Paper Award from the IEEE Control Systems Society in 1987 and the Outstanding Publication Award from the Institute of Management Sciences in 1990. He is a Fellow of the IEEE, Chairman of the Fellow Evaluation Committee of the IEEE Control Systems Society, Associate Editor at Large of the IEEE Transactions on Automatic Control, and Editor-in-Chief of Discrete Event Dynamic Systems: Theory and Applications; he is/was on the Board of Governors of the IEEE Control Systems Society, an associate editor of a number of international journals, and chairman of a few technical committees of international professional societies. His current research areas include discrete event dynamic systems, stochastic learning and optimization, performance analysis of communication systems, and signal processing.

This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Ioannis Paschalidis under the direction of Editor Ian Petersen.

1. The research was partially supported by a grant from the Hong Kong RGC.
