A policy improvement method for constrained average Markov decision processes
Introduction
Consider a Markov decision process (MDP) [5], [10] with a finite state set $X$, a finite admissible action set $A(x)$ for each $x \in X$, a nonnegative cost function $C : X \times A \to \mathbb{R}^+$, where $A = \bigcup_{x \in X} A(x)$, and a transition function $P$ that maps $\{(x,a) : x \in X, a \in A(x)\}$ to the set of probability distributions over $X$. We denote the probability of making a transition to state $y \in X$ when taking action $a \in A(x)$ at state $x \in X$ by $P(x,a)(y)$. We impose the following ergodicity assumption on $P$:

Assumption 1.1 Define $\Gamma = \{(x,a) : x \in X, a \in A(x)\}$ and $\gamma(y) = \min_{(x,a) \in \Gamma} P(x,a)(y)$ for all $y \in X$. There exists a positive number $\beta$ such that
$$\sum_{y \in X} \gamma(y) \geq \beta.$$
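To make the assumption concrete, here is a minimal numerical check of the ergodicity coefficient $\beta$ for a small MDP. The transition array and its layout (`P[x, a, y]` $= P(x,a)(y)$) are illustrative assumptions of this sketch, not from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; P[x, a, y] = P(x,a)(y).
P = np.array([
    [[0.7, 0.3], [0.4, 0.6]],  # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.2, 0.8]],  # transitions from state 1 under actions 0, 1
])

# gamma(y) = min over all (x, a) of P(x,a)(y); Assumption 1.1 requires
# that these minima sum to some positive beta.
gamma = P.min(axis=(0, 1))
beta = gamma.sum()
print(f"gamma = {gamma}, beta = {beta:.2f}")  # here beta = 0.5 > 0
assert beta > 0
```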
Let $\Pi$ be the set of all Markovian (history-independent) stationary deterministic policies $\pi : X \to A$ with $\pi(x) \in A(x)$ for all $x \in X$. Define the objective function value of a policy $\pi \in \Pi$ with an initial state $x \in X$:
$$J(\pi, x) = \lim_{T \to \infty} \frac{1}{T}\, E\left[ \sum_{t=0}^{T-1} C(X_t, \pi(X_t)) \,\middle|\, X_0 = x \right],$$
where $X_t$ is a random variable denoting the state at time $t$. The MDP $M$ is associated with a constant $K \geq 0$ and a constraint cost function $D : X \times A \to \mathbb{R}^+$, where each feasible policy $\pi$ needs to satisfy the constraint inequality $J_D(\pi, x) \leq K$, where the constraint function value of $\pi$ is defined such that
$$J_D(\pi, x) = \lim_{T \to \infty} \frac{1}{T}\, E\left[ \sum_{t=0}^{T-1} D(X_t, \pi(X_t)) \,\middle|\, X_0 = x \right].$$
The ergodicity assumption implies that for any policy $\pi \in \Pi$, $J(\pi, x)$ is independent of the starting state $x$ [5, Lemma 3.3(b.ii)], from which we write $J(\pi)$ as a constant, omitting $x$ (similarly for $J_D(\pi)$). Throughout the paper, we further assume that the following feasibility condition holds:

Assumption 1.2 There exists $\pi \in \Pi$ such that $J_D(\pi) \leq K$.
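For intuition, the average cost of a fixed stationary deterministic policy can be computed from the stationary distribution of the Markov chain it induces. The following sketch does this with NumPy under the notation reconstructed above; the array layout and the function name are illustrative assumptions.

```python
import numpy as np

def average_cost(P, cost, policy):
    """Average cost J(pi) of a stationary deterministic policy pi.

    P[x, a, y] = P(x,a)(y); cost[x, a] = per-stage cost; policy[x] = pi(x).
    Under Assumption 1.1 the induced chain has a unique stationary
    distribution mu, and J(pi) = sum_x mu(x) cost(x, pi(x)), independent
    of the initial state.
    """
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]          # n x n transition matrix under pi
    # Solve mu @ P_pi = mu together with sum(mu) = 1 as one linear system.
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(mu @ cost[np.arange(n), policy])
```

With a constraint cost array `D` in the same layout, feasibility of a policy is then simply `average_cost(P, D, policy) <= K`.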
This paper considers the problem of designing a policy $\pi'$ that improves a given feasible base policy $\pi$, i.e., $J(\pi') \leq J(\pi)$, while satisfying $J_D(\pi') \leq K$. For the unconstrained case, Howard provided an improvement method [10] that induces the well-known policy iteration algorithm for solving MDPs. Based on his proof idea for unichain MDPs, we provide a simple proof of the policy improvement method presented below for the constrained case.
Even though there is an extensive list of works dealing with solving and analyzing constrained MDPs via the "direct method," linear programming, the Lagrangian approach, and the Pareto approach (see [6] for the discussion and the references therein), to the author's best knowledge, there are few works on the policy improvement approach. Recently, Chang [1] studied policy improvement methods in constrained finite-horizon MDPs and infinite-horizon discounted MDPs that induce locally optimal policy-iteration algorithms. This paper is a counterpart work for the average case.
Single-policy improvement
It is well known that any policy $\pi \in \Pi$ satisfies the Poisson equation under Assumption 1.1 (see, e.g., [5], [7], [9] for related discussions): there exists a bounded function $h_\pi$ defined over $X$ such that
$$J(\pi) + h_\pi(x) = C(x, \pi(x)) + \sum_{y \in X} P(x, \pi(x))(y)\, h_\pi(y), \quad x \in X,$$
and a bounded function $h^D_\pi$ defined over $X$ such that
$$J_D(\pi) + h^D_\pi(x) = D(x, \pi(x)) + \sum_{y \in X} P(x, \pi(x))(y)\, h^D_\pi(y), \quad x \in X.$$
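In the finite-state setting, $(J(\pi), h_\pi)$ can be obtained by solving a linear system: the $|X|$ Poisson equations plus one normalization pinning down the additive constant in $h_\pi$. A minimal sketch, reusing the array layout assumed in the earlier sketches:

```python
import numpy as np

def solve_poisson(P, cost, policy):
    """Solve J + h(x) = cost(x, pi(x)) + sum_y P(x,pi(x))(y) h(y) for all x.

    The solution h is unique up to an additive constant; we pin h(0) = 0.
    Returns (J, h). Layout as before: P[x, a, y], cost[x, a], policy[x].
    """
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]
    c_pi = cost[np.arange(n), policy]
    # Unknowns: (J, h(0), ..., h(n-1)). Equations: the n Poisson equations
    # J * 1 + (I - P_pi) h = c_pi, plus the normalization h(0) = 0.
    A = np.zeros((n + 1, n + 1))
    A[:n, 0] = 1.0                    # coefficient of J in each equation
    A[:n, 1:] = np.eye(n) - P_pi      # coefficients of h
    A[n, 1] = 1.0                     # normalization row: h(0) = 0
    b = np.concatenate([c_pi, [0.0]])
    sol = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(sol[0]), sol[1:]
```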
Given a feasible $\pi \in \Pi$ (i.e., $J_D(\pi) \leq K$), define the set $A_\pi(x) \subseteq A(x)$ as
$$A_\pi(x) = \Big\{ a \in A(x) : D(x,a) + \sum_{y \in X} P(x,a)(y)\, h^D_\pi(y) \leq D(x,\pi(x)) + \sum_{y \in X} P(x,\pi(x))(y)\, h^D_\pi(y) \Big\}$$
and a policy $\pi'$ as
$$\pi'(x) \in \operatorname*{arg\,min}_{a \in A_\pi(x)} \Big( C(x,a) + \sum_{y \in X} P(x,a)(y)\, h_\pi(y) \Big), \quad x \in X.$$
Note that for all $x \in X$, $\pi(x) \in A_\pi(x)$, so $A_\pi(x)$ is nonempty and $\pi'$ is well defined.
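Putting the pieces together, here is a sketch of the single improvement step as reconstructed above, reusing `solve_poisson` from the previous sketch. The numerical tolerance and the simplifying assumption that every action is admissible at every state are this sketch's, not the paper's.

```python
import numpy as np

def improve(P, C, D, policy):
    """One constrained improvement step on a feasible base policy pi.

    At each state, restrict attention to actions whose constraint Q-value
    does not exceed that of pi(x) (the set A_pi(x)), then pick an action
    minimizing the cost Q-value. Since pi(x) itself always qualifies, the
    candidate set is never empty and the new policy is well defined.
    """
    n = P.shape[0]
    J, h = solve_poisson(P, C, policy)        # cost Poisson equation
    J_D, h_D = solve_poisson(P, D, policy)    # constraint Poisson equation
    Q_C = C + P @ h                           # Q_C[x, a] = C(x,a) + sum_y P(x,a)(y) h(y)
    Q_D = D + P @ h_D
    new_policy = policy.copy()
    for x in range(n):
        ok = Q_D[x] <= Q_D[x, policy[x]] + 1e-12   # membership in A_pi(x)
        candidates = np.where(ok)[0]
        new_policy[x] = candidates[np.argmin(Q_C[x, candidates])]
    return new_policy
```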
Acknowledgment
This work was supported by the Ministry of Commerce, Industry and Energy under the 21st Century Frontier Program: Intelligent Robot Project.
References (10)

H.S. Chang, S.I. Marcus, Approximate receding horizon approach for Markov decision processes: average reward case, J. Math. Anal. Appl. (2003).
H.S. Chang, A policy improvement method in constrained stochastic dynamic programming, IEEE Trans. Automat. Control (2006).
H.S. Chang, R. Givan, E.K.P. Chong, Parallel rollout for on-line solution of partially observable Markov decision processes, Discrete Event Dynamic Systems: Theory Appl. (2004).
E.V. Denardo, Computing a bias optimal policy in a discrete-time Markov decision problem, Oper. Res. (1970).
O. Hernández-Lerma, Adaptive Markov Control Processes, Springer, New York (1989).