A policy improvement method for constrained average Markov decision processes

https://doi.org/10.1016/j.orl.2006.09.003

Abstract

This brief paper presents a policy improvement method for constrained Markov decision processes (MDPs) with the average cost criterion under an ergodicity assumption, extending Howard's policy improvement for MDPs. The improvement method induces a policy-iteration-type algorithm that converges to a locally optimal policy.

Introduction

Consider a Markov decision process (MDP) [5], [10] $M=(X,A,P,C)$ with a finite state set $X$, a finite admissible action set $A(x)$, $x\in X$, a nonnegative cost function $C:X\times A(X)\to\mathbb{R}^{+}$, and a transition function $P$ that maps $\{(x,a)\,|\,x\in X,\ a\in A(x)\}$ to the set of probability distributions over $X$. We denote the probability of making a transition to state $y\in X$ when taking action $a\in A(x)$ at state $x\in X$ by $P^{a}_{xy}$. We impose the following ergodicity assumption on $P$:

Assumption 1.1

Define $K\equiv\{(x,a)\,|\,x\in X,\ a\in A(x)\}$ and $P(y|k)\equiv P^{a}_{xy}$ for all $k=(x,a)\in K$, $y\in X$. There exists a positive number $\alpha<1$ such that
$$\max_{k,k'\in K}\sum_{y\in X}\bigl|P(y|k)-P(y|k')\bigr|\le 2\alpha.$$
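To make the model concrete, here is a minimal Python sketch (not from the paper; the two-state MDP, action names, and probabilities are illustrative assumptions) that stores a finite MDP in dictionaries and computes the ergodicity coefficient appearing in Assumption 1.1.

```python
# Toy finite MDP (illustrative numbers only) and a check of Assumption 1.1.
import itertools

X = [0, 1]                      # finite state set
A = {0: ['a', 'b'], 1: ['a']}   # admissible actions A(x)

# P[(x, a)][y] = probability of moving to state y when taking action a in state x
P = {
    (0, 'a'): [0.6, 0.4],
    (0, 'b'): [0.2, 0.8],
    (1, 'a'): [0.5, 0.5],
}

def ergodicity_coefficient(P, X):
    """max_{k,k' in K} sum_y |P(y|k) - P(y|k')|; Assumption 1.1 asks that this
    be at most 2*alpha for some positive alpha < 1, i.e., strictly less than 2."""
    K = list(P.keys())
    return max(
        sum(abs(P[k][y] - P[kp][y]) for y in X)
        for k, kp in itertools.product(K, K)
    )

coeff = ergodicity_coefficient(P, X)
print("2*alpha can be taken as", coeff,
      "-> Assumption 1.1 holds" if coeff < 2 else "-> Assumption 1.1 fails")
```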

Let $\Pi$ be the set of all Markovian (history-independent) stationary deterministic policies $\pi:X\to A(X)$. Define the objective function value of a policy $\pi\in\Pi$ with an initial state $x\in X$:
$$V^{\pi}(x)=\lim_{H\to\infty}\frac{1}{H}\,E\!\left[\sum_{t=0}^{H-1}C(X_t,\pi(X_t))\,\Big|\,X_0=x\right],\quad x\in X,$$
where $X_t$ is a random variable denoting the state at time $t$. The MDP $M$ is associated with a constant $\kappa\in\mathbb{R}$ and a constraint cost function $D:X\times A(X)\to\mathbb{R}^{+}$, where each feasible policy $\varphi\in\Pi$ needs to satisfy the constraint inequality $J^{\varphi}(x)\le\kappa$, $x\in X$, where the constraint function value of $\varphi$ is defined such that
$$J^{\varphi}(x)=\lim_{H\to\infty}\frac{1}{H}\,E\!\left[\sum_{t=0}^{H-1}D(X_t,\varphi(X_t))\,\Big|\,X_0=x\right],\quad x\in X.$$
The ergodicity assumption implies that for any policy $\pi\in\Pi$, $J^{\pi}(x)$ is independent of the starting state $x$ [5, Lemma 3.3 (b.ii)], so we write $J^{\pi}(x)$ as a constant $J^{\pi}$, omitting $x$ (and similarly for $V^{\pi}$). Throughout the paper, we further assume that the following feasibility condition holds:
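As a complement to these definitions, the following sketch (again with an assumed toy MDP; `average_values` is a hypothetical helper, not from the paper) evaluates $V^{\pi}$ and $J^{\pi}$ for a stationary deterministic policy via the stationary distribution of the induced chain, which under Assumption 1.1 coincides with the limiting time averages above.

```python
# Evaluate V^pi and J^pi for a stationary deterministic policy (toy data).
import numpy as np

X = [0, 1]
P = {(0, 'a'): [0.6, 0.4], (0, 'b'): [0.2, 0.8], (1, 'a'): [0.5, 0.5]}
C = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5}   # cost C(x, a)
D = {(0, 'a'): 0.0, (0, 'b'): 1.0, (1, 'a'): 1.0}   # constraint cost D(x, a)

def average_values(pi):
    """Return (V^pi, J^pi) for a deterministic stationary policy pi: x -> a."""
    n = len(X)
    P_pi = np.array([P[(x, pi[x])] for x in X])       # chain induced by pi
    # Stationary distribution mu: mu @ P_pi = mu, sum(mu) = 1.
    A_mat = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
    V = sum(mu[x] * C[(x, pi[x])] for x in X)
    J = sum(mu[x] * D[(x, pi[x])] for x in X)
    return V, J

pi = {0: 'b', 1: 'a'}
print(average_values(pi))   # e.g., check feasibility J^pi <= kappa for a given kappa
```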

Assumption 1.2

$\kappa\in\bigl[\min_{\pi\in\Pi}J^{\pi},\ \infty\bigr)$.

This paper considers the problem of designing a policy $\tilde\varphi$ that improves a given feasible base policy $\varphi$, i.e., $V^{\tilde\varphi}\le V^{\varphi}$ while $J^{\tilde\varphi}\le\kappa$. For the unconstrained case, Howard provided an improvement method [10] that induces the well-known policy iteration algorithm for solving MDPs. Based on his proof idea for unichain MDPs, we give a simple proof of the policy improvement method presented below for the constrained case.

Even though there is an extensive body of work on solving and analyzing constrained MDPs via the "direct method," linear programming, the Lagrangian approach, and the Pareto approach (see [6] for a discussion and the references therein), to the author's best knowledge there are few works on the policy improvement approach. Recently, Chang [1] studied policy improvement methods for constrained finite-horizon MDPs and infinite-horizon discounted MDPs that induce locally optimal policy-iteration algorithms. This paper is the counterpart for the average cost case.


Single-policy improvement

It is well known that under Assumption 1.1 any policy $\pi\in\Pi$ satisfies the Poisson equation (see, e.g., [5], [7], [9] for related discussions): there exists a bounded function $h_D^{\pi}$ defined over $X$ such that
$$J^{\pi}+h_D^{\pi}(x)=D(x,\pi(x))+\sum_{y\in X}P^{\pi(x)}_{xy}h_D^{\pi}(y),\quad x\in X,$$
and $h_C^{\pi}$ defined over $X$ such that
$$V^{\pi}+h_C^{\pi}(x)=C(x,\pi(x))+\sum_{y\in X}P^{\pi(x)}_{xy}h_C^{\pi}(y),\quad x\in X.$$
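For concreteness, here is a minimal sketch of solving the Poisson equation for a fixed policy on an assumed toy model (helper names such as `poisson_solve` are illustrative, not from the paper): the pair $(J^{\pi},h^{\pi})$ is obtained from a linear system with the usual normalization $h^{\pi}(x_{\mathrm{ref}})=0$.

```python
# Solve J^pi + h(x) = c(x, pi(x)) + sum_y P_{xy}^{pi(x)} h(y) for a fixed policy pi,
# with the normalization h(ref) = 0 (toy data, illustrative names).
import numpy as np

X = [0, 1]
P = {(0, 'a'): [0.6, 0.4], (0, 'b'): [0.2, 0.8], (1, 'a'): [0.5, 0.5]}
C = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5}

def poisson_solve(pi, cost, ref=0):
    """Return (J^pi, h^pi) for one-stage cost `cost` (use C for h_C, D for h_D)."""
    n = len(X)
    P_pi = np.array([P[(x, pi[x])] for x in X])
    c_pi = np.array([cost[(x, pi[x])] for x in X])
    # Unknowns z = (J, h(0), ..., h(n-1)).
    M = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    for x in X:
        M[x, 0] = 1.0                       # coefficient of J
        M[x, 1:] = np.eye(n)[x] - P_pi[x]   # h(x) - sum_y P_{xy} h(y)
        b[x] = c_pi[x]
    M[n, 1 + ref] = 1.0                     # normalization h(ref) = 0
    z = np.linalg.solve(M, b)
    return z[0], z[1:]

J, h = poisson_solve({0: 'b', 1: 'a'}, C)
print(J, h)
```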

Given $\Theta\ge 0$, define the set $\tilde F(x)$ as
$$\tilde F(x)=\Bigl\{u\ \Big|\ u\in A(x),\ D(x,u)+\sum_{y\in X}P^{u}_{xy}h_D^{\varphi}(y)\le J^{\varphi}+h_D^{\varphi}(x)+\Theta\Bigr\},\quad x\in X,$$
and a policy $\tilde\varphi\in\Pi$ as
$$\tilde\varphi(x)\in\arg\min_{a\in\tilde F(x)}\Bigl\{C(x,a)+\sum_{y\in X}P^{a}_{xy}h_C^{\varphi}(y)\Bigr\},\quad x\in X.$$
Note that for all $x\in X$, $\tilde F(x)\ne\emptyset$, since $\varphi(x)\in\tilde F(x)$ by the Poisson equation for $\varphi$.
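The construction of $\tilde F(x)$ and $\tilde\varphi$ can be sketched as follows (toy data and helper names are assumptions; the choice of $\Theta$ that guarantees feasibility with respect to $\kappa$ is the subject of the paper's analysis and is not reproduced here).

```python
# One single-policy improvement step: build F_tilde(x) from (J^phi, h_D^phi) and a
# slack Theta >= 0, then pick the C-minimizing action within F_tilde(x) (toy data).
import numpy as np

X = [0, 1]
A_x = {0: ['a', 'b'], 1: ['a']}
P = {(0, 'a'): [0.6, 0.4], (0, 'b'): [0.2, 0.8], (1, 'a'): [0.5, 0.5]}
C = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5}
D = {(0, 'a'): 0.0, (0, 'b'): 1.0, (1, 'a'): 1.0}

def poisson_solve(pi, cost, ref=0):
    """(J^pi, h^pi) for one-stage cost `cost`, normalized by h(ref) = 0."""
    n = len(X)
    P_pi = np.array([P[(x, pi[x])] for x in X])
    c_pi = np.array([cost[(x, pi[x])] for x in X])
    M = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    for x in X:
        M[x, 0] = 1.0
        M[x, 1:] = np.eye(n)[x] - P_pi[x]
        b[x] = c_pi[x]
    M[n, 1 + ref] = 1.0
    z = np.linalg.solve(M, b)
    return z[0], z[1:]

def improve(phi, theta=0.0):
    """One improvement step applied to the base policy phi."""
    J_phi, h_D = poisson_solve(phi, D)
    _,     h_C = poisson_solve(phi, C)
    phi_tilde = {}
    for x in X:
        # F_tilde(x): actions whose D-based one-step lookahead stays within
        # J^phi + h_D(x) + theta (small tolerance for floating-point roundoff).
        F = [u for u in A_x[x]
             if D[(x, u)] + np.dot(P[(x, u)], h_D) <= J_phi + h_D[x] + theta + 1e-12]
        # phi(x) is always in F by the Poisson equation, so F is nonempty.
        phi_tilde[x] = min(F, key=lambda a: C[(x, a)] + np.dot(P[(x, a)], h_C))
    return phi_tilde

print(improve({0: 'b', 1: 'a'}, theta=0.0))
```

With $\Theta=0$ the defining inequality of $\tilde F(x)$ is the standard policy-improvement inequality for the $D$-cost, so the constraint value does not increase; a larger $\Theta$ enlarges $\tilde F(x)$ and hence gives more room to decrease the $C$-cost.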

Acknowledgment

This work was supported by the Ministry of Commerce, Industry and Energy under the 21st Century Frontier Program: Intelligent Robot Project.

References

  • H.S. Chang et al., Approximate receding horizon approach for Markov decision processes: average reward case, J. Math. Anal. Appl. (2003)
  • H.S. Chang, A policy improvement method in constrained stochastic dynamic programming, IEEE Trans. Automat. Control (2006)
  • H.S. Chang et al., Parallel rollout for on-line solution of partially observable Markov decision processes, Discrete Event Dynamic Systems: Theory Appl. (2004)
  • E.V. Denardo, Computing a bias optimal policy in a discrete-time Markov decision problem, Oper. Res. (1970)
  • O. Hernandez-Lerma, Adaptive Markov Control Processes (1989)