A policy improvement method for constrained average Markov decision processes

https://doi.org/10.1016/j.orl.2006.09.003

Abstract

This brief paper presents a policy improvement method for constrained Markov decision processes (MDPs) with the average cost criterion under an ergodicity assumption, extending Howard's policy improvement for MDPs. The improvement method induces a policy-iteration-type algorithm that converges to a locally optimal policy.

Introduction

Consider a Markov decision process (MDP) [5], [10] $M=(X,A,P,C)$ with a finite state set $X$, a finite admissible action set $A(x)$, $x\in X$, a nonnegative cost function $C:X\times A(X)\to\mathbb{R}^{+}$, and a transition function $P$ that maps $\{(x,a)\,|\,x\in X,\ a\in A(x)\}$ to the set of probability distributions over $X$. We denote the probability of making a transition to state $y\in X$ when taking action $a\in A(x)$ at state $x\in X$ by $P^{a}_{xy}$. We impose the following ergodicity assumption on $P$:

Assumption 1.1

Define $K\equiv\{(x,a)\,|\,x\in X,\ a\in A(x)\}$ and $P(y|k)\equiv P^{a}_{xy}$ for all $k=(x,a)\in K$, $y\in X$. There exists a positive number $\alpha<1$ such that
$$\max_{k,k'\in K}\sum_{y\in X}\bigl|P(y|k)-P(y|k')\bigr|\le 2\alpha.$$
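To make the model concrete, here is a minimal Python sketch (not from the paper; the two-state MDP, action names, and probabilities are illustrative assumptions) that stores a finite MDP in dictionaries and computes the ergodicity coefficient appearing in Assumption 1.1.

```python
# Toy finite MDP (illustrative numbers only) and a check of Assumption 1.1.
import itertools

X = [0, 1]                      # finite state set
A = {0: ['a', 'b'], 1: ['a']}   # admissible actions A(x)

# P[(x, a)][y] = probability of moving to state y when taking action a in state x
P = {
    (0, 'a'): [0.6, 0.4],
    (0, 'b'): [0.2, 0.8],
    (1, 'a'): [0.5, 0.5],
}

def ergodicity_coefficient(P, X):
    """max_{k,k' in K} sum_y |P(y|k) - P(y|k')|; Assumption 1.1 asks that this
    be at most 2*alpha for some positive alpha < 1, i.e., strictly less than 2."""
    K = list(P.keys())
    return max(
        sum(abs(P[k][y] - P[kp][y]) for y in X)
        for k, kp in itertools.product(K, K)
    )

coeff = ergodicity_coefficient(P, X)
print("2*alpha can be taken as", coeff,
      "-> Assumption 1.1 holds" if coeff < 2 else "-> Assumption 1.1 fails")
```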

Let $\Pi$ be the set of all Markovian (history-independent) stationary deterministic policies $\pi:X\to A(X)$. Define the objective function value of a policy $\pi\in\Pi$ with an initial state $x\in X$:
$$V^{\pi}(x)=\lim_{H\to\infty}\frac{1}{H}\,E\!\left[\sum_{t=0}^{H-1}C(X_t,\pi(X_t))\,\Big|\,X_0=x\right],\quad x\in X,$$
where $X_t$ is a random variable denoting the state at time $t$. The MDP $M$ is associated with a constant $\kappa\in\mathbb{R}$ and a constraint cost function $D:X\times A(X)\to\mathbb{R}^{+}$, where each feasible policy $\varphi\in\Pi$ needs to satisfy the constraint inequality $J^{\varphi}(x)\le\kappa$, $x\in X$, where the constraint function value of $\varphi$ is defined such that
$$J^{\varphi}(x)=\lim_{H\to\infty}\frac{1}{H}\,E\!\left[\sum_{t=0}^{H-1}D(X_t,\varphi(X_t))\,\Big|\,X_0=x\right],\quad x\in X.$$
The ergodicity assumption implies that for any policy $\pi\in\Pi$, $J^{\pi}(x)$ is independent of the starting state $x$ [5, Lemma 3.3 (b.ii)], so we write $J^{\pi}(x)$ as a constant $J^{\pi}$, omitting $x$ (and similarly for $V^{\pi}$). Throughout the paper, we further assume that the following feasibility condition holds:
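As a complement to these definitions, the following sketch (again with an assumed toy MDP; `average_values` is a hypothetical helper, not from the paper) evaluates $V^{\pi}$ and $J^{\pi}$ for a stationary deterministic policy via the stationary distribution of the induced chain, which under Assumption 1.1 coincides with the limiting time averages above.

```python
# Evaluate V^pi and J^pi for a stationary deterministic policy (toy data).
import numpy as np

X = [0, 1]
P = {(0, 'a'): [0.6, 0.4], (0, 'b'): [0.2, 0.8], (1, 'a'): [0.5, 0.5]}
C = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5}   # cost C(x, a)
D = {(0, 'a'): 0.0, (0, 'b'): 1.0, (1, 'a'): 1.0}   # constraint cost D(x, a)

def average_values(pi):
    """Return (V^pi, J^pi) for a deterministic stationary policy pi: x -> a."""
    n = len(X)
    P_pi = np.array([P[(x, pi[x])] for x in X])       # chain induced by pi
    # Stationary distribution mu: mu @ P_pi = mu, sum(mu) = 1.
    A_mat = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
    V = sum(mu[x] * C[(x, pi[x])] for x in X)
    J = sum(mu[x] * D[(x, pi[x])] for x in X)
    return V, J

pi = {0: 'b', 1: 'a'}
print(average_values(pi))   # e.g., check feasibility J^pi <= kappa for a given kappa
```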

Assumption 1.2

$\kappa\in\bigl[\min_{\pi\in\Pi}J^{\pi},\ \infty\bigr)$.

This paper considers the problem of designing a policy $\tilde\varphi$ that improves a given feasible base policy $\varphi$, i.e., $V^{\tilde\varphi}\le V^{\varphi}$ while $J^{\tilde\varphi}\le\kappa$. For the unconstrained case, Howard provided an improvement method [10] that induces the well-known policy iteration algorithm for solving MDPs. Based on his proof idea for unichain MDPs, we give a simple proof of the policy improvement method presented below for the constrained case.

Even though there is an extensive body of work on solving and analyzing constrained MDPs via the "direct method," linear programming, the Lagrangian approach, and the Pareto approach (see [6] for a discussion and the references therein), to the author's best knowledge there are few works on the policy improvement approach. Recently, Chang [1] studied policy improvement methods for constrained finite-horizon MDPs and infinite-horizon discounted MDPs that induce locally optimal policy-iteration algorithms. This paper is the counterpart for the average cost case.


Single-policy improvement

It is well known that under Assumption 1.1 any policy $\pi\in\Pi$ satisfies the Poisson equation (see, e.g., [5], [7], [9] for related discussions): there exists a bounded function $h_D^{\pi}$ defined over $X$ such that
$$J^{\pi}+h_D^{\pi}(x)=D(x,\pi(x))+\sum_{y\in X}P^{\pi(x)}_{xy}h_D^{\pi}(y),\quad x\in X,$$
and $h_C^{\pi}$ defined over $X$ such that
$$V^{\pi}+h_C^{\pi}(x)=C(x,\pi(x))+\sum_{y\in X}P^{\pi(x)}_{xy}h_C^{\pi}(y),\quad x\in X.$$
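For concreteness, here is a minimal sketch of solving the Poisson equation for a fixed policy on an assumed toy model (helper names such as `poisson_solve` are illustrative, not from the paper): the pair $(J^{\pi},h^{\pi})$ is obtained from a linear system with the usual normalization $h^{\pi}(x_{\mathrm{ref}})=0$.

```python
# Solve J^pi + h(x) = c(x, pi(x)) + sum_y P_{xy}^{pi(x)} h(y) for a fixed policy pi,
# with the normalization h(ref) = 0 (toy data, illustrative names).
import numpy as np

X = [0, 1]
P = {(0, 'a'): [0.6, 0.4], (0, 'b'): [0.2, 0.8], (1, 'a'): [0.5, 0.5]}
C = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5}

def poisson_solve(pi, cost, ref=0):
    """Return (J^pi, h^pi) for one-stage cost `cost` (use C for h_C, D for h_D)."""
    n = len(X)
    P_pi = np.array([P[(x, pi[x])] for x in X])
    c_pi = np.array([cost[(x, pi[x])] for x in X])
    # Unknowns z = (J, h(0), ..., h(n-1)).
    M = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    for x in X:
        M[x, 0] = 1.0                       # coefficient of J
        M[x, 1:] = np.eye(n)[x] - P_pi[x]   # h(x) - sum_y P_{xy} h(y)
        b[x] = c_pi[x]
    M[n, 1 + ref] = 1.0                     # normalization h(ref) = 0
    z = np.linalg.solve(M, b)
    return z[0], z[1:]

J, h = poisson_solve({0: 'b', 1: 'a'}, C)
print(J, h)
```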

Given $\Theta\ge 0$, define the set $\tilde F(x)$ as
$$\tilde F(x)=\Bigl\{u\ \Big|\ u\in A(x),\ D(x,u)+\sum_{y\in X}P^{u}_{xy}h_D^{\varphi}(y)\le J^{\varphi}+h_D^{\varphi}(x)+\Theta\Bigr\},\quad x\in X,$$
and a policy $\tilde\varphi\in\Pi$ as
$$\tilde\varphi(x)\in\arg\min_{a\in\tilde F(x)}\Bigl\{C(x,a)+\sum_{y\in X}P^{a}_{xy}h_C^{\varphi}(y)\Bigr\},\quad x\in X.$$
Note that for all $x\in X$, $\tilde F(x)\ne\emptyset$, since $\varphi(x)\in\tilde F(x)$ by the Poisson equation for $\varphi$.
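The construction of $\tilde F(x)$ and $\tilde\varphi$ can be sketched as follows (toy data and helper names are assumptions; the choice of $\Theta$ that guarantees feasibility with respect to $\kappa$ is the subject of the paper's analysis and is not reproduced here).

```python
# One single-policy improvement step: build F_tilde(x) from (J^phi, h_D^phi) and a
# slack Theta >= 0, then pick the C-minimizing action within F_tilde(x) (toy data).
import numpy as np

X = [0, 1]
A_x = {0: ['a', 'b'], 1: ['a']}
P = {(0, 'a'): [0.6, 0.4], (0, 'b'): [0.2, 0.8], (1, 'a'): [0.5, 0.5]}
C = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5}
D = {(0, 'a'): 0.0, (0, 'b'): 1.0, (1, 'a'): 1.0}

def poisson_solve(pi, cost, ref=0):
    """(J^pi, h^pi) for one-stage cost `cost`, normalized by h(ref) = 0."""
    n = len(X)
    P_pi = np.array([P[(x, pi[x])] for x in X])
    c_pi = np.array([cost[(x, pi[x])] for x in X])
    M = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    for x in X:
        M[x, 0] = 1.0
        M[x, 1:] = np.eye(n)[x] - P_pi[x]
        b[x] = c_pi[x]
    M[n, 1 + ref] = 1.0
    z = np.linalg.solve(M, b)
    return z[0], z[1:]

def improve(phi, theta=0.0):
    """One improvement step applied to the base policy phi."""
    J_phi, h_D = poisson_solve(phi, D)
    _,     h_C = poisson_solve(phi, C)
    phi_tilde = {}
    for x in X:
        # F_tilde(x): actions whose D-based one-step lookahead stays within
        # J^phi + h_D(x) + theta (small tolerance for floating-point roundoff).
        F = [u for u in A_x[x]
             if D[(x, u)] + np.dot(P[(x, u)], h_D) <= J_phi + h_D[x] + theta + 1e-12]
        # phi(x) is always in F by the Poisson equation, so F is nonempty.
        phi_tilde[x] = min(F, key=lambda a: C[(x, a)] + np.dot(P[(x, a)], h_C))
    return phi_tilde

print(improve({0: 'b', 1: 'a'}, theta=0.0))
```

With $\Theta=0$ the defining inequality of $\tilde F(x)$ is the standard policy-improvement inequality for the $D$-cost, so the constraint value does not increase; a larger $\Theta$ enlarges $\tilde F(x)$ and hence gives more room to decrease the $C$-cost.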

Acknowledgment

This work was supported by the Ministry of Commerce, Industry and Energy under the 21st Century Frontier Program: Intelligent Robot Project.

References

  • H.S. Chang et al., Approximate receding horizon approach for Markov decision processes: average reward case, J. Math. Anal. Appl. (2003)
  • H.S. Chang, A policy improvement method in constrained stochastic dynamic programming, IEEE Trans. Automat. Control (2006)
  • H.S. Chang et al., Parallel rollout for on-line solution of partially observable Markov decision processes, Discrete Event Dynamic Systems: Theory Appl. (2004)
  • E.V. Denardo, Computing a bias optimal policy in a discrete-time Markov decision problem, Oper. Res. (1970)
  • O. Hernandez-Lerma, Adaptive Markov Control Processes (1989)