Discounted continuous-time Markov decision processes with unbounded rates and randomized history-dependent policies: the dynamic programming approach

Piunovskiy, Alexey; Zhang, Yi

doi:10.1007/s10288-013-0236-1

Discounted continuous-time Markov decision processes with unbounded rates and randomized history-dependent policies: the dynamic programming approach

Research paper
Published: 31 March 2013

Volume 12, pages 49–75, (2014)
Cite this article

4OR Aims and scope Submit manuscript

Alexey Piunovskiy¹ &
Yi Zhang¹

467 Accesses
15 Citations
Explore all metrics

Abstract

This paper deals with a continuous-time Markov decision process in Borel state and action spaces and with unbounded transition rates. Under history-dependent policies, the controlled process may not be Markov. The main contribution is that for such non-Markov processes we establish the Dynkin formula, which plays important roles in establishing optimality results for continuous-time Markov decision processes. We further illustrate this by showing, for a discounted continuous-time Markov decision process, the existence of a deterministic stationary optimal policy (out of the class of history-dependent policies) and characterizing the value function through the Bellman equation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Existence and Uniqueness of Quasi-stationary Distributions for Symmetric Markov Processes with Tightness Property

Article 17 January 2019

A Crash Course in Differential Games and Applications

Article Open access 25 March 2024

On some mean field games and master equations through the lens of conservation laws

Article 16 April 2024

Notes

We do not explicitly indicate the topology of the a Borel space.
Below by measurable we always mean Borel-measurable.
It can be easily verified that $\forall ~(x,a)\in K,$ $\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) $ is a probability measure on $(S,{\fancyscript{B}}(S))$.

References

Bertsekas D, Shreve S (1978) Stochastic optimal control. Academic Press, NY
Google Scholar
Feinberg E (2004) Continuous time discounted jump Markov decision processes: a discrete-event approach. Math Oper Res 29:492–524
Article Google Scholar
Feinberg E (2012) Reduction of discounted continuous-time MDPs with unbounded jump and reward rates to discrete-time total-reward MDPs. In: Hernández-Hernández D, Minjarez-Sosa A (eds) Optimization, control, and application of stochastic systems, Birkhauser, Basel, pp 77–97
Guo X (2007) Continuous-time Markov decision processes with discounted rewards: the case of Polish spaces. Math Oper Res 32:73–87
Article Google Scholar
Guo X, Hernández-Lerma O (2009) Continuous-time Markov decision processes: theory and applications. Springer, Heidelberg
Book Google Scholar
Guo X, Piunovskiy A (2011) Discounted continuous-time Markov decision processes with constraints: unbounded transition and loss rates. Math Oper Res 36:105–132
Article Google Scholar
Guo X, Song X (2011) Discounted continuous-time constrained Markov decision processes in Polish spaces. Ann Appl Probab 21:2016–2049
Article Google Scholar
Guo X, Zhu W (2002) Denumerable-state continuous-time Markov decision processes with unbounded transition and reward rates under the discounted criterion. J Appl Probab 39:233–250
Article Google Scholar
Guo X, Hernández-Lerma O, Prieto-Rumeau T (2006) A survey of recent results on continuous-time Markov decision processes. Top 14:177–257
Article Google Scholar
Guo X, Huang Y, Song X (2012) Linear programming and constrained average optimality for general continuous-time Markov decision processes in history-dependent policies. SIAM J Control Optim 50:23–47
Article Google Scholar
Guo X, Vykertas M, Zhang Y (2013) Absorbing continuous-time Markov decision processes with total cost criteria. Adv Appl Probab 45 (to appear)
Hernández-Lerma O, Lasserre JB (1996) Discrete-time Markov control processes. Springer, NY
Book Google Scholar
Hernández-Lerma O, Lasserre JB (1999) Further topics on discrete-time Markov control processes. Springer, NY
Book Google Scholar
Jacod J (1975) Multivariate point processes: predictable projection, Radon–Nykodym derivatives, representation of martingales. Z Wahrscheinlichkeitstheorie verw Gebite 31:235–253
Article Google Scholar
Kakumanu P (1971) Continuously discounted Markov decision models with countable state and action spaces. Ann Math Statist 42:919–926
Article Google Scholar
Kitaev M (1986) Semi-Markov and jump Markov controlled models: average cost criterion. Theory Probab Appl 30:272–288
Article Google Scholar
Kitaev M, Rykov V (1995) Controlled queueing systems. CRC Press, Boca Raton
Google Scholar
Piunovskiy A (1998) A controlled jump discounted model with constraints. Theory Probab Appl 42:51–71
Article Google Scholar
Piunovskiy A, Zhang Y (2011a) Accuracy of fluid approximations to controlled birth-and-death processes: absorbing Case. Math Meth Oper Res 73:159–187
Article Google Scholar
Piunovskiy A, Zhang Y (2011b) Discounted continuous-time Markov decision processes with unbounded rates: the convex analytic approach. SIAM J Control Optim 49:2032–2061
Article Google Scholar
Piunovskiy A, Zhang Y (2011c) Discounted continuous-time Markov decision processes with unbounded rates: the dynamic programming approach. http://arxiv.org/abs/1103.0134
Piunovskiy A, Zhang Y (2012) The transformation method for continuous-time Markov decision processes. J Optim Theory Appl 154:691–712
Article Google Scholar
Prieto-Rumeau T, Hernández-Lerma O (2012) Selected topics in continuous-time controlled Markov chains and Markov games. Imperial College Press, London
Google Scholar
Puterman M (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
Yan H, Zhang J, Guo X (2003) Continuous-time Markov decision processes with unbounded transition and discounted-reward rates. Stoch Anal Appl 26:209–231
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Mathematical Sciences, University of Liverpool, Liverpool, L69 7ZL, UK
Alexey Piunovskiy & Yi Zhang

Authors

Alexey Piunovskiy
View author publications
You can also search for this author in PubMed Google Scholar
Yi Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yi Zhang.

Appendix

In this appendix, we establish some lemmas, and prove the main statements.

Proof of Theorem 2

Step 1. We prove that Eq. (9) holds for $r(x):= u(x)I\{x\in S_l\}$, where $S_l$ is defined in Condition 1. We obviously have

$$\begin{aligned}&\int \limits _Sw^{\prime }(y)E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v)q(dy{\setminus } \{\xi _v\}|\xi _v,a)dv\right] \nonumber \\&\quad =E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v)\int \limits _Sw^{\prime }(y)q(dy{\setminus } \{\xi _v\}|\xi _v,a)dv\right] \nonumber \\&\quad =E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v) \int \limits _Sw^{\prime }(y)\left\{ q(dy|\xi _v,a)-q(\{\xi _v\}|\xi _v,a)I\{\xi _v\in dy\}\right\} dv\right] \nonumber \\&\quad <\infty . \end{aligned}$$

(16)

Indeed, by Condition 3(a, b) and Theorem 1(a),

$$\begin{aligned}&E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v) \int \limits _Sw^{\prime }(y)q(dy|\xi _v,a)dv\right] \nonumber \\&\quad \le E_x^\pi \left[ \int \limits _0^t\int \limits _A \pi (da|\omega ,v)(\rho ^{\prime }w^{\prime }(\xi _v)+b^{\prime })dv\right] \\&\quad \le L^{\prime }\rho ^{\prime }\int \limits _0^tE_x^\pi \left[ w(\xi _v)\right] dv+b^{\prime }t<\infty , \end{aligned}$$

and

$$\begin{aligned}&E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v)w^{\prime }(\xi _v)|q(\{\xi _v\} |\xi _v,a)|dv\right] \nonumber \\&\quad =E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v)w^{\prime }(\xi _v)q_{\xi _v} (a)dv\right] \nonumber \\&\quad \le L^{\prime }\int \limits _0^tE_x^\pi \left[ w(\xi _v)\right] dv<\infty . \end{aligned}$$

(17)

It then follows from the previous calculations that

$$\begin{aligned}&\int \limits _S|r(y)|E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v) q(dy{\setminus }\{\xi _v\}|\xi _v,a)dv\right] \\&\quad \le ||r||_{w^{\prime }}\int \limits _Sw^{\prime }(y)E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v)q(dy{\setminus } \{\xi _v\}|\xi _v,a)dv\right] <\infty , \end{aligned}$$

and

$$\begin{aligned} E_x^\pi \left[ \int \limits _0^t\int \limits _A\pi (da|\omega ,v)q_{\xi _v} (a)|r(\xi _v)|dv\right] <\infty . \end{aligned}$$

Now in order to establish equation (9) for $r(x)=u(x)I\{x\in S_l\}$, it only remains to integrate formally $r(\cdot )$ over $S$ with respect to $P_x^\pi (\xi _t\in \cdot )$ and use (5) in Theorem 1(b).

Step 2. We prove that Eq. (9) holds for any $u\in \mathbf{B}_{w^{\prime }}(S)$. By putting $S_{-1}:= \emptyset $ and observing $E_x^\pi \left[ \sum _{l=-1}^\infty |u(\xi _t)|I\{\xi _t\in S_{l+1}{\setminus } S_l\}\right] <\infty ,$ we have

$$\begin{aligned}&E_x^\pi \left[ u(\xi _t)\right] -u(x)\\&\quad =E_x^\pi \left[ \sum _{l=-1}^\infty u(\xi _t)I\{\xi _t\in S_{l+1}{\setminus } S_l\}\right] -\sum _{l=-1}^\infty u(x)I\{x\in S_{l+1}{\setminus } S_l\}\\&\quad =\sum _{l=-1}^{\infty }E_x^\pi \left[ u(\xi _t)I\{\xi _t\in S_{l+1}{\setminus } S_l\}\right] -\sum _{l=-1}^\infty u(x)I\{x\in S_{l+1}{\setminus } S_l\}\\&\quad =\sum _{l=-1}^{\infty }\left\{ E_x^\pi \left[ u(\xi _t)I\{\xi _t\in S_{l+1}{\setminus } S_l\}\right] - u(x)I\{x\in S_{l+1}{\setminus } S_l\}\right\} \\&\quad =\sum _{l=-1}^\infty \left\{ E_x^\pi \left[ \int \limits _0^t\int \limits _S\int \limits _A\pi (da|\omega ,v) q(dy|\xi _v,a)u(y)I\{y\in S_{l+1}{\setminus } S_l\}\right] \right\} \\&\quad =E_x^\pi \left[ \int \limits _0^t\int \limits _S\int \limits _A\pi (da|\omega ,v) q(dy|\xi _{v},a)u(y)dv\right] , \end{aligned}$$

where the second last equality follows from formally applying the result obtained in Step 1 of this proof, i.e., (9) holds for $r(x).$ The involved interchange of the order of integrations, summations and expectations is legal, as can be easily verified similarly to (16) and (17).

Step 3. We prove that Eq. (10) holds for any $u\in \mathbf{B}_{w^{\prime }}(S).$ In this proof we repeatedly apply (9) to $E_x^\pi [u(\xi _t)]\in \mathbf{B}_{w^{\prime }}(S).$ On the one hand, we have

$$\begin{aligned}&\text{ LHS } \text{ of } (10)\\&\quad =e^{-\alpha t}\left\{ u(x)+E_x^\pi \left[ \int \limits _0^t\int \limits _S\int \limits _A \pi (da|\omega ,v)q(dy|\xi _v,a)u(y)dv\right] \right\} -u(x)\\&\quad =e^{-\alpha t}E_x^\pi \left[ \int \limits _0^t\int \limits _S\int \limits _A \pi (da|\omega ,v)q(dy|\xi _v,a)u(y)dv\right] +u(x)(e^{-\alpha t}-1). \end{aligned}$$

On the other hand, we have the following two observations. Firstly,

$$\begin{aligned}&E_x^\pi \left[ \int \limits _0^te^{-\alpha v}(-\alpha u(\xi _v))dv\right] =-\alpha \int \limits _0^t e^{-\alpha v}E_x^\pi \left[ u(\xi _v)\right] dv\\&\quad =-\alpha \int \limits _0^te^{-\alpha v}\left\{ u(x)+E_x^\pi \left[ \int \limits _0^v\int \limits _S\int \limits _A\pi (da|\omega ,r) q(dy|\xi _r,a)u(y)dr\right] \right\} dv\\&\quad =(e^{-\alpha t}-1)u(x)-\alpha \int \limits _0^te^{-\alpha v}E_x^\pi \left[ \int \limits _0^v\int \limits _S\int \limits _A \pi (da|\omega ,r)q(dy|\xi _r,a)u(y)dr\right] dv\\&\quad \!=\!(e^{-\alpha t}\!-\!1)u(x)\!-\!\alpha E_x^\pi \left[ \int \limits _0^t\left\{ e^{-\alpha v}\int \limits _0^v\int \limits _S\int \limits _A\pi (da|\omega ,r)q(dy|\xi _r,a)u(y)dr\right\} dv\right] \end{aligned}$$

where the interchange of the order of integrals in the first and the last equalities is legal as Condition 3(a) and Theorem 1(c) imply that for each $u\in \mathbf{B}_{w^{\prime }}(S),$

$$\begin{aligned} E_x^\pi \left[ \int \limits _0^te^{-\alpha v}\alpha |u(\xi _v)|dv\right] <\infty \end{aligned}$$

and

$$\begin{aligned} \int \limits _0^te^{-\alpha v}E_x^\pi \left[ \int \limits _0^v\int \limits _S\int \limits _A \pi (da|\omega ,r)q(dy|\xi _r,a)|u|(y)dr\right] dv<\infty . \end{aligned}$$

Secondly, integration by parts results in

$$\begin{aligned}&E_x^\pi \left[ \int \limits _0^te^{-\alpha v}\int \limits _S\int \limits _A\pi (da|\omega ,v)q(dy|\xi _v,a)u(y)dv\right] \nonumber \\&\qquad =E_x^\pi \left[ e^{-\alpha t}\int \limits _0^t\int \limits _S\int \limits _A\pi (da|\omega ,r)q(dy|\xi _r,a)u(y)dr\right] \\&\qquad \quad +\,\alpha E_x^\pi \left[ \int \limits _0^t e^{-\alpha v}\int \limits _0^v\int \limits _S\int \limits _A\pi (da|\omega ,r)q(dy|\xi _r,a)u(y)dr~dv\right] \!. \end{aligned}$$

These two observations, together with the expression for the LHS of (10) obtained in the above, finally lead to

$$\begin{aligned} \text{ RHS } \text{ of } (10)&= E_x^\pi \left[ \int \limits _0^te^{-\alpha v}(-\alpha u(\xi _v))dv\right] \\&\quad +E_x^\pi \left[ \int \limits _0^te^{-\alpha v}\int \limits _S\int \limits _A\pi (da|\omega ,v)q(dy|\xi _v,a)u(y)dv\right] \\&= (e^{-\alpha t}-1)u(x)+E_x^\pi \left[ e^{-\alpha t}\int \limits _0^t\int \limits _S\int \limits _A\pi (da|\omega ,r)q(dy|\xi _r,a)u(y)dr\right] \\&= \text{ LHS } \text{ of } (10), \end{aligned}$$

as required. $\square $

Lemma 1

Suppose Condition 1(b) and Condition 4 are satisfied. Then for each $u\in \mathbf{B}_{w}(S),$ the function $v$ given by

$$\begin{aligned} v(x):= \inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S u(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \right\} \end{aligned}$$

is measurable in $x\in S$.

Proof

By Remark 3, Condition 1(b) and Condition 4, we refer to the proof of Hernández-Lerma and Lasserre (1999, Lem. 8.3.7(a)) for that $\forall ~u\in \mathbf{B}_w(S),x\in S,$ the function^{Footnote 3} $\int _S u(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) $ is lower semicontinuous in $a\in A(x).$ Indeed, the proof of Hernández-Lerma and Lasserre (1999, Lem. 8.3.7(a)) shows that $\forall ~u\in \mathbf{B}_w(S),$ the function $\int _S (u(y)+||u||_w w(y))\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) $ is lower semicontinuous in $a\in A(x),$ so that it remains to apply the facts that the sum of two lower semicontinuous functions is still lower semicontinuous, and the function $-||u||_w \int _Sw(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) $ is lower semicontinuous in $a\in A(x).$

It follows from the above and Condition 4(c) that $\forall ~ x\in S, u\in \mathbf{B}_{w}(S),$ the function

$$\begin{aligned} \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S u(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \end{aligned}$$

is lower semicontinuous in $a\in A(x).$ By Bertsekas and Shreve (1978, Prop. 7.29), $\forall ~u\in \mathbf{B}_{w}(S),$ the function

$$\begin{aligned} \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S u(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \end{aligned}$$

is measurable on $K$. Now it remains to apply Hernández-Lerma and Lasserre (1996, D.5 Prop.), see also Bertsekas and Shreve (1978, Prop. 7.33), for the statement of this lemma. $\square $

Remark 6

From the above proof, we incidentally obtain that for each $u\in \mathbf{B}_{w}(S),$ $\int _S u(y)q(dy|x,a)$ is lower semicontinuous in $a\in A(x).$

Proof of Theorem 3

Throughout this proof, $x\in S$ is arbitrarily fixed. Due to Lemma 1, functions $u^{(n)},n=0,1,2,\ldots $ are measurable. Now the proof goes in steps.

Step 1. We prove that $\{u^{(n)},n=0,1,\ldots \}$ is a non-increasing sequence.

Straightforward calculations result in

$$\begin{aligned} u^{(1)}(x)\!&= \!\inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}\!+\!\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x} \int \limits _S u^{(0)}(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \right\} \\ \!&= \!\inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}\right. \\&\quad \left. +\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S\left( \frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}+\frac{c}{\alpha }\right) \left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \right\} \end{aligned}$$

and thus

$$\begin{aligned}&u^{(1)}(x)\\&\quad \le \inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}\right\} +\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\\&\qquad \cdot \sup _{a\in A(x)}\left\{ \int \limits _S\left( \frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}+\frac{c}{\alpha }\right) \left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \right\} \\&\quad \le \frac{Mw(x)+c}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\left\{ \frac{bM}{\alpha (\alpha -\rho )} +\frac{M(\rho w(x)+b)}{(\alpha -\rho )(1+\bar{q}_x)}+\frac{M w(x)}{\alpha -\rho }+\frac{c}{\alpha }\right\} \\&\quad =u^{(0)}(x), \end{aligned}$$

where the last inequality follows from Condition 1(b) and Condition 2(c). Now the result of Step 1 follows from this and the monotonicity of the RHS of (12) with respect to $u^{(n)}$.

Step 2. We prove that $\forall ~n=0,1,\ldots ,$

$$\begin{aligned} |u^{(n)}(x)|\le \frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}+\frac{c}{\alpha }=u^{(0)}(x). \end{aligned}$$

On the one hand, the result of Step 1 implies that for each $n=0,1,\ldots ,$

$$\begin{aligned} u^{(n)}(x)\le \frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}+\frac{c}{\alpha }. \end{aligned}$$

On the other hand, we have that

$$\begin{aligned}&u^{(1)}(x)\\&\quad =\inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x} \int \limits _S u^{(0)}(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \right\} \\&\quad = \inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}\right. \\&\qquad \left. +\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S\left( \frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}+\frac{c}{\alpha }\right) \left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \right\} , \end{aligned}$$

and thus

$$\begin{aligned}&u^{(1)}(x)\\&\quad \ge \inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}\right\} \\&\qquad +\inf _{a\in A(x)} \left\{ \frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S\left( \frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}\!+\!\frac{c}{\alpha }\right) \left( \frac{q(dy|x,a)}{1+\bar{q}_x}\!+\!I\{x\in dy\}\right) \right\} \\&\quad \ge -\frac{Mw(x)+c}{\alpha +1+\bar{q}_x}\\&\qquad +\,\frac{1\!+\!\bar{q}_x}{\alpha \!+\!1\!+\!\bar{q}_x}\inf _{a\in A(x)}\left\{ \int \limits _S\!-\!\left( \frac{M(\alpha w(y)\!+\!b)}{\alpha (\alpha \!-\!\rho )}\!+\!\frac{c}{\alpha }\right) \left( \frac{q(dy|x,a)}{1\!+\!\bar{q}_x}\!+\!I\{x\in dy\}\right) \right\} , \end{aligned}$$

where the second inequality is because of Condition 2(c), $\frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}+\frac{c}{\alpha }\ge 0$ and the fact of $\frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}$ being a probability measure. It follows now that

$$\begin{aligned}&u^{(1)}(x)\\&\quad =-\frac{Mw(x)+c}{\alpha +1+\bar{q}_x}\\&\qquad -\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\sup _{a\in A(x)}\left\{ \int \limits _S\left( \frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}\!+\!\frac{c}{\alpha }\right) \left( \frac{q(dy|x,a)}{1+\bar{q}_x}\!+\!I\{x\in dy\}\right) \right\} \\&\quad \ge -\frac{Mw(x)+c}{\alpha +1+\bar{q}_x}\\&\qquad -\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\left\{ \frac{bM}{\alpha (\alpha -\rho )}+\frac{M(\rho w(x)+b)}{(\alpha -\rho )(1+\bar{q}_x)}+\frac{M w(x)}{\alpha -\rho }+\frac{c}{\alpha }\right\} =-u^{(0)}(x), \end{aligned}$$

where the last inequality follows from Condition 1(b). This and an inductive argument lead to that $\forall ~ n=0,1,\ldots , u^{(n)}(x)\ge -\left( \frac{M(\alpha w(y)+b)}{\alpha (\alpha -\rho )}+\frac{c}{\alpha }\right) .$ Thus, Step 2 is completed.

It follows from the results of Step 1 and Step 2 that $u^*(x)=\lim _{n\rightarrow \infty }u^{(n)}(x)$ exists and $u^{*}\in \mathbf{B}_{w}(S).$ It remains to prove that $u^*\in \mathbf{B}_w(S)$ solves the Bellman equation (11). For convenience we introduce the operator from $\mathbf{B}_w(S)$ to $\mathbf{B}_w(S)$ via

$$\begin{aligned}&T\circ u(x)\\&\quad := \inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S u(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \right\} , \end{aligned}$$

$\forall ~x\in S, u\in \mathbf{B}_w(S).$ Clearly, the operator $T$ is monotonic, and indeed increasing. So on the one hand, by $u^*(x)\le u^{(n)}(x),$ we have $T\circ u^*(x)\le T\circ u^{(n)}(x),\forall ~x\in S, n=0,1,\ldots ,$ where passing to the limit as $n\rightarrow \infty $ gives $T\circ u^*(x)\le u^*(x),\forall ~x\in S.$ On the other hand, $u^{(n+1)}(x)=T\circ u^{n}(x)\le \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S u^{(n)}(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) $, where passing to the limit as $n\rightarrow \infty ,$ which is permitted by the dominated convergence theorem, leads to $u^*(x)\le \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S u^*(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) ,$ and thus $u^*(x)\le T\circ u^{*}(x),\forall ~x\in S.$ Therefore, $u^*(x)=T\circ u^*(x),\forall ~x\in S,$ i.e.,

$$\begin{aligned} u^*(x)\!=\!\inf _{a\in A(x)}\left\{ \frac{c_0(x,a)}{\alpha +1+\bar{q}_x}+\frac{1+\bar{q}_x}{\alpha +1+\bar{q}_x}\int \limits _S u^*(y)\left( \frac{q(dy|x,a)}{1+\bar{q}_x}+I\{x\in dy\}\right) \right\} , \end{aligned}$$

which, under the imposed conditions, is equivalent to the Bellman equation, as can be seen after some rearrangements. $\square $

Lemma 2

Suppose Condition 1, Condition 2(a, b, c) and Condition 3(a, b)

are satisfied. Then under any policy $\pi ,$

$$\begin{aligned}&V_0(\gamma ,\pi )\nonumber \\&\quad =E_\gamma ^{\pi }\left[ \int \limits _0^\infty e^{-\alpha t}\int \limits _A\pi (da|\omega ,t)\left\{ c_0(\xi _t,a)-\alpha u(\xi _t)+\int \limits _S q(dy|\xi _t,a)u(y)\right\} dt\right] \nonumber \\&\quad \quad +\int \limits _S\gamma (dy)u(y), \end{aligned}$$

(18)

where $u\in \mathbf{B}_{w^{\prime }}(S)$ is an arbitrary function.

Proof

By applying Dynkin’s formula (10), we have

$$\begin{aligned} e^{-\alpha t}E_\gamma ^\pi \left[ u(\xi _t)\right]&= \int \limits _S \gamma (dy)u(y)+E_\gamma ^\pi \left[ \int \limits _0^te^{-\alpha v}\int \limits _A \pi (da|\omega ,v)\right. \\&\quad \left. \left\{ -\alpha u(\xi _v)+\int \limits _Sq(dy|\xi _v,a)u(y)\right\} dv\right] . \end{aligned}$$

The expectations of all particular summands are finite here. According to Theorem 1(c), see also its proof, we can formally add $E_\gamma ^\pi \left[ \int _0^te^{-\alpha v}\int _A\pi (da|\omega ,v)c_0(\xi _v,a) dv\right] $ to the both sides of the above equation, and take the limit as $t\rightarrow \infty $. We emphasize that $\lim _{t\rightarrow \infty }e^{-\alpha t}E_\gamma ^\pi \left[ u(\xi _t)\right] =0$ because of Theorem 1(a) and Condition 2(b). $\square $

The next result is a trivial corollary of Lemma 2. A weaker version, which is restricted to a specific class of Markov policies $\pi ,$ is also established in Guo (2007, Lem. 5.3) under stronger conditions requiring Condition 4(a, b).

Corollary 1

Suppose Condition 1, Condition 2(b, c) and Condition 3(a, b) are satisfied. Then under any fixed Markov policy $\pi ,$ $\forall ~x\in S$, $u\in \mathbf{B}_{w^{\prime }}(S),$ the following assertions hold.

(a)
If $\alpha u(x)\ge \int _A \pi (da|x,t)c_0(x,a)+\int _S\int _A\pi (da|x,t)q(dy|x,a)u(y),\forall ~x\in S,t\ge 0,$ then $u(x)\ge V_0(x,\pi ).$
(b)
If $\alpha u(x)\le \int _A \pi (da|x,t)c_0(x,a)+\int _S\int _A\pi (da|x,t)q(dy|x,a)u(y),\forall ~x\in S,t\ge 0,$ then $u(x)\le V_0(x,\pi ).$

Proof of Theorem 4

(a) Under the conditions of the theorem, $\mathbf{B}_{w^{\prime }}(S)\subseteq \mathbf{B}_{w}(S).$ So by Remark 6 and Hernández-Lerma and Lasserre (1996, Prop. D5), there is a (Borel-)measurable selector $\phi ^*:S\rightarrow A$ whose graph is contained in $K$ such that $\alpha u^*(x)=\inf _{a\in A(x)}\left\{ c_0(x,a)+\int _Sq(dy|x,a)u^*(y)\right\} = c_0(x,\phi ^*(x))+\int _Sq(dy|x,\phi ^*(x))u^*(y).$ By Lemma 2 and the definition of the function $u^*\in \mathbf{B}_{w^{\prime }}(S)$, see Remark 4, for any policy $\pi ,$

$$\begin{aligned}&V_0(x,\pi )\\&\quad =E_x^{\pi }\left[ \int \limits _0^\infty e^{-\alpha t}\int \limits _A\pi (da|\omega ,t)\left\{ c_0(\xi _t,a)-\alpha u^*(\xi _t)+\int \limits _S q(dy|\xi _t,a)u(y)\right\} dt\right] \nonumber \\&\quad \quad + u^*(x)\ge u^*(x), \end{aligned}$$

and $V_0(x,\phi ^*)= u^*(x),$ where $x\in S$ is arbitrarily fixed. Thus, the deterministic stationary policy given by $\phi ^*$ is optimal. From the above argument, the last two assertions are obvious.

Parts (b, c) are trivial consequences of part (a) of this theorem.

(d) We observe that the Bellman function $u^*(\cdot )$ is feasible for linear program (13). Consider any function $v(\cdot )$ that is also feasible for linear program (13). By referring to Corollary 1(b), we have that under any Markov policy $\pi ,$ $v(x)\le V_0(x,\pi ).$ Now suppose $\int _S\gamma (dy)v(y)>\int _S\gamma (dy)u^*(y).$ Then there exist some $\hat{x} \in S$ and constant $\delta >0$ such that $u^*(\hat{x})<v(\hat{x})-\delta .$ Hence, $u^*(\hat{x})<V_0(\hat{x},\pi )-\delta ,$ where $\pi $ is any Markov policy. But this contradicts part (a) of this theorem. Therefore, any feasible solution $v$ to linear program (13) satisfies $\int _S\gamma (dy)v(y)\le \int _S\gamma (dy)u^*(y),$ as required.

(e) From part (d) of this theorem, we know that the optimal value of linear program (13) is given by $\int _S u^*(y)\gamma (dy).$ Therefore, if some feasible solution $v$ to linear program (13) satisfies $u^*(x)=v(x)$ a.s. with respect to $\gamma $, then it solves the linear program, too. Hence we conclude the sufficiency part of the statement.

As for the necessity, let $v$ be any optimal solution to linear program (13). Suppose the relation of $v=u^*$ a.s. with respect to $\gamma $ is false. Then there exist measurable subsets $\varGamma _1,\varGamma _2\subseteq S$, such that the following conditions are satisfied: $\varGamma _1\bigcap \varGamma _2=\emptyset ,$ $v(x)>u^*(x)$ on $\varGamma _1,$ $v(x)<u^*(x)$ on $\varGamma _2,$ $v(x)=u^*(x)$ on $S{\setminus }\varGamma _1\setminus \varGamma _2,$ and the case $\gamma (\varGamma _1)=\gamma (\varGamma _2)=0$ is excluded. Now let us define a function $\hat{v}$ by $\hat{v}(x)=I\{x\in S{\setminus }\varGamma _2\}v(x)+I\{x\in \varGamma _2\}u^*(x),$ which is feasible for linear program (13). Indeed, firstly, it is evident that $\hat{v}\in \mathbf{B}_{w^{\prime }}(S)$. Secondly, we have that $\forall ~x\in S{\setminus }\varGamma _2,$

$$\begin{aligned}&\frac{1}{\alpha }c_0(x,a)-\hat{v}(x)+\frac{1}{\alpha } \int \limits _S\hat{v}(y)q(dy|x,a)\\&\quad =\frac{1}{\alpha }c_0(x,a)-v(x)+\frac{1}{\alpha } \int \limits _{S{\setminus }\varGamma _2}v(y)q(dy|x,a)+\frac{1}{\alpha } \int \limits _{\varGamma _2}u^*(y)q(dy|x,a)\\&\quad \ge \frac{1}{\alpha }c_0(x,a)-v(x)+\frac{1}{\alpha } \int \limits _{S{\setminus }\varGamma _2}v(y)q(dy|x,a)+\frac{1}{\alpha } \int \limits _{\varGamma _2}v(y)q(dy|x,a)\ge 0, \end{aligned}$$

and $\forall ~x\in \varGamma _2,$

$$\begin{aligned}&\frac{1}{\alpha }c_0(x,a)-\hat{v}(x)+\frac{1}{\alpha } \int \limits _S\hat{v}(y)q(dy|x,a)\\&\quad =\frac{1}{\alpha }c_0(x,a)-u^*(x)+\frac{1}{\alpha } \int \limits _{S{\setminus }\varGamma _2}v(y)q(dy|x,a)+\frac{1}{\alpha } \int \limits _{\varGamma _2}u^*(y)q(dy|x,a)\\&\quad \ge \frac{1}{\alpha }c_0(x,a)-u^*(x)+\frac{1}{\alpha } \int \limits _{S{\setminus }\varGamma _2}u^*(y)q(dy|x,a)+\frac{1}{\alpha } \int \limits _{\varGamma _2}u^*(y)q(dy|x,a)\ge 0. \end{aligned}$$

However, $\int _S\hat{v}(y)\gamma (dy)=\int _{S{\setminus }{\varGamma _2}}v(x) \gamma (dx)+\int \limits _{S{\setminus } {\varGamma _2}}u^*(x)\gamma (dx)> \int \limits _Sv(x)\gamma (dx),$ which is a contradiction against that $v$ is optimal for linear program (13). Now the necessity part follows. $\square $

Proof of Proposition 1

(a) We take functions $w$ and $w^{\prime }$ in the form

$$\begin{aligned} w(x)&= \left\{ \begin{array}{ll} 1, &{} \quad \text{ if }\,\, x=0; \\ \frac{1}{x^4}, &{} \quad \text{ if } \,\,x\in (0,1]; \end{array}\right. \\ w^{\prime }(x)&= \left\{ \begin{array}{ll} 1, &{}\quad \text{ if }\;\; x=0; \\ \frac{1}{x^2}, &{} \quad \text{ if }\,\, x\in (0,1], \end{array}\right. \end{aligned}$$

and put $S_0=\{0\}$, $S_l=S_0\cup \left( \frac{1}{l+1},1\right] $, $l=1,2,\ldots .$ Now Condition 1(a, c) is obviously satisfied.

Condition 1(b) can be verified for $\rho := 4\lambda $ and $b=0$ as follows:

if $x=0$ then
$$\begin{aligned} \int \limits _S q(dy|x,a) w(y)=5\lambda \int \limits _0^1\frac{1}{y^4} y^4 dy-\lambda =4\lambda =\rho w(0); \end{aligned}$$
if $x\in (0,1]$ then
$$\begin{aligned} \int \limits _S q(dy|x,a)w(y)=\frac{a}{x} w(0)-\frac{a}{x} w(x)=\frac{a}{x}\left( 1-\frac{1}{x^4}\right) \le 0<\rho w(x). \end{aligned}$$

For Condition 2, it is sufficient to notice that $\forall ~x\in (0,1],$

$$\begin{aligned} \inf _{a\in A(x)} c_0(x,a)=\left\{ \begin{array}{ll} C_1 x-\frac{1}{4C_2 x^2}, &{}\quad \text{ if } \frac{1}{2C_2}<\bar{A}; \\ C_1 x+C_2\frac{\bar{A}^2}{x^2}-\frac{\bar{A}}{x^2}, &{}\quad \text{ otherwise }, \end{array}\right. \end{aligned}$$

$\inf _{a\in A(0)}c_0(0,a)=0,$ and $\alpha >4\lambda =\rho .$

Condition 3(b, c, d) can be verified similarly to what is presented above by taking $\rho ^{\prime }=\frac{2\lambda }{3}$, $b^{\prime }=0$. Since

$$\begin{aligned} q_x(a)=\left\{ \begin{array}{ll} \lambda , &{}\quad \text{ if } x=0, \\ \ \\ \frac{a}{x}, &{}\quad \text{ if } x\in (0,1], \end{array}\right. \end{aligned}$$

$\forall ~x\in (0,1],A(x)=[0,\frac{\bar{A}}{x}]$, and $A(0)=\{0\},$ we have $\forall ~x\in (0,1],\bar{q}_x\le \frac{\bar{A}}{x^2}$ and $\bar{q}_{0}=\lambda .$ From this and trivial calculations, we see that Condition 3(a) is also satisfied.

Finally, Condition 4 obviously holds.

(b) If we denote $z^{(n+1)}=f(z^{(n)})$ then, for $z>\frac{\epsilon }{2}>0$, where $\epsilon >0$ is any fixed constant, the function $f$ is differentiable:

$$\begin{aligned} \frac{df}{dz}=\frac{-5\lambda }{\alpha +\lambda }\int \limits _0^1\frac{\partial u(y,z)}{\partial z} y^4 dy, \end{aligned}$$

where

$$\begin{aligned} \frac{\partial u(x,z)}{\partial z}&= -1+\frac{\alpha C_2 x^2}{\sqrt{\alpha ^2 C_2^2 x^4+C_1C_2 x^3+\alpha C_2 x^2z}}\\&= \frac{\alpha C_2 x^2-\sqrt{\alpha ^2 C_2^2 x^4+C_1C_2 x^3+\alpha C_2 x^2z}}{\sqrt{\alpha ^2 C_2^2 x^4+C_1C_2 x^3+\alpha C_2 x^2z}}\in (-1,0),\forall ~x\in (0,1], \end{aligned}$$

so that $\forall ~z\in (\frac{\epsilon }{2},\infty ), 0<\frac{df}{dz}<\frac{\lambda }{\alpha +\lambda }<1$.

It remains to estimate $z^{(1)}$:

$$\begin{aligned} u^{(1)}(x)&= -2\alpha C_2 x^2+2\sqrt{\alpha ^2 C_2^2 x^4+C_1C_2 x^3}\le -2\alpha C_2 x^2+\left( 2\alpha C_2 x^2+\frac{C_1 x}{\alpha }\right) \\&= \frac{C_1 x}{\alpha } \end{aligned}$$

for each $x\in (0,1]$;

$$\begin{aligned} z^{(1)} \ge 1-\frac{5\lambda C_1}{\alpha (\alpha +\lambda )}\int \limits _0^1 y dy>1-\frac{C_1}{2\alpha }\ge 0 \end{aligned}$$

because $\alpha >4\lambda $ and $C_1<2\alpha $. The map $z\rightarrow f(z)$ is contracting on $[\epsilon ,\infty )$, e.g., for $\epsilon =z^{(1)}$. Since

$$\begin{aligned} f\left( \frac{10}{7} C_2\lambda +\frac{\alpha +\lambda }{\alpha }\right)&< 1+\frac{5\lambda }{\alpha +\lambda }\left[ \int \limits _0^1 \left( 2\alpha C_2 x^2+\frac{10}{7} C_2\lambda +\frac{\alpha +\lambda }{\alpha }\right) x^4 dx\right] \\&= \frac{10}{7} C_2\lambda +\frac{\alpha +\lambda }{\alpha }, \end{aligned}$$

we conclude that $z^*< \frac{10}{7} C_2\lambda +\frac{\alpha +\lambda }{\alpha }$.

(c) Clearly, the function $u^*(x)$ (supplemented by $u^*(0)=1-z^*$) is bounded; hence $u^*\in \mathbf{B}_{w^{\prime }}(S)$. Therefore, according to Theorem 4, it is sufficient to check that $u^*$ solves equation (11) and $\phi ^*$ provides the infimum.

The expression in the parenthesis of (11) equals

$$\begin{aligned} \lambda \int \limits _0^1 u^*(y) 5 y^4 dy-\lambda u^*(0) \text{ if } x=0, \end{aligned}$$

and

$$\begin{aligned} C_1x+C_2 a^2-\frac{a}{x} +\frac{a}{x} u^*(0)-\frac{a}{x} u^*(x)\quad \text{ if } x\in (0,1]. \end{aligned}$$

Therefore,

$$\begin{aligned} u^*(0)=\frac{5\lambda }{\alpha +\lambda }\int \limits _0^1 u^*(y) y^4 dy \end{aligned}$$

and $\phi ^*(x)$ given by (15) provides the infimum. (Note that $u^*(x)+z^*\ge -2\alpha C_2 x^2+2\sqrt{\alpha ^2 C_2^2 x^4}=0.$) Finally, at $x>0$, the RHS of (11) equals $C_1 x-\frac{(u^*(x)+z^*)^2}{4x^2 C_2}$, and the equation

$$\begin{aligned} 4\alpha C_2 x^2 u^*(x)=4C_1C_2 x^3-(u^*(x))^2-2 u^*(x) z^*-(z^*)^2 \end{aligned}$$

holds because $u^*(x)=-2\alpha C_2 x^2-z^*+2\sqrt{\alpha ^2 C_2^2 x^4+C_1C_2 x^3+\alpha C_2 x^2 z^*}.$ $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Piunovskiy, A., Zhang, Y. Discounted continuous-time Markov decision processes with unbounded rates and randomized history-dependent policies: the dynamic programming approach. 4OR-Q J Oper Res 12, 49–75 (2014). https://doi.org/10.1007/s10288-013-0236-1

Download citation

Received: 29 November 2012
Revised: 19 February 2013
Published: 31 March 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s10288-013-0236-1

Keywords

Mathematics Subject Classification (2000)

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Discounted continuous-time Markov decision processes with unbounded rates and randomized history-dependent policies: the dynamic programming approach

Abstract

Access this article

Similar content being viewed by others

Existence and Uniqueness of Quasi-stationary Distributions for Symmetric Markov Processes with Tightness Property

A Crash Course in Differential Games and Applications

On some mean field games and master equations through the lens of conservation laws

Notes

References

Author information

Authors and Affiliations

Corresponding author

Appendix

Proof of Theorem 2

Lemma 1

Proof

Remark 6

Proof of Theorem 3

Lemma 2

Proof

Corollary 1

Proof of Theorem 4

Proof of Proposition 1

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification (2000)

Navigation

Discounted continuous-time Markov decision processes with unbounded rates and randomized history-dependent policies: the dynamic programming approach

Abstract

Access this article

Similar content being viewed by others

Existence and Uniqueness of Quasi-stationary Distributions for Symmetric Markov Processes with Tightness Property

A Crash Course in Differential Games and Applications

On some mean field games and master equations through the lens of conservation laws

Notes

References

Author information

Authors and Affiliations

Corresponding author

Appendix

Appendix

Proof of Theorem 2

Lemma 1

Proof

Remark 6

Proof of Theorem 3

Lemma 2

Proof

Corollary 1

Proof of Theorem 4

Proof of Proposition 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification (2000)

Search

Navigation