Abstract
We consider sets of Markov decision processes (MDPs) with shared state and action spaces and assume that the individual MDPs in such a set represent different scenarios for a system’s operation. In this setting, we address the problem of finding a single policy that performs well under each of these scenarios by optimizing the weighted sum of the value vectors of the individual scenarios. We discuss the general complexity of the problem, present several solution approaches, and derive algorithms based on them. Finally, we compare the derived algorithms on a set of benchmark problems.
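To fix ideas, the following minimal sketch (ours, not taken from the paper; the array names and shapes are assumptions) evaluates one fixed randomized stationary policy on K scenario MDPs with shared state and action spaces and forms the weighted sum of the resulting discounted values.

```python
# Minimal sketch: weighted sum of discounted values of one policy over K scenarios.
# Assumed shapes: P (K, A, S, S) transition matrices, R (K, S) state rewards,
# pi (S, A) randomized stationary policy, alpha (S,) initial distribution,
# w (K,) scenario weights, gamma in (0, 1) discount factor.
import numpy as np

def weighted_value(P, R, pi, alpha, w, gamma):
    K, A, S, _ = P.shape
    total = 0.0
    for k in range(K):
        # Policy-induced transition matrix: P_pi(s, s') = sum_a pi(s, a) * P_k^a(s, s')
        P_pi = np.einsum('sa,ast->st', pi, P[k])
        # Discounted value vector of scenario k: v = (I - gamma * P_pi)^{-1} r_k
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, R[k])
        total += w[k] * (alpha @ v)
    return total
```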


Notes
Equivalently, the corresponding slack variable is zero.
Alternatively, an Octave implementation using the solvers sqp and glpk is available and shows similar performance.
Appendices
Appendix A: Action-dependent rewards
Assume that rewards of a concurrent MDP depend on actions and successor states. Then \({\varvec{R}}_k^a(s,s')\) is the reward that is obtained in state s of MDP k if action a is chosen and \(s'\) is the successor state. Let
be a concurrent MDP with action-dependent rewards. We assume that this MDP is to be analyzed for the discounted reward with discount factor \(\gamma \in (0,1)\). Transforming it into an MDP whose rewards do not depend on the action results in the following MDP, which depends on \(\gamma \).
Observe that the state space \({\tilde{{\mathcal {S}}}}\) consists of states \(s \in {\mathcal {S}}\) and states \((s,a,s')\). The modified MDP alternates between states \(s \in {\tilde{{\mathcal {S}}}} \cap {\mathcal {S}}\) and states \((s,a,s') \in {\tilde{{\mathcal {S}}}} {\setminus } {\mathcal {S}}\). Rewards are only gained in the latter states.
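The display equations defining \({\tilde{{\mathcal {S}}}}\), the matrices \({\varvec{{\tilde{P}}}}^a_k\) and the reward vector \(\varvec{{\tilde{r}}}\) are not reproduced above. The following Python sketch shows one natural instantiation of the construction just described for a single MDP of the set; the dictionary layout, the function names and in particular the reward scaling by \(1/\sqrt{\gamma }\) (chosen so that one step at discount \(\gamma \) matches two steps at discount \(\sqrt{\gamma }\) under the convention that a state's reward is collected when the state is occupied) are our assumptions and may differ from the paper's exact definitions.

```python
import math

def expand_mdp(S, A, P, R, gamma):
    """Insert intermediate states (s, a, s') so that rewards depend only on the state.
    P[a][s][s'] are transition probabilities, R[a][s][s'] action-dependent rewards."""
    inter = [(s, a, t) for a in A for s in S for t in S if P[a][s].get(t, 0) > 0]
    S_t = list(S) + inter                                   # \tilde{S}
    P_t = {a: {u: {} for u in S_t} for a in A}              # \tilde{P}^a
    r_t = {u: 0.0 for u in S_t}                             # \tilde{r}, zero on original states
    for a in A:
        for s in S:
            for t in S:
                if P[a][s].get(t, 0) > 0:
                    P_t[a][s][(s, a, t)] = P[a][s][t]       # first half-step: choose a in s
        for (s, b, t) in inter:
            P_t[a][(s, b, t)][t] = 1.0                      # second half-step: move on to s'
    for (s, a, t) in inter:
        # Reward scaling is an assumption; the paper may fold this factor into
        # \tilde{P} or \tilde{r} differently.
        r_t[(s, a, t)] = R[a][s][t] / math.sqrt(gamma)
    return S_t, P_t, r_t

def lift_policy(pi, S_t, A):
    """Extend a policy on the original states to \tilde{S}; the action chosen in an
    intermediate state is irrelevant, so any fixed distribution works there."""
    uniform = {a: 1.0 / len(A) for a in A}
    return {u: pi.get(u, uniform) for u in S_t}
```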
For some policy \({\varvec{{\varPi }}}\) defined on \({\mathcal {S}}\) we define a policy \({\varvec{{\tilde{{\varPi }}}}}\) on \({\tilde{{\mathcal {S}}}}\) as follows
Thus, a policy for the MDP with action-dependent rewards can be uniquely translated into a policy for the MDP with action-independent rewards. Define the following sequences of vectors
The vectors \(\varvec{g}^h\) and \(\varvec{{\tilde{g}}}^h\) contain the expected discounted rewards accumulated over the course of h transitions under the policies \({\varvec{{\varPi }}}\) and \({\varvec{{\tilde{{\varPi }}}}}\), respectively. Now we show the main relation between \(\varvec{g}^h\) and \(\varvec{{\tilde{g}}}^h\).
Theorem 6
For all \(s \in {\mathcal {S}}\) and all \(h \in {{\mathbb {N}}}_0\) we have \(\varvec{g}^h(s) = \varvec{{\tilde{g}}}^{2h}(s) = \varvec{{\tilde{g}}}^{2h+1}(s)\).
Proof
We prove the correspondence of the values in the vectors for h and 2h by induction. For \(h=0\) we have by definition of the vectors \(\varvec{g}^0(s) = \varvec{{\tilde{g}}}^0(s) = 0\). Now assume that the correspondence holds for \(h \ge 0\), then
which implies that it holds for \(h+1\). Now we consider \(\varvec{{\tilde{g}}}^{2h+1}(s)\). For \(h = 0\) we have
because \(\varvec{{\tilde{r}}}(s) = 0\) and \(\varvec{{\tilde{g}}}^0={\mathbf {0}}\). Now assume that the result holds for \(2h+1\); we show that it holds for \(2h+3\).
\(\square \)
Let \(G^{\varvec{{\varPi }}}(\gamma ) = \varvec{\alpha }\lim _{h \rightarrow \infty } \varvec{g}^h\) and \({\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) = \varvec{{\tilde{\alpha }}}\lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^h\) be the discounted gains of the two MDPs with discount factors \(\gamma \in (0,1)\) and \(\sqrt{\gamma }\), respectively.
Theorem 7
For any policy \({\varvec{{\varPi }}}\) and any discount factor \(\gamma \in (0,1)\), the relation \(G^{\varvec{{\varPi }}}(\gamma ) = {\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma })\) holds.
Proof
First, we show the existence of \(G^{\varvec{{\varPi }}}(\gamma )\). Since \({\varvec{R}}^a_k(s, s')\) is finite for all \(a \in {\mathcal {A}}\) and all \(s, s' \in {\mathcal {S}}\) and can be bounded by a real value \(R \ge \left| {{\varvec{R}}^a_k(s, s')}\right| \), the sequence \(\left( \varvec{g}^h\right) _{h \in {{\mathbb {N}}}_0}\) satisfies
By induction, it follows that
for \(i \in {{\mathbb {N}}}\), which shows that the sequence satisfies the Cauchy criterion and therefore converges. Then Theorem 6 implies that \(\varvec{g}(s) = \lim _{h \rightarrow \infty } \varvec{g}^h(s) = \lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^{2h}(s) = \lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^h(s) = \varvec{{\tilde{g}}}(s)\) for all \(s \in {\mathcal {S}}\) and we have
\(\square \)
Analogously the following result can be shown.
Theorem 8
For any policy \({\varvec{{\tilde{{\varPi }}}}}\) in the action-independent MDP \(\left( {{\tilde{{\mathcal {S}}}}, \varvec{{\tilde{\alpha }}}, \left( {{\varvec{{\tilde{P}}}}^a_k}\right) _{a \in {\mathcal {A}}}, \varvec{{\tilde{r}}}}\right) \), the policy \({\varvec{{\varPi }}}\) in the action-dependent MDP obtained by projecting \({\varvec{{\tilde{{\varPi }}}}}\) onto \({\mathcal {S}}\) satisfies \({\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) = G^{\varvec{{\varPi }}}(\gamma )\).
This implies that every policy (including the optimal policy) in the MDP with action-independent rewards yields the same expected discounted reward as its projection in the MDP with action-dependent rewards. Together with Theorem 7, this establishes a one-to-one correspondence between the policies of the two MDPs that preserves the expected discounted reward; in particular, optimal policies in one MDP are also optimal in the other. This completes the transformation from the action-dependent reward model.
Appendix B: Proof of Theorem 1
Theorem 1
The decision problem defined in Definition 1 is NP-complete.
We perform a reduction from 3-SAT. Given a 3-SAT instance with n variables \(x_1, \ldots , x_n\) and m clauses \(C_1, \ldots , C_m\), each containing three literals, we construct a concurrent MDP \({\mathcal {M}}= \{1, 2\}\) consisting of two MDPs, a vector \(\varvec{w}\in {{\mathbb {R}}}^2\) and a real number \(g \in {{\mathbb {R}}}\) such that the instance is satisfiable if and only if there is a policy \({\varvec{{\varPi }}}\) for the concurrent MDP that achieves a weighted value of at least g.
The first part of our construction is the set of states, which are arranged in three groups.
-
First, we create a specially designated sink state \(s_0\) which yields 0 reward in both MDPs.
-
Then, we translate the variables of the Boolean satisfiability problem into states of the MDPs: for each variable x we create two states \(s_{x}, s_{x}'\) in both MDPs. The reward is 0 in the states \(s_{x}\) and 1 in the states \(s_{x}'\).
-
Last, we create states for clauses: for each clause C, a state \(s_C\) is created. Again, the reward in \(s_C\) is zero for all clauses.
The second part of the construction is the set of actions. We create three actions \({\mathcal {A}}= \{1, 2, 3\}\) with the following semantics.
-
In the sink state \(s_0\), we set \({\varvec{P}}^a_k(s_0, s_0) = 1\) for all \(a \in {\mathcal {A}}\) and \(k \in {\mathcal {M}}\).
-
In the variable states, we define \({\varvec{P}}^1_1(s_x, s_{x}') = {\varvec{P}}^2_1(s_x, s_0) = {\varvec{P}}^3_1(s_x, s_0) = 1\) and \({\varvec{P}}^1_2(s_x, s_0) = {\varvec{P}}^2_2(s_x, s_{x}') = {\varvec{P}}^3_2(s_x, s_{x}') = 1\), that is, the actions in \(s_x\) lead to different outcomes in the two MDPs. The motivation is to force a mutually exclusive choice of values for the Boolean variables in the concurrent MDP. In the auxiliary variable states, we set \({\varvec{P}}^a_k(s_x', s_x) = 1\) for all actions \(a \in {\mathcal {A}}\) and MDPs \(k \in {\mathcal {M}}\); the idea behind these states is to exploit the non-linearity of the problem. The construction is visualized in Fig. 3, where the upper part corresponds to the first MDP and the lower part corresponds to the second MDP in \({\mathcal {M}}\).
-
In the clause states, we define actions as follows. In a clause \(C = L_1 \vee L_2 \vee L_3\), the chosen action represents the literal that is supposed to evaluate to true. Hence, we define \({\varvec{P}}^a_k(s_C, s)\) by setting \({\varvec{P}}^a_k(s_C, s) = 1\) in the following cases:
-
\(L_a = x, k = 1, s = s_x\)
-
\(L_a = \lnot x, k = 1, s = s_0\)
-
\(L_a = \lnot x, k = 2, s = s_x\)
-
\(L_a = x, k = 2, s = s_0\)
-
A graphical sketch of this setup can be seen in Fig. 4. Again, the upper part of the drawing corresponds to the first MDP in \({\mathcal {M}}\) while the lower part corresponds to the second MDP.
The idea behind this construction is to infer functions \(\beta :\{1, \ldots , n\} \rightarrow \{0, 1\}\) that map variables to truth values and \(\nu :\{1, \ldots , m\} \rightarrow \{1, 2, 3\}\) that map the clauses to the indices of their satisfying literals. This is done to create a mapping from policies to variable assignments of the SAT instance. Furthermore, we define the initial distribution \(\alpha \) with \(\alpha (s_C) = {1}/{m}\) for all clauses C for both MDPs and the weights \(\varvec{w}= ({1}/{2}, {1}/{2})\). Concerning the value, we set an auxiliary constant \(q := \frac{1}{1 - \gamma ^2}\) and the required value \(g := \frac{\gamma ^2 q}{2}\), where \(\gamma \) is a non-zero discount factor of the concurrent MDP.
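To make the construction concrete, the following Python sketch (our own encoding, not taken from the paper; all identifiers and state labels are hypothetical) assembles the two MDPs, the rewards, the initial distribution, the weights and the threshold \(g\) from a 3-SAT instance given as clauses of signed variable indices.

```python
# Sketch of the reduction described above. Clauses are triples of signed integers,
# e.g. (1, -2, 3) encodes x1 or (not x2) or x3.
def build_concurrent_mdp(n_vars, clauses, gamma):
    S = ['s0'] + [('x', i) for i in range(1, n_vars + 1)] \
               + [('xp', i) for i in range(1, n_vars + 1)] \
               + [('C', j) for j in range(len(clauses))]
    A = (1, 2, 3)
    # Reward 1 only in the auxiliary variable states s_x'.
    r = {s: (1.0 if isinstance(s, tuple) and s[0] == 'xp' else 0.0) for s in S}
    P = {k: {a: {s: {} for s in S} for a in A} for k in (1, 2)}
    for k in (1, 2):
        for a in A:
            P[k][a]['s0']['s0'] = 1.0                        # absorbing sink state
            for i in range(1, n_vars + 1):
                P[k][a][('xp', i)][('x', i)] = 1.0           # s_x' always returns to s_x
                # In s_x, action 1 means "x is true": it leads to the rewarded state s_x'
                # in MDP 1 and to the sink in MDP 2 (and vice versa for actions 2 and 3).
                rewarded = (k == 1 and a == 1) or (k == 2 and a != 1)
                P[k][a][('x', i)][('xp', i) if rewarded else 's0'] = 1.0
            for j, clause in enumerate(clauses):
                lit = clause[a - 1]                          # literal selected by action a
                i, positive = abs(lit), lit > 0
                # A positive literal leads to s_x in MDP 1, a negated one in MDP 2.
                target = ('x', i) if positive == (k == 1) else 's0'
                P[k][a][('C', j)][target] = 1.0
    alpha = {s: (1.0 / len(clauses) if isinstance(s, tuple) and s[0] == 'C' else 0.0)
             for s in S}
    w = (0.5, 0.5)
    q = 1.0 / (1.0 - gamma ** 2)
    g = gamma ** 2 * q / 2.0
    return S, A, P, r, alpha, w, g
```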
We prove the validity of the reduction. First, we show that if there is an assignment \(\beta :\{1, \ldots , n\} \rightarrow \{0, 1\}\) that satisfies the SAT instance, then there also exists a policy \({\varvec{{\varPi }}}\) such that \(\sum _{k=1}^{K} \varvec{w}(k) G^{{\varvec{{\varPi }}}}_k \ge g\). We construct the policy in two steps. In the first step, we set \(\pi _{s_x}(1) = 1 \Leftrightarrow \beta (x) = 1\) for all variables x. In the second step, the existence of a satisfying assignment implies that each clause contains a satisfied literal, which yields a function \(\nu :\{1, \ldots , m\} \rightarrow \{1, 2, 3\}\) selecting the index of a satisfied literal in every clause. Thus, we set \(\pi _{s_C}(a) = 1 \Leftrightarrow \nu (C) = a\).
We verify that the constructed policy yields the required value. Since in each clause a satisfied literal is chosen, the value of the corresponding clause state is 0 in one MDP and \(\gamma ^2 q\) in the other.
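Spelling this out (writing \(V_k(s_C)\) for the expected discounted reward of MDP k when started in \(s_C\), a notation used only for this computation), the weighted gain of the constructed policy is \(\sum _{k=1}^{2} \varvec{w}(k) G^{{\varvec{{\varPi }}}}_k = \frac{1}{m}\sum _{C}\frac{V_1(s_C)+V_2(s_C)}{2} = \frac{1}{m}\sum _{C}\frac{0+\gamma ^2 q}{2} = \frac{\gamma ^2 q}{2} = g\), so the constructed policy indeed attains the required value.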
Now we show that if there is no satisfying assignment, then the value of the concurrent MDP is lower than g. Given any assignment \(\beta \) and any assignment \(\nu \), the induced policy leads from at least one clause state to the sink state \(s_0\) with nonzero probability in both MDPs, yielding a lower value. However, we must take care of stationary but non-pure policies that might still attain the desired value. One can observe that if the stationary policy is not pure in a state \(s_x\) for a variable x, then the cumulative discounted reward in this state is \(\frac{p\gamma }{1 - p^2\gamma ^2}\) for some real \(0< p < 1\). Deriving the value of a clause state from which this variable state can be reached, we get, summing over both MDPs, a summand
Let \(f(p) = \frac{p}{1 - p^2 \gamma ^2} + \frac{1 - p}{1 - (1 - p)^2 \gamma ^2}\). Computing the derivative, we obtain
which has its roots at
The only roots of interest are the real ones, and thus, we investigate the pair
It can be seen that for \(0< \gamma < 1\), the value \(\sqrt{4 - \gamma ^2}\) is at least \(\sqrt{3} > 1\), and the root term in (29) is thus greater than one. This means that the whole term (29) is either greater than one or negative. Hence, the only candidates for extreme points of f on [0, 1] are the endpoints 0 and 1 and the interior critical point \({1}/{2}\). We see that \(f(0) = f(1) = q\), while \(f({1}/{2}) = \frac{1}{1 - {1}/{4} \gamma ^2} < q\). Hence, a non-pure policy in a variable state yields a lower cumulative discounted reward. Concerning the clause states, we observe that a non-pure policy cannot yield higher rewards than a pure one, since the expected discounted reward in a clause state is linear in the expected discounted rewards of the following variable states; the clause states are not visited again.
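The claim about f can also be checked numerically. The following snippet (ours, not part of the proof) evaluates f on a grid for several discount factors and confirms that its maximum on \([0,1]\) is attained at the endpoints, where \(f(0)=f(1)=q\), while \(f(1/2)=\frac{1}{1-\gamma ^2/4}<q\).

```python
# Numerical sanity check: f(p) = p/(1 - p^2 g^2) + (1-p)/(1 - (1-p)^2 g^2)
# attains its maximum on [0, 1] at p = 0 and p = 1, where it equals q = 1/(1 - g^2).
import numpy as np

for gamma in (0.3, 0.7, 0.9, 0.99):
    p = np.linspace(0.0, 1.0, 10001)
    f = p / (1 - p**2 * gamma**2) + (1 - p) / (1 - (1 - p)**2 * gamma**2)
    q = 1.0 / (1.0 - gamma**2)
    assert f.max() <= q + 1e-9                                         # interior values stay below q
    assert abs(f[len(p) // 2] - 1.0 / (1.0 - 0.25 * gamma**2)) < 1e-9  # value at p = 1/2
    print(f"gamma={gamma}: max f = {f.max():.6f}, q = {q:.6f}, f(1/2) = {f[len(p)//2]:.6f}")
```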