
Computation of weighted sums of rewards for concurrent MDPs

  • Original Article
  • Mathematical Methods of Operations Research

Abstract

We consider sets of Markov decision processes (MDPs) with shared state and action spaces and assume that the individual MDPs in such a set represent different scenarios for a system’s operation. In this setting, we solve the problem of finding a single policy that performs well under each of these scenarios by considering the weighted sum of the value vectors of the scenarios. We discuss several solution approaches as well as the general complexity of the problem and present algorithms based on these approaches. Finally, we compare the derived algorithms on a set of benchmark problems.


Notes

  1. Equivalently, the corresponding slack variable is zero.

  2. Alternatively, an Octave implementation using the solvers sqp and glpk is available and shows similar performance.


Author information

Correspondence to Peter Buchholz.

Appendices

Appendix A: Action-dependent rewards

Assume that rewards of a concurrent MDP depend on actions and successor states. Then \({\varvec{R}}_k^a(s,s')\) is the reward that is obtained in state s of MDP k if action a is chosen and \(s'\) is the successor state. Let

$$\begin{aligned} \left( {\mathcal {S}}, \varvec{\alpha }, \left( {\varvec{P}}^a_k\right) _{a \in {\mathcal {A}}}, \left( {\varvec{R}}^a_k\right) _{a \in {\mathcal {A}}}\right) \end{aligned}$$
(27)

be a concurrent MDP with action-dependent rewards. We assume that this MDP is to be analyzed for the discounted reward with discount factor \(\gamma \in ({0,1})\). The transformation into an MDP whose rewards do not depend on the action results in the following MDP, which depends on \(\gamma \).

$$\begin{aligned} \begin{array}{l} \left( {\tilde{{\mathcal {S}}}}, \varvec{{\tilde{\alpha }}}, \left( {\varvec{{\tilde{P}}}}^a_k\right) _{a \in {\mathcal {A}}}, \varvec{{\tilde{r}}}\right) , \text{ where } \\ {\tilde{{\mathcal {S}}}} = {\mathcal {S}}\cup \left\{ (s,a,s') | s, s' \in {\mathcal {S}}, a \in {\mathcal {A}}, {\varvec{P}}_k^a(s,s') > 0\right\} \\ \varvec{{\tilde{\alpha }}}\in {{\mathbb {R}}}_{\ge 0}^{|{\tilde{{\mathcal {S}}}}|\times 1}, \varvec{{\tilde{\alpha }}}\mathbb {1}= 1 \text{ with } \varvec{{\tilde{\alpha }}}(\varvec{\sigma }) = \left\{ \begin{array}{ll} \varvec{\alpha }(s) &{} \text{ if } \varvec{\sigma } = s, \\ 0 &{} \text{ otherwise, } \end{array}\right. \\ {\varvec{{\tilde{P}}}}_k^a \in {{\mathbb {R}}}^{|{\tilde{{\mathcal {S}}}}| \times |{\tilde{{\mathcal {S}}}}|}_{\ge 0}, {\varvec{{\tilde{P}}}}_k^a\mathbb {1}= \mathbb {1} \text{ with } {\varvec{{\tilde{P}}}}_k^a(\varvec{\sigma },\varvec{\sigma '}) = \left\{ \begin{array}{ll} {\varvec{P}}_k^a(s,s') &{} \text{ if } \varvec{\sigma } = s \text{ and } \varvec{\sigma '} = (s,a,s'),\\ 1 &{} \text{ if } \varvec{\sigma } = (s, a, s') \text{ and } \varvec{\sigma '} = s', \\ 0 &{} \text{ otherwise, } \end{array}\right. \\ \varvec{{\tilde{r}}}\in {{\mathbb {R}}}^{1 \times |{\tilde{{\mathcal {S}}}}|} \text{ with } \varvec{{\tilde{r}}}(\varvec{\sigma }) = \left\{ \begin{array}{ll} \frac{{\varvec{R}}^a_k(s,s')}{\sqrt{\gamma }} &{} \text{ if } \varvec{\sigma } = (s, a, s'),\\ 0 &{} \text{ otherwise. } \end{array}\right. \end{array} \end{aligned}$$
(28)

Observe that the state space \({\tilde{{\mathcal {S}}}}\) consists of states \(s \in {\mathcal {S}}\) and states \((s,a,s')\). The modified MDP alternates between states \(s \in {\tilde{{\mathcal {S}}}} \cap {\mathcal {S}}\) and states \((s,a,s') \in {\tilde{{\mathcal {S}}}} {\setminus } {\mathcal {S}}\). Rewards are only gained in the latter states.
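To make this construction concrete, the following minimal Python sketch (our illustration, not code from the paper; the name expand_mdp and the dictionary-based interface are assumptions of this sketch, and a single scenario k is considered) builds the expanded state space, transition matrices, reward vector and initial distribution of (28).

```python
import numpy as np

def expand_mdp(P, R, alpha, gamma):
    """Illustrative sketch of transformation (28) for a single scenario k.

    P, R   : dicts mapping an action a to an |S| x |S| transition matrix and an
             |S| x |S| reward matrix (rewards depend on action and successor).
    alpha  : initial distribution over the original states.
    gamma  : discount factor in (0, 1).
    Returns the expanded state list, per-action transition matrices, the
    state-based reward vector and the expanded initial distribution.
    """
    n = next(iter(P.values())).shape[0]
    actions = list(P)
    # auxiliary states (s, a, s') for all transitions with positive probability
    aux = [(s, a, t) for a in actions for s in range(n) for t in range(n)
           if P[a][s, t] > 0]
    states = list(range(n)) + aux
    idx = {sigma: i for i, sigma in enumerate(states)}
    m = len(states)

    P_tilde = {a: np.zeros((m, m)) for a in actions}
    r_tilde = np.zeros(m)
    for (s, a, t) in aux:
        j = idx[(s, a, t)]
        P_tilde[a][s, j] = P[a][s, t]              # s --(action a)--> (s, a, s')
        r_tilde[j] = R[a][s, t] / np.sqrt(gamma)   # reward scaled by 1/sqrt(gamma)
        for b in actions:                          # (s, a, s') --(any action)--> s'
            P_tilde[b][j, t] = 1.0
    alpha_tilde = np.concatenate([alpha, np.zeros(len(aux))])
    return states, P_tilde, r_tilde, alpha_tilde
```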

For some policy \({\varvec{{\varPi }}}\) defined on \({\mathcal {S}}\) we define a policy \({\varvec{{\tilde{{\varPi }}}}}\) on \({\tilde{{\mathcal {S}}}}\) as follows

$$\begin{aligned} \varvec{{\tilde{\pi }}}_{\varvec{\sigma }}(a) = \left\{ \begin{array}{ll} \varvec{\pi }_s(a) &{}\quad \text{ if } \varvec{\sigma }=s \in {\mathcal {S}}, \\ 1 &{} \quad \text{ if } \varvec{\sigma } = (s, a, s'),\\ 0 &{} \quad \text{ otherwise. } \end{array}\right. \end{aligned}$$

Thus, a policy for the MDP with action-dependent rewards can be uniquely translated into a policy for the MDP with action-independent rewards. Define the following sequences of vectors

$$\begin{aligned} \begin{array}{lll} \varvec{g}^0 = {\mathbf {0}}\in {{\mathbb {R}}}^{|{\mathcal {S}}| \times 1}, &{} \varvec{g}^h(s) = \sum \limits _{s' \in {\mathcal {S}}} \sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( {\varvec{R}}_k^a(s,s')+ \gamma \varvec{g}^{h-1}(s')\right) \\ \varvec{{\tilde{g}}}^0 = {\mathbf {0}}\in {{\mathbb {R}}}^{|{\tilde{{\mathcal {S}}}}| \times 1}, &{} \varvec{{\tilde{g}}}^h(\varvec{\sigma }) = \sum \limits _{\varvec{\sigma '} \in {\tilde{{\mathcal {S}}}}} \sum \limits _{a \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_{\varvec{\sigma }}(a){\varvec{{\tilde{P}}}}_k^a(\varvec{\sigma },\varvec{\sigma '})\left( \varvec{{\tilde{r}}}(\varvec{\sigma })+ \sqrt{\gamma } \varvec{{\tilde{g}}}^{h-1}(\varvec{\sigma '})\right) \end{array} \end{aligned}$$

The vectors \(\varvec{g}^h\) and \(\varvec{{\tilde{g}}}^h\) contain the expected discounted rewards accumulated over h transitions under the policies \({\varvec{{\varPi }}}\) and \({\varvec{{\tilde{{\varPi }}}}}\), respectively. We now establish the main relation between \(\varvec{g}^h\) and \(\varvec{{\tilde{g}}}^h\).
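As an illustrative numerical check (again our addition, building on the expand_mdp sketch above and assuming it is in scope), the following code lifts a random policy as described above, iterates both recursions, and compares \(\varvec{g}^h(s)\) with \(\varvec{{\tilde{g}}}^{2h}(s)\), which is exactly the relation stated in Theorem 6 below; the printed gains also agree, anticipating Theorem 7.

```python
import numpy as np

rng = np.random.default_rng(0)
# small random MDP with action- and successor-dependent rewards (one scenario k)
n, actions, gamma = 4, [0, 1], 0.9
P = {a: rng.dirichlet(np.ones(n), size=n) for a in actions}  # row-stochastic
R = {a: rng.uniform(0.0, 1.0, size=(n, n)) for a in actions}
alpha = np.full(n, 1.0 / n)
pi = rng.dirichlet(np.ones(len(actions)), size=n)            # randomized policy

states, P_t, r_t, alpha_t = expand_mdp(P, R, alpha, gamma)   # sketch from above
m = len(states)
pi_t = np.zeros((m, len(actions)))                           # lifted policy
pi_t[:n, :] = pi
for i, (s, a, t) in enumerate(states[n:], start=n):
    pi_t[i, a] = 1.0                                         # pick a in (s, a, s')

def step(g):      # one application of the recursion for g^h
    return np.array([sum(pi[s, a] * P[a][s] @ (R[a][s] + gamma * g)
                         for a in actions) for s in range(n)])

def step_t(gt):   # one application of the recursion for tilde g^h
    return np.array([sum(pi_t[i, a] * P_t[a][i] @ (r_t[i] + np.sqrt(gamma) * gt)
                         for a in actions) for i in range(m)])

g, gt = np.zeros(n), np.zeros(m)
for h in range(1, 31):
    g, gt = step(g), step_t(step_t(gt))          # two tilde steps per original step
    assert np.allclose(g, gt[:n])                # Theorem 6: g^h(s) = g~^{2h}(s)
print("gains:", alpha @ g, alpha_t @ gt)         # equal, and converge (Theorem 7)
```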

Theorem 6

For all \(s \in {\mathcal {S}}\) and all \(h \in {{\mathbb {N}}}_0\) we have \(\varvec{g}^h(s) = \varvec{{\tilde{g}}}^{2h}(s) = \varvec{{\tilde{g}}}^{2h+1}(s)\).

Proof

We prove the correspondence of the values in the vectors for h and 2h by induction. For \(h=0\) we have \(\varvec{g}^0(s) = \varvec{{\tilde{g}}}^0(s) = 0\) by the definition of the vectors. Now assume that the correspondence holds for some \(h \ge 0\); then

$$\begin{aligned} \begin{array}{lll} \varvec{{\tilde{g}}}^{2h+2}(s) &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_s(a){\varvec{{\tilde{P}}}}_k^a(s,(s,a,s')) \left( \varvec{{\tilde{r}}}(s) + \sqrt{\gamma } \varvec{{\tilde{g}}}^{2h+1}(s,a,s')\right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( \sum \limits _{s''\in {\mathcal {S}}}\sum \limits _{a' \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_{(s,a,s')}(a'){\varvec{{\tilde{P}}}}_k^{a'}((s,a,s'), s'') \right. \\ &{} \quad \left. \times \left( \sqrt{\gamma } \varvec{{\tilde{r}}}(s,a,s') + \gamma \varvec{{\tilde{g}}}^{2h}(s'') \right) \right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s') \left( \sqrt{\gamma } \varvec{{\tilde{r}}}(s,a,s') + \gamma \varvec{{\tilde{g}}}^{2h}(s') \right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( \sqrt{\gamma } \frac{{\varvec{R}}_k^a(s,s')}{\sqrt{\gamma }} + \gamma \varvec{g}^h(s')\right) \\ &{} = \varvec{g}^{h+1}(s) \end{array} \end{aligned}$$

which implies that it holds for \(h+1\). Now we consider \(\varvec{{\tilde{g}}}^{2h+1}(s)\). For \(h = 0\) we have

$$\begin{aligned} \begin{array}{lll} \varvec{{\tilde{g}}}^1(s) = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{{\tilde{P}}}}_k^a(s,(s,a,s')) \left( \varvec{{\tilde{r}}}(s) + \sqrt{\gamma }\varvec{{\tilde{g}}}^{0}(s,a,s')\right) = 0 \end{array} \end{aligned}$$

because \(\varvec{{\tilde{r}}}(s) = 0\) and \(\varvec{{\tilde{g}}}^0={\mathbf {0}}\). Now assume that the result holds for \(2h+1\); we show that it then holds for \(2h+3\).

$$\begin{aligned} \begin{array}{lll} \varvec{{\tilde{g}}}^{2h+3}(s) &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_s(a){\varvec{{\tilde{P}}}}_k^a(s,(s,a,s')) \left( \varvec{{\tilde{r}}}(s) + \sqrt{\gamma }\varvec{{\tilde{g}}}^{2h+2}(s,a,s')\right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( \sum \limits _{s''\in {\mathcal {S}}}\sum \limits _{a' \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_{(s,a,s')}(a'){\varvec{{\tilde{P}}}}_k^{a'}((s,a,s'), s'')\right. \\ &{} \quad \times \left. \left( \sqrt{\gamma }\varvec{{\tilde{r}}}(s,a,s') + \gamma \varvec{{\tilde{g}}}^{2h+1}(s'') \right) \right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( \sqrt{\gamma }\frac{{\varvec{R}}_k^a(s,s')}{\sqrt{\gamma }} +\gamma \varvec{g}^h(s')\right) \\ &{} = \varvec{g}^{h+1}(s) \end{array} \end{aligned}$$

\(\square \)

Let \(G^{\varvec{{\varPi }}}(\gamma ) = \varvec{\alpha }\lim _{h \rightarrow \infty } \varvec{g}^h\) and \({\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) = \varvec{{\tilde{\alpha }}}\lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^h\) be the discounted gains for the two MDPs with discount factors \(\gamma \in ({0,1})\) and \(\sqrt{\gamma }\), respectively.
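Since the limits exist (this is shown in the proof of Theorem 7 below), the gains can also be written in the familiar linear-system form. Writing, for convenience only, \(\varvec{P}_{{\varvec{{\varPi }}},k}(s,s') = \sum _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\) and \(\varvec{r}_{{\varvec{{\varPi }}},k}(s) = \sum _{a \in {\mathcal {A}}} \varvec{\pi }_s(a)\sum _{s' \in {\mathcal {S}}}{\varvec{P}}_k^a(s,s'){\varvec{R}}_k^a(s,s')\), with the analogous quantities for the transformed MDP and with the reward vectors read as column vectors, the limit vectors are the fixed points of the recursions above, and hence

$$\begin{aligned} G^{\varvec{{\varPi }}}(\gamma ) = \varvec{\alpha }\left( {\varvec{I}}- \gamma \varvec{P}_{{\varvec{{\varPi }}},k}\right) ^{-1} \varvec{r}_{{\varvec{{\varPi }}},k} \quad \text{ and } \quad {\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) = \varvec{{\tilde{\alpha }}}\left( {\varvec{I}}- \sqrt{\gamma }\, \varvec{{\tilde{P}}}_{{\varvec{{\tilde{{\varPi }}}}},k}\right) ^{-1} \varvec{{\tilde{r}}}. \end{aligned}$$

Both transition matrices are stochastic, so the inverses exist for discount factors below one.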

Theorem 7

For any policy \({\varvec{{\varPi }}}\) and any discount factor \(\gamma \in ({0,1})\) the relation

$$\begin{aligned} G^{\varvec{{\varPi }}}(\gamma ) = {\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) \end{aligned}$$

holds.

Proof

First, we show the existence of \(G^{\varvec{{\varPi }}}(\gamma )\). Since \({\varvec{R}}^a_k(s, s')\) is finite for all \(a \in {\mathcal {A}}\) and all \(s, s' \in {\mathcal {S}}\) and can be bounded by a real value \(R \ge \left| {{\varvec{R}}^a_k(s, s')}\right| \), the sequence \(\left( \varvec{g}^h\right) _{h \in {{\mathbb {N}}}_0}\) satisfies

$$\begin{aligned} \left| {\varvec{g}^{h+i}(s) - \varvec{g}^i(s)}\right|&\le \sum \limits _{s' \in {\mathcal {S}}} \sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a) {\varvec{P}}^a_k(s, s') \left| {\varvec{R}}^a_k(s, s') + \gamma \varvec{g}^{h+i - 1}(s') - {\varvec{R}}^a_k(s, s') \right. \\&\quad \left. - \gamma \varvec{g}^{i - 1}(s') \right| \\&\le \gamma \max \limits _{s' \in {\mathcal {S}}} \left| { \varvec{g}^{h+i - 1}(s') - \varvec{g}^{i - 1}(s') }\right| . \end{aligned}$$

By induction, it follows that

$$\begin{aligned} \left| {\varvec{g}^{h+i}(s) - \varvec{g}^i(s)}\right|&\le \gamma ^i \max \limits _{s' \in {\mathcal {S}}} \left| {\varvec{g}^h(s')}\right| \\&\le \gamma ^i R \sum \limits _{j = 0}^{h-1} \gamma ^j \\&= \gamma ^i R \frac{1 - \gamma ^h}{1 - \gamma } \\&\le \gamma ^i \frac{R}{1 - \gamma } \end{aligned}$$

for all \(i \in {{\mathbb {N}}}\), which shows that the sequence satisfies the Cauchy criterion and therefore converges. Then Theorem 6 implies that \(\varvec{g}(s) = \lim _{h \rightarrow \infty } \varvec{g}^h(s) = \lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^{2h}(s) = \lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^h(s) = \varvec{{\tilde{g}}}(s)\) for all \(s \in {\mathcal {S}}\), and we have

$$\begin{aligned} G^{\varvec{{\varPi }}}(\gamma )&= \sum \limits _{s \in {\mathcal {S}}} \varvec{\alpha }(s) \varvec{g}(s) \\&= \sum \limits _{s \in {\mathcal {S}}} \varvec{{\tilde{\alpha }}}(s) \varvec{{\tilde{g}}}(s)&\text {since }\varvec{{\tilde{\alpha }}}(s) = \varvec{\alpha }(s)\text { for }s \in {\mathcal {S}}\\&= \sum \limits _{\sigma \in {\tilde{{\mathcal {S}}}}} \varvec{{\tilde{\alpha }}}(\sigma ) \varvec{{\tilde{g}}}(\sigma )&\text {since }\varvec{{\tilde{\alpha }}}(\sigma ) = 0\text { for }\sigma \not \in {\mathcal {S}}\\&= {\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }). \end{aligned}$$

\(\square \)

Analogously, the following result can be shown.

Theorem 8

For any policy \({\varvec{{\tilde{{\varPi }}}}}\) in the action-independent MDP \(\left( {{\tilde{{\mathcal {S}}}}, \varvec{{\tilde{\alpha }}}, \left( {{\varvec{{\tilde{P}}}}^a_k}\right) _{a \in {\mathcal {A}}}, \varvec{{\tilde{r}}}}\right) \), the policy \({\varvec{{\varPi }}}\) in the action-dependent MDP obtained by projecting \({\varvec{{\tilde{{\varPi }}}}}\) onto \({\mathcal {S}}\) satisfies

$$\begin{aligned} G^{\varvec{{\varPi }}}(\gamma ) = {\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) \end{aligned}$$

This implies that every policy (including an optimal policy) in the MDP with action-independent rewards yields the same expected discounted reward as its projection in the MDP with action-dependent rewards. Together with Theorem 7, this establishes a one-to-one correspondence between policies of the two MDPs that preserves the expected discounted reward; in particular, optimal policies in one MDP are also optimal in the other. This completes the transformation from the action-dependent reward model.

Appendix B: Proof of Theorem 1

Theorem 1

The decision problem defined in Definition 1 is NP-complete.

Fig. 3: The variable gadget

We perform a reduction from 3-SAT. Given a 3-SAT instance with n variables \(x_1, \ldots , x_n\) and m clauses \(C_1, \ldots , C_m\), each containing three literals, we construct a concurrent MDP \({\mathcal {M}}= \{1, 2\}\) consisting of two MDPs, a weight vector \(\varvec{w}\in {{\mathbb {R}}}^2\) and a real number \(g \in {{\mathbb {R}}}\) such that the instance is satisfiable if and only if there is a policy \({\varvec{{\varPi }}}\) for the concurrent MDP that achieves the weighted value g.

The first part of our construction is the set of states, which are arranged in three groups.

  • First, we create a specially designated sink state \(s_0\) which yields 0 reward in both MDPs.

  • Then, we translate the variables of the Boolean satisfiability problem into states of the MDPs: for each variable x we create two states \(s_{x}, s_{x}'\) in both MDPs. The reward is 0 in states of the form \(s_{x}\) and 1 in states of the form \(s_{x}'\).

  • Last, we create states for clauses: for each clause C, a state \(s_C\) is created. Again, the reward in \(s_C\) is zero for all clauses.

The second part of the construction is the set of actions. We create three actions \({\mathcal {A}}= \{1, 2, 3\}\) with the following semantics.

  • In the sink state \(s_0\), we set \({\varvec{P}}^a_k(s_0, s_0) = 1\) for all \(a \in {\mathcal {A}}\) and \(k \in {\mathcal {M}}\).

  • In the variable states, we define \({\varvec{P}}^1_1(s_x, s_{x}') = {\varvec{P}}^2_1(s_x, s_0) = {\varvec{P}}^3_1(s_x, s_0) = 1\) and \({\varvec{P}}^1_2(s_x, s_0) = {\varvec{P}}^2_2(s_x, s_{x}') = {\varvec{P}}^3_2(s_x, s_{x}') = 1\); that is, the actions in \(s_x\) lead to different outcomes in the two MDPs. The motivation is to force a mutually exclusive choice of values for the Boolean variables in the concurrent MDP. In the auxiliary variable states, we set \({\varvec{P}}^a_k(s_x', s_x) = 1\) for all actions \(a \in {\mathcal {A}}\) and MDPs \(k \in {\mathcal {M}}\); the idea behind these states is to exploit the non-linearity of the problem. The construction is visualized in Fig. 3, where the upper part corresponds to the first MDP and the lower part corresponds to the second MDP in \({\mathcal {M}}\).

  • In the clause states, we define actions as follows. In a clause \(C = L_1 \vee L_2 \vee L_3\), the chosen action represents the literal that evaluates to true. Hence, we define \({\varvec{P}}^a_k(s_C, s)\) by setting \({\varvec{P}}^a_k(s_C, s) = 1\) in the cases

    • \(L_a = x, k = 1, s = s_x\)

    • \(L_a = \lnot x, k = 1, s = s_0\)

    • \(L_a = \lnot x, k = 2, s = s_x\)

    • \(L_a = x, k = 2, s = s_0\)

A graphical sketch of this setup can be seen in Fig. 4. Again, the upper part of the drawing corresponds to the first MDP in \({\mathcal {M}}\) while the lower part corresponds to the second MDP.

The idea behind this construction is to infer functions \(\beta :\{1, \ldots , n\} \rightarrow \{0, 1\}\) that map variables to truth values and \(\nu :\{1, \ldots , m\} \rightarrow \{1, 2, 3\}\) that map each clause to the index of a satisfied literal. This creates a mapping from policies to variable assignments of the SAT problem. Furthermore, we define the initial distribution \(\alpha \) with \(\alpha (s_C) = {1}/{m}\) for all clauses C in both MDPs and the weights \(\varvec{w}= ({1}/{2}, {1}/{2})\). Concerning the value, we set an auxiliary constant \(q := \frac{1}{1 - \gamma ^2}\) and the required value \(g := \frac{\gamma ^2 q}{2}\), where \(\gamma \) is a non-zero discount factor in the concurrent MDP.

Fig. 4: The clause gadget
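To summarize the construction compactly, here is a minimal Python sketch (our illustration only; the function name build_reduction, the signed-integer clause encoding and the 0-based state, action and MDP indices are assumptions of this sketch, not notation from the paper) that builds the two transition tensors, the common reward vector, the initial distribution, the weights and the target value g.

```python
import numpy as np

def build_reduction(n_vars, clauses, gamma):
    """Illustrative sketch of the 3-SAT reduction (two MDPs, shared states/actions).

    clauses: list of 3-tuples of signed 1-based variable indices,
             e.g. (1, -2, 3) encodes x1 or not-x2 or x3.
    """
    m = len(clauses)
    s0 = 0                                             # sink state
    s_var = {v: 1 + 2 * (v - 1) for v in range(1, n_vars + 1)}   # s_x
    s_aux = {v: 2 + 2 * (v - 1) for v in range(1, n_vars + 1)}   # s_x'
    s_cls = {c: 1 + 2 * n_vars + c for c in range(m)}            # s_C
    N, A, K = 1 + 2 * n_vars + m, 3, 2

    P = np.zeros((K, A, N, N))
    r = np.zeros(N)
    r[list(s_aux.values())] = 1.0                      # reward 1 only in states s_x'

    P[:, :, s0, s0] = 1.0                              # sink loops under every action
    for v in range(1, n_vars + 1):
        sx, sxp = s_var[v], s_aux[v]
        P[0, 0, sx, sxp] = 1.0                         # MDP 1: action 1 -> s_x'
        P[0, 1, sx, s0] = P[0, 2, sx, s0] = 1.0        # MDP 1: actions 2, 3 -> sink
        P[1, 0, sx, s0] = 1.0                          # MDP 2: action 1 -> sink
        P[1, 1, sx, sxp] = P[1, 2, sx, sxp] = 1.0      # MDP 2: actions 2, 3 -> s_x'
        P[:, :, sxp, sx] = 1.0                         # s_x' returns to s_x everywhere
    for c, lits in enumerate(clauses):
        sc = s_cls[c]
        for a, lit in enumerate(lits):                 # action a picks literal L_a
            sx = s_var[abs(lit)]
            if lit > 0:                                # positive literal x
                P[0, a, sc, sx], P[1, a, sc, s0] = 1.0, 1.0
            else:                                      # negated literal
                P[0, a, sc, s0], P[1, a, sc, sx] = 1.0, 1.0

    alpha = np.zeros(N)
    alpha[list(s_cls.values())] = 1.0 / m              # uniform over clause states
    w = np.array([0.5, 0.5])
    q = 1.0 / (1.0 - gamma ** 2)
    g = gamma ** 2 * q / 2.0
    return P, r, alpha, w, g
```

For instance, build_reduction(2, [(1, -2, 2)], 0.9) produces a concurrent MDP with \(1 + 2\cdot 2 + 1 = 6\) states.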

We prove the validity of the reduction. First, we show that if there is an assignment \(\beta :\{1, \ldots , n\} \rightarrow \{0, 1\}\) that satisfies the SAT instance, then there also exists a policy \(\pi \) such that \(\sum _{k=1}^{K} \varvec{w}(k) G^{{\varvec{{\varPi }}}}_k \ge g\). We construct the policy in two steps. In the first step, we set \(\pi _{s_x}(1) = 1 \Leftrightarrow \beta (x) = 1\) for all variables x. In the second step, the existence of a satisfying assignment implies that every clause contains a satisfied literal, which yields a function \(\nu :\{1, \ldots , m\} \rightarrow \{1, 2, 3\}\) that selects the index of a satisfied literal in each clause. Thus, we set \(\pi _{s_C}(a) = 1 \Leftrightarrow \nu (C) = a\).

We verify that the constructed policy yields the given value. As the satisfying literal is chosen in each clause, the value of the corresponding clause state is 0 in one MDP and \(\gamma ^2 q\) in the other one.
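In more detail, the chosen action leads from the clause state to the sink \(s_0\) (value 0) in one of the two MDPs and to the variable state \(s_x\) in the other one; from \(s_x\) the pure policy alternates between \(s_x\) and \(s_{x}'\) and collects reward 1 every second step, so that the value of the clause state in that MDP and the weighted contribution per clause state are

$$\begin{aligned} \sum \limits _{j \ge 1} \gamma ^{2j} = \frac{\gamma ^2}{1-\gamma ^2} = \gamma ^2 q \quad \text{ and } \quad \frac{1}{2}\left( 0 + \gamma ^2 q\right) = \frac{\gamma ^2 q}{2} = g. \end{aligned}$$

Since \(\alpha (s_C) = {1}/{m}\) for each of the m clause states in both MDPs, the weighted sum of gains equals g.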

Now we show that if there is no satisfying assignment, then the value of the concurrent MDP is lower than g. Given any assignments \(\beta \) and \(\nu \), the induced policy leads from at least one clause state to the sink state \(s_0\) with nonzero probability in both MDPs, yielding a lower value. However, we must also take care of stationary but non-pure policies that might still attain the desired value. One can observe that if the stationary policy is not pure in a state \(s_x\) for a variable x, then the cumulative discounted reward in this state is \(\frac{p\gamma }{1 - p^2\gamma ^2}\) for some real \(0< p < 1\). Deriving the value of a clause state from which this variable state can be reached, we get, summing over both MDPs, the summand

$$\begin{aligned} \frac{1}{2} \left( \frac{p \gamma ^2}{1 - p^2 \gamma ^2} + \frac{(1 - p) \gamma ^2}{1 - (1 - p)^2 \gamma ^2} \right) \end{aligned}$$

Let \(f(p) = \frac{p}{1 - p^2 \gamma ^2} + \frac{1 - p}{1 - (1 - p)^2 \gamma ^2}\). Computing the derivative, we obtain

$$\begin{aligned} f'(p) = \frac{2 \gamma ^2 p^2}{ \left( 1 -\gamma ^2 p^2 \right) ^2 } - \frac{2 \gamma ^2 (1 - p)^2}{ \left( 1 -\gamma ^2 (1 - p)^2 \right) ^2 } + \frac{1}{1 - \gamma ^2 p^2} - \frac{1}{1 - \gamma ^2 (1 - p)^2} \end{aligned}$$

which has its roots at

$$\begin{aligned} \frac{1}{2}, \frac{1 \pm \sqrt{1 - 4\gamma ^{-2} + 4\gamma ^{-2} \sqrt{4 - \gamma ^{2}} }}{2}, \frac{1 \pm i \sqrt{1 - 4\gamma ^{-2} + 4\gamma ^{-2} \sqrt{4 - \gamma ^{2}} }}{2}. \end{aligned}$$

The only roots of interest are the real ones, and thus, we investigate the pair

$$\begin{aligned} \frac{1 \pm \sqrt{1 - 4\gamma ^{-2} + 4\gamma ^{-2} \sqrt{4 - \gamma ^{2}} } }{2}. \end{aligned}$$
(29)

It can be seen that for \(0< \gamma < 1\), the value \(\sqrt{4 - \gamma ^2}\) is at least \(\sqrt{3} > 1\), and the root term in (29) is thus greater than one. This means that the whole term (29) is either greater than one or negative. Hence, the possible extreme points of f in [0, 1] lie at 0, 1, or \({1}/{2}\). We see that \(f(0) = f(1) = q\) while \(f({1}/{2}) = \frac{1}{1 - {1}/{4} \gamma ^2} < q\). Hence, a non-pure policy in a variable state yields a lower cumulative discounted reward. Concerning the clause states, we observe that a non-pure policy cannot yield higher rewards than a pure one, as the expected discounted reward in a clause state is linear in the expected discounted rewards of the subsequent variable states; the clause states are not visited again.
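As a quick numerical sanity check of this extremum analysis (our addition, with \(\gamma = 0.9\) chosen arbitrarily), one can evaluate f on a grid and confirm that its maximum on [0, 1] is q, attained at the endpoints:

```python
import numpy as np

gamma = 0.9
q = 1.0 / (1.0 - gamma ** 2)
p = np.linspace(0.0, 1.0, 100001)
f = p / (1.0 - p ** 2 * gamma ** 2) + (1.0 - p) / (1.0 - (1.0 - p) ** 2 * gamma ** 2)
print(f.max(), q)                    # grid maximum equals q (attained at p = 0 and 1)
print(f[len(p) // 2], 1.0 / (1.0 - gamma ** 2 / 4.0))   # f(1/2) = 1/(1 - gamma^2/4) < q
assert np.all(f <= q + 1e-12)        # non-pure choices never exceed the pure value q
```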


Cite this article

Buchholz, P., Scheftelowitsch, D. Computation of weighted sums of rewards for concurrent MDPs. Math Meth Oper Res 89, 1–42 (2019). https://doi.org/10.1007/s00186-018-0653-1
