Abstract
We consider sets of Markov decision processes (MDPs) with shared state and action spaces and assume that the individual MDPs in such a set represent different scenarios for a system’s operation. In this setting, we address the problem of finding a single policy that performs well under each of these scenarios by optimizing the weighted sum of the value vectors of the individual scenarios. We discuss the general complexity of the problem, present several solution approaches, and derive algorithms based on them. Finally, we compare the derived algorithms on a set of benchmark problems.
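To fix ideas, the following minimal sketch (ours, not taken from the paper; the array names and shapes are assumptions) evaluates one fixed randomized stationary policy on K scenario MDPs with shared state and action spaces and forms the weighted sum of the resulting discounted values.

```python
# Minimal sketch: weighted sum of discounted values of one policy over K scenarios.
# Assumed shapes: P (K, A, S, S) transition matrices, R (K, S) state rewards,
# pi (S, A) randomized stationary policy, alpha (S,) initial distribution,
# w (K,) scenario weights, gamma in (0, 1) discount factor.
import numpy as np

def weighted_value(P, R, pi, alpha, w, gamma):
    K, A, S, _ = P.shape
    total = 0.0
    for k in range(K):
        # Policy-induced transition matrix: P_pi(s, s') = sum_a pi(s, a) * P_k^a(s, s')
        P_pi = np.einsum('sa,ast->st', pi, P[k])
        # Discounted value vector of scenario k: v = (I - gamma * P_pi)^{-1} r_k
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, R[k])
        total += w[k] * (alpha @ v)
    return total
```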


Notes
Equivalently, the corresponding slack variable is zero.
Alternatively, an Octave implementation using the solvers sqp and glpk is available and shows similar performance.
Appendices
Appendix A: Action-dependent rewards
Assume that rewards of a concurrent MDP depend on actions and successor states. Then \({\varvec{R}}_k^a(s,s')\) is the reward that is obtained in state s of MDP k if action a is chosen and \(s'\) is the successor state. Let
be a concurrent MDP with action-dependent rewards. We assume that this MDP is to be analyzed for the discounted reward with discount factor \(\gamma \in (0,1)\). Transforming it into an MDP whose rewards do not depend on the action results in the following MDP, which depends on \(\gamma \).
Observe that the state space \({\tilde{{\mathcal {S}}}}\) consists of states \(s \in {\mathcal {S}}\) and states \((s,a,s')\). The modified MDP alternates between states \(s \in {\tilde{{\mathcal {S}}}} \cap {\mathcal {S}}\) and states \((s,a,s') \in {\tilde{{\mathcal {S}}}} {\setminus } {\mathcal {S}}\). Rewards are only gained in the latter states.
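The display equations defining \({\tilde{{\mathcal {S}}}}\), the matrices \({\varvec{{\tilde{P}}}}^a_k\) and the reward vector \(\varvec{{\tilde{r}}}\) are not reproduced above. The following Python sketch shows one natural instantiation of the construction just described for a single MDP of the set; the dictionary layout, the function names and in particular the reward scaling by \(1/\sqrt{\gamma }\) (chosen so that one step at discount \(\gamma \) matches two steps at discount \(\sqrt{\gamma }\) under the convention that a state's reward is collected when the state is occupied) are our assumptions and may differ from the paper's exact definitions.

```python
import math

def expand_mdp(S, A, P, R, gamma):
    """Insert intermediate states (s, a, s') so that rewards depend only on the state.
    P[a][s][s'] are transition probabilities, R[a][s][s'] action-dependent rewards."""
    inter = [(s, a, t) for a in A for s in S for t in S if P[a][s].get(t, 0) > 0]
    S_t = list(S) + inter                                   # \tilde{S}
    P_t = {a: {u: {} for u in S_t} for a in A}              # \tilde{P}^a
    r_t = {u: 0.0 for u in S_t}                             # \tilde{r}, zero on original states
    for a in A:
        for s in S:
            for t in S:
                if P[a][s].get(t, 0) > 0:
                    P_t[a][s][(s, a, t)] = P[a][s][t]       # first half-step: choose a in s
        for (s, b, t) in inter:
            P_t[a][(s, b, t)][t] = 1.0                      # second half-step: move on to s'
    for (s, a, t) in inter:
        # Reward scaling is an assumption; the paper may fold this factor into
        # \tilde{P} or \tilde{r} differently.
        r_t[(s, a, t)] = R[a][s][t] / math.sqrt(gamma)
    return S_t, P_t, r_t

def lift_policy(pi, S_t, A):
    """Extend a policy on the original states to \tilde{S}; the action chosen in an
    intermediate state is irrelevant, so any fixed distribution works there."""
    uniform = {a: 1.0 / len(A) for a in A}
    return {u: pi.get(u, uniform) for u in S_t}
```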
For some policy \({\varvec{{\varPi }}}\) defined on \({\mathcal {S}}\) we define a policy \({\varvec{{\tilde{{\varPi }}}}}\) on \({\tilde{{\mathcal {S}}}}\) as follows
Thus, a policy for the MDP with action-dependent rewards can be uniquely translated into a policy for the MDP with action-independent rewards. Define the following sequences of vectors
The vectors \(\varvec{g}^h\) and \(\varvec{{\tilde{g}}}^h\) contain the expected discounted rewards accumulated over the course of h transitions under the policies \({\varvec{{\varPi }}}\) and \({\varvec{{\tilde{{\varPi }}}}}\), respectively. Now we show the main relation between \(\varvec{g}^h\) and \(\varvec{{\tilde{g}}}^h\).
Theorem 6
For all \(s \in {\mathcal {S}}\) and all \(h \in {{\mathbb {N}}}_0\) we have \(\varvec{g}^h(s) = \varvec{{\tilde{g}}}^{2h}(s) = \varvec{{\tilde{g}}}^{2h+1}(s)\).
Proof
We prove the correspondence of the values in the vectors for h and 2h by induction. For \(h=0\) we have by definition of the vectors \(\varvec{g}^0(s) = \varvec{{\tilde{g}}}^0(s) = 0\). Now assume that the correspondence holds for \(h \ge 0\), then
which implies that it holds for \(h+1\). Now we consider \(\varvec{{\tilde{g}}}^{2h+1}(s)\). For \(h = 0\) we have
because \(\varvec{{\tilde{r}}}(s) = 0\) and \(\varvec{{\tilde{g}}}^0={\mathbf {0}}\). Now assume that the result holds for \(2h+1\); we show that it holds for \(2h+3\).
\(\square \)
Let \(G^{\varvec{{\varPi }}}(\gamma ) = \varvec{\alpha }\lim _{h \rightarrow \infty } \varvec{g}^h\) and \({\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) = \varvec{{\tilde{\alpha }}}\lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^h\) be the discounted gains of the two MDPs with discount factors \(\gamma \in (0,1)\) and \(\sqrt{\gamma }\), respectively.
Theorem 7
For any policy \({\varvec{{\varPi }}}\) and any discount factor \(\gamma \in (0,1)\), the relation \(G^{\varvec{{\varPi }}}(\gamma ) = {\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma })\) holds.
Proof
First, we show the existence of \(G^{\varvec{{\varPi }}}(\gamma )\). Since \({\varvec{R}}^a_k(s, s')\) is finite for all \(a \in {\mathcal {A}}\) and all \(s, s' \in {\mathcal {S}}\) and can be bounded by a real value \(R \ge \left| {{\varvec{R}}^a_k(s, s')}\right| \), the sequence \(\left( \varvec{g}^h\right) _{h \in {{\mathbb {N}}}_0}\) satisfies
By induction, it follows that
for \(i \in {{\mathbb {N}}}\), which shows that the sequence satisfies the Cauchy criterion and therefore converges. Then Theorem 6 implies that \(\varvec{g}(s) = \lim _{h \rightarrow \infty } \varvec{g}^h(s) = \lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^{2h}(s) = \lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^h(s) = \varvec{{\tilde{g}}}(s)\) for all \(s \in {\mathcal {S}}\) and we have
\(\square \)
Analogously the following result can be shown.
Theorem 8
For any policy \({\varvec{{\tilde{{\varPi }}}}}\) in the action-independent MDP \(\left( {{\tilde{{\mathcal {S}}}}, \varvec{{\tilde{\alpha }}}, \left( {{\varvec{{\tilde{P}}}}^a_k}\right) _{a \in {\mathcal {A}}}, \varvec{{\tilde{r}}}}\right) \), the policy \({\varvec{{\varPi }}}\) in the action-dependent MDP obtained by projecting \({\varvec{{\tilde{{\varPi }}}}}\) onto \({\mathcal {S}}\) satisfies \({\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) = G^{\varvec{{\varPi }}}(\gamma )\).
This implies that every policy (including the optimal policy) in the MDP with action-independent rewards yields the same expected discounted reward as its projection in the MDP with action-dependent rewards. Together with Theorem 7, this establishes a one-to-one correspondence between the policies of the two MDPs that preserves the expected discounted reward; in particular, optimal policies in one MDP are also optimal in the other. This completes the transformation from the action-dependent reward model.
Appendix B: Proof of Theorem 1
Theorem 1
The decision problem defined in Definition 1 is NP-complete.
We perform a reduction from 3-SAT. Given a 3-SAT instance with n variables \(x_1, \ldots , x_n\) and m clauses \(C_1, \ldots , C_m\), each containing three literals, we construct a concurrent MDP \({\mathcal {M}}= \{1, 2\}\) consisting of two MDPs, a vector \(\varvec{w}\in {{\mathbb {R}}}^2\) and a real number \(g \in {{\mathbb {R}}}\) such that the instance is satisfiable if and only if there is a policy \({\varvec{{\varPi }}}\) for the concurrent MDP that achieves a weighted value of at least g.
The first part of our construction is the set of states, which are arranged in three groups.
-
First, we create a specially designated sink state \(s_0\) which yields 0 reward in both MDPs.
-
Then, we translate the variables of the Boolean satisfiability problem into states of the MDPs: for each variable x we create two states \(s_{x}, s_{x}'\) in both MDPs. The reward is 0 in the states \(s_{x}\) and 1 in the states \(s_{x}'\).
-
Last, we create states for clauses: for each clause C, a state \(s_C\) is created. Again, the reward in \(s_C\) is zero for all clauses.
The second part of the construction is the set of actions. We create three actions \({\mathcal {A}}= \{1, 2, 3\}\) with the following semantics.
-
In the sink state \(s_0\), we set \({\varvec{P}}^a_k(s_0, s_0) = 1\) for all \(a \in {\mathcal {A}}\) and \(k \in {\mathcal {M}}\).
-
In the variable states, we define \({\varvec{P}}^1_1(s_x, s_{x}') = {\varvec{P}}^2_1(s_x, s_0) = {\varvec{P}}^3_1(s_x, s_0) = 1\) and \({\varvec{P}}^1_2(s_x, s_0) = {\varvec{P}}^2_2(s_x, s_{x}') = {\varvec{P}}^3_2(s_x, s_{x}') = 1\), that is, the actions in \(s_x\) lead to different outcomes in the two MDPs. The motivation is to force a mutually exclusive choice of values for the Boolean variables in the concurrent MDP. In the auxiliary variable states, we set \({\varvec{P}}^a_k(s_x', s_x) = 1\) for all actions \(a \in {\mathcal {A}}\) and MDPs \(k \in {\mathcal {M}}\); the idea behind these states is to exploit the non-linearity of the problem. The construction is visualized in Fig. 3, where the upper part corresponds to the first MDP and the lower part corresponds to the second MDP in \({\mathcal {M}}\).
-
In the clause states, we define actions as follows. In a clause \(C = L_1 \vee L_2 \vee L_3\), the chosen action represents the literal that is supposed to evaluate to true. Hence, we define \({\varvec{P}}^a_k(s_C, s)\) by setting \({\varvec{P}}^a_k(s_C, s) = 1\) in the following cases:
-
\(L_a = x, k = 1, s = s_x\)
-
\(L_a = \lnot x, k = 1, s = s_0\)
-
\(L_a = \lnot x, k = 2, s = s_x\)
-
\(L_a = x, k = 2, s = s_0\)
-
A graphical sketch of this setup can be seen in Fig. 4. Again, the upper part of the drawing corresponds to the first MDP in \({\mathcal {M}}\) while the lower part corresponds to the second MDP.
The idea behind this construction is to infer functions \(\beta :\{1, \ldots , n\} \rightarrow \{0, 1\}\) that map variables to truth values and \(\nu :\{1, \ldots , m\} \rightarrow \{1, 2, 3\}\) that map the clauses to the indices of their satisfying literals. This is done to create a mapping from policies to variable assignments of the SAT instance. Furthermore, we define the initial distribution \(\alpha \) with \(\alpha (s_C) = {1}/{m}\) for all clauses C for both MDPs and the weights \(\varvec{w}= ({1}/{2}, {1}/{2})\). Concerning the value, we set an auxiliary constant \(q := \frac{1}{1 - \gamma ^2}\) and the required value \(g := \frac{\gamma ^2 q}{2}\), where \(\gamma \) is a non-zero discount factor of the concurrent MDP.
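To make the construction concrete, the following Python sketch (our own encoding, not taken from the paper; all identifiers and state labels are hypothetical) assembles the two MDPs, the rewards, the initial distribution, the weights and the threshold \(g\) from a 3-SAT instance given as clauses of signed variable indices.

```python
# Sketch of the reduction described above. Clauses are triples of signed integers,
# e.g. (1, -2, 3) encodes x1 or (not x2) or x3.
def build_concurrent_mdp(n_vars, clauses, gamma):
    S = ['s0'] + [('x', i) for i in range(1, n_vars + 1)] \
               + [('xp', i) for i in range(1, n_vars + 1)] \
               + [('C', j) for j in range(len(clauses))]
    A = (1, 2, 3)
    # Reward 1 only in the auxiliary variable states s_x'.
    r = {s: (1.0 if isinstance(s, tuple) and s[0] == 'xp' else 0.0) for s in S}
    P = {k: {a: {s: {} for s in S} for a in A} for k in (1, 2)}
    for k in (1, 2):
        for a in A:
            P[k][a]['s0']['s0'] = 1.0                        # absorbing sink state
            for i in range(1, n_vars + 1):
                P[k][a][('xp', i)][('x', i)] = 1.0           # s_x' always returns to s_x
                # In s_x, action 1 means "x is true": it leads to the rewarded state s_x'
                # in MDP 1 and to the sink in MDP 2 (and vice versa for actions 2 and 3).
                rewarded = (k == 1 and a == 1) or (k == 2 and a != 1)
                P[k][a][('x', i)][('xp', i) if rewarded else 's0'] = 1.0
            for j, clause in enumerate(clauses):
                lit = clause[a - 1]                          # literal selected by action a
                i, positive = abs(lit), lit > 0
                # A positive literal leads to s_x in MDP 1, a negated one in MDP 2.
                target = ('x', i) if positive == (k == 1) else 's0'
                P[k][a][('C', j)][target] = 1.0
    alpha = {s: (1.0 / len(clauses) if isinstance(s, tuple) and s[0] == 'C' else 0.0)
             for s in S}
    w = (0.5, 0.5)
    q = 1.0 / (1.0 - gamma ** 2)
    g = gamma ** 2 * q / 2.0
    return S, A, P, r, alpha, w, g
```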
We prove the validity of the reduction. First, we show that if there is an assignment \(\beta :\{1, \ldots , n\} \rightarrow \{0, 1\}\) that satisfies the SAT instance, then there also exists a policy \({\varvec{{\varPi }}}\) such that \(\sum _{k=1}^{K} \varvec{w}(k) G^{{\varvec{{\varPi }}}}_k \ge g\). We construct the policy in two steps. In the first step, we set \(\pi _{s_x}(1) = 1 \Leftrightarrow \beta (x) = 1\) for all variables x. In the second step, the existence of a satisfying assignment implies that each clause contains a satisfied literal, which yields a function \(\nu :\{1, \ldots , m\} \rightarrow \{1, 2, 3\}\) selecting the index of a satisfied literal in every clause. Thus, we set \(\pi _{s_C}(a) = 1 \Leftrightarrow \nu (C) = a\).
We verify that the constructed policy yields the required value. Since in each clause a satisfied literal is chosen, the value of the corresponding clause state is 0 in one MDP and \(\gamma ^2 q\) in the other.
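Spelling this out (writing \(V_k(s_C)\) for the expected discounted reward of MDP k when started in \(s_C\), a notation used only for this computation), the weighted gain of the constructed policy is \(\sum _{k=1}^{2} \varvec{w}(k) G^{{\varvec{{\varPi }}}}_k = \frac{1}{m}\sum _{C}\frac{V_1(s_C)+V_2(s_C)}{2} = \frac{1}{m}\sum _{C}\frac{0+\gamma ^2 q}{2} = \frac{\gamma ^2 q}{2} = g\), so the constructed policy indeed attains the required value.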
Now we show that if there is no satisfying assignment, then the value of the concurrent MDP is lower than g. Given any assignment \(\beta \) and any assignment \(\nu \), the induced policy leads from at least one clause state to the sink state \(s_0\) with nonzero probability in both MDPs, yielding a lower value. However, we must take care of stationary but non-pure policies that might still attain the desired value. One can observe that if the stationary policy is not pure in a state \(s_x\) for a variable x, then the cumulative discounted reward in this state is \(\frac{p\gamma }{1 - p^2\gamma ^2}\) for some real \(0< p < 1\). Deriving the value of a clause state from which this variable state can be reached, we get, summing over both MDPs, a summand
Let \(f(p) = \frac{p}{1 - p^2 \gamma ^2} + \frac{1 - p}{1 - (1 - p)^2 \gamma ^2}\). Computing the derivative, we obtain
which has its roots at
The only roots of interest are the real ones, and thus, we investigate the pair
It can be seen that for \(0< \gamma < 1\), the value \(\sqrt{4 - \gamma ^2}\) is at least \(\sqrt{3} > 1\), and the root term in (29) is thus greater than one. This means that the whole term (29) is either greater than one or negative. Hence, the only candidates for extreme points of f on [0, 1] are the endpoints 0 and 1 and the interior critical point \({1}/{2}\). We see that \(f(0) = f(1) = q\), while \(f({1}/{2}) = \frac{1}{1 - {1}/{4} \gamma ^2} < q\). Hence, a non-pure policy in a variable state yields a lower cumulative discounted reward. Concerning the clause states, we observe that a non-pure policy cannot yield higher rewards than a pure one, since the expected discounted reward in a clause state is linear in the expected discounted rewards of the following variable states; the clause states are not visited again.
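The claim about f can also be checked numerically. The following snippet (ours, not part of the proof) evaluates f on a grid for several discount factors and confirms that its maximum on \([0,1]\) is attained at the endpoints, where \(f(0)=f(1)=q\), while \(f(1/2)=\frac{1}{1-\gamma ^2/4}<q\).

```python
# Numerical sanity check: f(p) = p/(1 - p^2 g^2) + (1-p)/(1 - (1-p)^2 g^2)
# attains its maximum on [0, 1] at p = 0 and p = 1, where it equals q = 1/(1 - g^2).
import numpy as np

for gamma in (0.3, 0.7, 0.9, 0.99):
    p = np.linspace(0.0, 1.0, 10001)
    f = p / (1 - p**2 * gamma**2) + (1 - p) / (1 - (1 - p)**2 * gamma**2)
    q = 1.0 / (1.0 - gamma**2)
    assert f.max() <= q + 1e-9                                         # interior values stay below q
    assert abs(f[len(p) // 2] - 1.0 / (1.0 - 0.25 * gamma**2)) < 1e-9  # value at p = 1/2
    print(f"gamma={gamma}: max f = {f.max():.6f}, q = {q:.6f}, f(1/2) = {f[len(p)//2]:.6f}")
```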