Abstract
We consider two-person zero-sum mean payoff undiscounted stochastic games and obtain sufficient conditions for the existence of a saddle point in uniformly optimal stationary strategies. Namely, these conditions enable us to bring the game, by applying potential transformations, to a canonical form in which locally optimal strategies are globally optimal, and hence the value for every initial position and the optimal strategies of both players can be obtained by playing the local game at each state. We show that these conditions hold for the class of additive transition (AT) games, that is, the special case in which the transitions at each state can be decomposed into two parts, each controlled completely by one of the two players. An important special case of AT-games is formed by the so-called BWR-games, which are played by two players on a directed graph with positions of three types: Black, White, and Random. We give an independent proof of the existence of a canonical form in such games, and use this result to derive the existence of a canonical form (and hence of a saddle point in uniformly optimal stationary strategies) in a wide class of games, which includes stochastic games with perfect information (PI), switching controller (SC) games, and additive rewards, additive transition (ARAT) games. Unlike the proof for AT-games, our proof for the BWR-case does not rely on the existence of a saddle point in stationary strategies. We also derive some algorithmic consequences from our reductions to BWR-games, allowing PI- and ARAT-games to be solved in sub-exponential time.
Notes
Shapley’s original stochastic games were assumed to have positive stopping probabilities, i.e., at each state v, \(\sum_{u\in V} p_{k\ell}^{vu} <1\), and with probability \(1-\sum_{u\in V} p_{k\ell}^{vu}\), the game stops at state v if actions k and ℓ are selected by the players.
Note that a BW-game on an arbitrary digraph G=(V_B∪V_W,E) can be reduced to a BW-game on a bipartite graph \(G'=(V_{B}'\cup V_{W}', E')\) (in which the values are halved) by splitting every v∈V_B into two nodes \(v'\in V_{W}'\) and \(v''\in V_{B}'\) joined by an additional arc (v′,v″)∈E′, and by subdividing each arc (v,u)∈E with v∈V_W by a new node \(v_{u}\in V_{B}'\); all new arcs have reward 0.
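This reduction is straightforward to implement. Below is a minimal sketch; the data layout (reward-labelled arc dictionaries and tagged node tuples) and the choice to keep the original reward on the first half of each subdivided arc are our own, not prescribed by the construction above:

```python
def reduce_to_bipartite(VB, VW, E):
    """Reduce a BW-game on an arbitrary digraph to a bipartite BW-game.

    VB, VW: sets of Black and White positions; E: dict (v, u) -> reward.
    Returns (VB2, VW2, E2). Every original node is entered through its
    White copy ('w', v); the values of the new game are half the originals.
    """
    VB2, VW2, E2 = set(), set(), {}
    for v in VW:
        VW2.add(('w', v))
    for v in VB:
        # Split v into a White copy and a Black copy, linked by a 0-reward arc.
        VW2.add(('w', v))
        VB2.add(('b', v))
        E2[(('w', v), ('b', v))] = 0
    for (v, u), r in E.items():
        if v in VB:
            E2[(('b', v), ('w', u))] = r   # Black arcs leave the Black copy
        else:
            mid = ('m', v, u)              # subdivision node v_u in V_B'
            VB2.add(mid)
            E2[(('w', v), mid)] = r        # reward kept on the first half
            E2[(mid, ('w', u))] = 0
    return VB2, VW2, E2

# A 2-cycle 1 -> 2 -> 1 with rewards 5 and 3: the cycle mean drops from
# (5 + 3) / 2 = 4 to 8 / 4 = 2, i.e., the value is halved.
VB2, VW2, E2 = reduce_to_bipartite({1}, {2}, {(1, 2): 5, (2, 1): 3})
assert all((a in VW2) != (b in VW2) for (a, b) in E2)  # arcs alternate sides
```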
We thank an anonymous reviewer for pointing out this connection to us.
log denotes the logarithm to the base 2.
This assumption is without loss of generality since one can add a loop to each terminal vertex.
References
Andersson D, Miltersen PB (2009) The complexity of solving stochastic games on graphs. In: Proc 20th ISAAC. LNCS, vol 5878, pp 112–121
Beffara E, Vorobyov S (2001) Adapting Gurvich-Karzanov-Khachiyan’s algorithm for parity games: Implementation and experimentation. Technical Report 2001-020, Department of Information Technology, Uppsala University, available at https://www.it.uu.se/research/reports/#2001
Beffara E, Vorobyov S (2001) Is randomized Gurvich-Karzanov-Khachiyan’s algorithm for parity games polynomial? Technical Report 2001-025, Department of Information Technology, Uppsala University, available at https://www.it.uu.se/research/reports/#2001
Blackwell D (1962) Discrete dynamic programming. Ann Math Stat 33:719–726
Boros E, Elbassioni K, Gurvich V, Makino K (2009) Every stochastic game with perfect information admits a canonical form. RRR-09-2009, RUTCOR, Rutgers University
Boros E, Elbassioni K, Gurvich V, Makino K (2012) Discounted approximations of undiscounted stochastic games and Markov decision processes are already poor in the almost deterministic case. RRR-24-2012, RUTCOR, Rutgers University
Boros E, Elbassioni K, Gurvich V, Makino K (2010) A pumping algorithm for ergodic stochastic mean payoff games with perfect information. In: Proc 14th IPCO. LNCS, vol 6080. Springer, Berlin, pp 341–354
Boros E, Elbassioni K, Gurvich V, Makino K (2012) On canonical forms for two-person zero-sum limit average payoff stochastic games. RRR-15-2012, RUTCOR, Rutgers University
Boros E, Elbassioni K, Gurvich V, Makino K (2012) A potential reduction algorithm for two-person zero-sum limiting average payoff stochastic games. RRR-13-2012, RUTCOR, Rutgers University
Bewley T, Kohlberg E (1978) On stochastic games with stationary optimal strategies. Math Oper Res 3(2):104–125
Chatterjee K, Henzinger TA (2008) Reduction of stochastic parity to stochastic mean-payoff games. Inf Process Lett 106(1):1–7
Chatterjee K, Jurdziński M, Henzinger TA (2004) Quantitative stochastic parity games. In: Proc 15th SODA, pp 121–130
Condon A (1992) The complexity of stochastic games. Inf Comput 96:203–224
Ehrenfeucht A, Mycielski J (1979) Positional strategies for mean payoff games. Int J Game Theory 8:109–113
Federgruen A (1980) Successive approximation methods in undiscounted stochastic games. Oper Res 1:794–810
Filar JA (1981) Ordered field property for stochastic games when the player who controls transitions changes from state to state. J Optim Theory Appl 34(4):503–515
Flesch J, Thuijsman F, Vrieze OJ (2007) Stochastic games with additive transitions. Eur J Oper Res 179(2):483–497
Gallai T (1958) Maximum-minimum Sätze über Graphen. Acta Math Acad Sci Hung 9:395–434
Gillette D (1957) Stochastic games with zero stop probabilities. In: Dresher M, Tucker AW, Wolfe P (eds) Contribution to the theory of games III. Annals of mathematics studies, vol 39. Princeton University Press, Princeton, pp 179–187
Gurvich V, Karzanov A, Khachiyan L (1988) Cyclic games and an algorithm to find minimax cycle means in directed graphs. USSR Comput Math Math Phys 28:85–91
Halman N (2007) Simple stochastic games, parity games, mean payoff games and discounted payoff games are all LP-type problems. Algorithmica 49(1):37–50
Hardy GH, Littlewood JE (1931) Notes on the theory of series (xvi): two Tauberian theorems. J Lond Math Soc 6:281–286
Hoffman AJ, Karp RM (1966) On nonterminating stochastic games. Manag Sci, Ser A 12(5):359–370
Howard RA (1960) Dynamic programming and Markov processes. Technology Press and Wiley, New York
Jurdziński M (1998) Deciding the winner in parity games is in UP ∩ co-UP. Inf Process Lett 68(3):119–124
Jurdziński M, Paterson M, Zwick U (2006) A deterministic subexponential algorithm for solving parity games. In: Proc 17th SODA, pp 117–123
Karp RM (1978) A characterization of the minimum cycle mean in a digraph. Discrete Math 23:309–311
Kemeny JG, Snell JL (1963) Finite Markov chains. Springer, Berlin
Korevaar J (2010) Tauberian theory: a century of developments. Grundlehren der mathematischen Wissenschaften. Springer, Berlin
Kratsch D, McConnell RM, Mehlhorn K, Spinrad JP (2003) Certifying algorithms for recognizing interval graphs and permutation graphs. In: SODA’03: proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp 158–167
Krishnamurthy N, Parthasarathy T, Ravindran G (2010) Orderfield property of mixtures of stochastic games. Math Stat Probab 72(1):246–275
Liggett TM, Lippman SA (1969) Stochastic games with perfect information and time-average payoff. SIAM Rev 11(4):604–607
Mertens JF, Neyman A (1981) Stochastic games. Int J Game Theory 10:53–66
Miltersen PB (2011) Discounted stochastic games poorly approximate undiscounted ones. Manuscript. Technical report
Mine H, Osaki S (1970) Markovian decision process. Elsevier, New York
Moulin H (1976) Extension of two person zero sum games. J Math Anal Appl 5(2):490–507
Moulin H (1976) Prolongement des jeux à deux joueurs de somme nulle. Bull Soc Math Fr Mem 45
Pisaruk NN (1999) Mean cost cyclical games. Math Oper Res 24(4):817–828
Parthasarathy T, Raghavan TES (1981) An orderfield property for stochastic games when one player controls transition probabilities. J Optim Theory Appl 33:375–392. doi:10.1007/BF00935250
Raghavan TES, Tijs SH, Vrieze OJ (1985) On stochastic games with additive reward and transition structure. J Optim Theory Appl 47:451–464. doi:10.1007/BF00942191
Shapley LS (1953) Stochastic games. Proc Natl Acad Sci USA 39:1095–1100
Shapley LS, Snow RN (1950) Basic solutions of discrete games. Ann Math Stud 24:27–35
Sinha S (1989) A contribution to the theory of stochastic games. PhD thesis, Indian Statistical Institute, New Delhi, India
Sznajder R, Filar JA (1992) Some comments on a theorem of Hardy and Littlewood. J Optim Theory Appl 75(1):201–208
von Neumann J (1928) Zur Theorie der Gesellschaftsspiele. Math Ann 100:295–320
Vrieze OJ (1980) Stochastic games with finite state and action spaces. PhD thesis, Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands
Zwick U, Paterson M (1996) The complexity of mean payoff games on graphs. Theor Comput Sci 158(1–2):343–359
Acknowledgements
We thank the anonymous referees for careful reading and many helpful remarks. This research was partially supported by a Scientific Grant-in-Aid from the Ministry of Education, Science, Sports, and Culture of Japan. The first author also acknowledges the partial support of NSF Grants IIS-0803444 and CMMI-0856663.
Additional information
Part of this research was done at the Mathematisches Forschungsinstitut Oberwolfach during a stay within the Research in Pairs Program from March 7 to March 18, 2011.
Appendices
Appendix A: Related Results from the Theory of Markov Chains
Given an n×n transition matrix P, the Cesàro averages \(\frac{1}{k+1} \sum_{i=0}^{k} P^{i}\) converge, as k→∞, to the limit Markov matrix Q, which satisfies:

(i) PQ=QP=QQ=Q;

(ii) \(\operatorname {rank}(I - P) + \operatorname {rank} Q = n\);

(iii) for each n-vector c, the system Px=x, Qx=Qc has the unique solution x=Qc;

(iv) the matrix I−(P−Q) is nonsingular and
$$ H(\delta ) = \sum_{i = 0}^\infty \delta ^i \bigl(P^i - Q\bigr) \rightarrow H = \bigl(I - (P - Q)\bigr)^{-1} - Q \quad \mbox{as } \delta \rightarrow1^-; $$
(62)

(v) \(H(\delta ) Q = Q H(\delta ) = H Q = Q H = 0 \quad \mbox{and} \quad (I - P) H = H (I - P) = I -Q. \)
Claim (iv) (which is used in Sect. 7.4) was proved in 1962 by Blackwell [4], while for the remaining four claims he cited the textbook on finite Markov chains by Kemeny and Snell [28] (which, in fact, appeared one year later, in 1963).
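These identities are easy to verify numerically. The following sketch checks claims (i), (ii), (iv), and (v) with numpy on a small ergodic chain; the 2×2 transition matrix is an illustrative choice of ours, not taken from the paper:

```python
import numpy as np

# A small ergodic transition matrix (illustrative numbers).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
n = P.shape[0]
I = np.eye(n)

# Stationary distribution: pi (I - P) = 0 with sum(pi) = 1; adding the
# all-ones rank-one matrix makes the linear system nonsingular.
pi = np.ones(n) @ np.linalg.inv(I - P + np.ones((n, n)))

# In the ergodic case the limit matrix Q has every row equal to pi.
Q = np.tile(pi, (n, 1))

# Deviation matrix H from claim (iv).
H = np.linalg.inv(I - (P - Q)) - Q

assert np.allclose(P @ Q, Q) and np.allclose(Q @ P, Q) and np.allclose(Q @ Q, Q)  # (i)
assert np.linalg.matrix_rank(I - P) + np.linalg.matrix_rank(Q) == n               # (ii)
assert np.allclose(H @ Q, 0) and np.allclose(Q @ H, 0)                            # (v)
assert np.allclose((I - P) @ H, I - Q)                                            # (v)
```

For this chain pi = (2/3, 1/3), and the matrix I−(P−Q) has determinant 0.3, confirming the nonsingularity in claim (iv).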
Appendix B: Proof of Lemma 3
Let us first fix the strategy \(\bar{\beta}\) of Black, and compute the uniformly best response of White by solving a controlled Markov chain problem (see, e.g., [35]). It is well known that this can be done by solving a linear program, whose dual LP provides us with a potential vector y∈ℝ^V such that
holds for all states v∈V. Let us next fix \(\bar{\alpha}\) and compute similarly the best response of Black, obtaining analogously a potential vector z∈ℝ^V satisfying
for all states v∈V. Since adding a constant to a potential vector changes neither the potential transformation nor the value matrices, we can assume w.l.o.g. that
Let us define a matrix-valued mapping B^v(d) for all states v∈V and vectors d∈ℝ^V by
Then by (65) we have B^v(z)≤B^v(y) (componentwise), and since the value function of matrix games is monotone, we can conclude by (63) and (64), and by the fact that shifting the payoff matrix by a constant shifts the value of the game by the same constant, that
for all states v∈V. Note that if g−z≤g−d≤g−y, then B^v(z)≤B^v(d)≤B^v(y) for all v∈V, and hence, by the above-cited properties of the value function and by (66),
Since the mapping F:g−d↦Val(B(d)) (where Val(B(d))=(Val(B^v(d)) : v∈V)) is Lipschitz continuous, and since by (67) and (66) it maps the compact box [g−z,g−y] into itself, we can conclude by Brouwer's theorem that F has a fixed point, that is, there exists a potential vector x∈[y,z] (i.e., g−x∈[g−z,g−y]) for which g−x=F(g−x)=Val(B(x)). This implies g=Val(A(x)), completing our proof. □
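The proof uses only two properties of the matrix game value Val(B^v(d)): monotonicity in the payoff entries and the constant-shift rule. For concreteness, here is a minimal sketch of computing the value of a matrix game by the classical LP reduction; this is a standard construction, not code from the paper, and it assumes scipy is available:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(B):
    """Value of the zero-sum matrix game with payoff matrix B (row maximizer).

    Classical reduction: shift B so every entry is positive; the value of
    the shifted game equals 1 / min{sum(y) : B^T y >= 1, y >= 0}.
    """
    B = np.asarray(B, dtype=float)
    shift = 1.0 - B.min()            # make every entry >= 1 > 0
    Bp = B + shift
    m, n = Bp.shape
    res = linprog(c=np.ones(m), A_ub=-Bp.T, b_ub=-np.ones(n),
                  bounds=[(0, None)] * m, method="highs")
    return 1.0 / res.x.sum() - shift

# Matching pennies has value 0, and the constant-shift rule used in the
# proof is visible: adding 1 to every entry moves the value from 0 to 1.
assert abs(matrix_game_value([[1, -1], [-1, 1]])) < 1e-6
assert abs(matrix_game_value([[2, 0], [0, 2]]) - 1.0) < 1e-6
```

Monotonicity, the other property the proof relies on, can be checked the same way: increasing any entry of B never decreases `matrix_game_value(B)`.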
Appendix C: Proof of the Implication \(\mathrm{(B2)}\Rightarrow\mathrm{(A1)}\)
Let g,x,y∈ℝ^V be vectors satisfying condition (B2). Then there exist strategies \(\bar{\alpha}\) and \(\bar{\beta}\) such that, for all states v∈V, the following hold: (1) \(\bar{\alpha}^{v} G^{v}(g)\beta\ge g^{v}\) and \(\bar{\alpha}^{v} A^{v}(x)\beta \ge g^{v}\) for all β∈Δ(L^v), and (2) \(\alpha^{v} G^{v}(g)\bar{\beta}^{v}\le g^{v}\) and \(\alpha^{v} A^{v}(y)\bar{\beta}^{v}\le g^{v}\) for all α∈Δ(K^v).
Fix a starting position v_0=w. It is enough to show that White can guarantee at least g^w while Black can guarantee at most g^w. We only show the former statement, since the latter can be shown similarly. At time i, we will let White play his/her locally optimal (stationary) strategy \(\bar{\alpha}^{v}\) whenever (s)he is at position v_i=v, while Black chooses an arbitrary, not necessarily stationary, strategy , where is the history of the play leading to v_i=v and is the set of all such histories. Let us note that and denote by \(\beta^{v,i}\in\Delta(L^{v})\) the Markovian strategy given by .
Consider a play w=v_0,v_1,v_2,… (where each v_i is a random variable). By (7) and the fact that potential transformations do not change the Cesàro sum (Sect. 2.4), it is enough to show that \(\mathbb {E}[b_{i}(x)]\ge g^{w}\) for all i. Note that
We prove by induction on i=0,1,2,… that \(\sum_{v} g^{v}\cdot\Pr[v_{i}=v]\ge g^{w}\), which will imply the lemma by (68). Indeed, the statement is trivially true for i=0. For any i, we have
and the latter is at least g^w by the induction hypothesis. □
Appendix D: Some Examples
Example 2
Vrieze [46, Chap. 8] gave an example (see Fig. 2) of a stochastic game that has values and uniformly optimal stationary strategies but no canonical form; we can see that condition (A2) is violated. In this game we have V={1,2,3}; states 2 and 3 are absorbing with |K^2|=|K^3|=|L^2|=|L^3|=1, while in state 1 we have |K^1|=|L^1|=3. The reward matrix of state 1 is shown in Fig. 2, together with the transition probabilities, which are all zero or one.
This game has values g=(0,−1,1) and unique uniformly optimal stationary strategies, namely \(\alpha^{1}=(\frac{1}{2},\frac{1}{2},0)\) and \(\beta^{1}=(\frac{1}{2},\frac{1}{2},0)\), together with the trivial strategies in states 2 and 3. We have
For a potential vector x∈ℝ^V we can assume w.l.o.g. that x_1=0, and thus we have
Here \(\bar {K}^{1}=\{(\frac{1}{2},\frac{1}{2},0),(1,0,0)\}\), \(\bar {L}^{1}=\{(\frac{1}{2},\frac{1}{2},0),(0,1,0)\}\), and only the first vectors are optimal in the matrix game with payoffs A^1(x) (for any potential transformation); thus α^1 and β^1 given above are the unique optimal strategies. A canonical form with some potential vector x∈ℝ^V (x_1=0) would require the inequalities α^1 A^1(x)≥0 and A^1(x)β^1≤0, implying \(-1-\frac{x_{2}+x_{3}}{2}\geq0\) and \(1-\frac{x_{2}+x_{3}}{2}\leq0\), a contradiction. Consequently, this example has no canonical form.
Example 3
Raghavan, Tijs, and Vrieze [40] gave an example (see Fig. 3) of an AT-game with rational input data in which the optimal values and strategies are irrational. This example is ergodic, with states V={1,2} and values \(g^{1}=g^{2}=-(6-\sqrt{30})^{2}\). The vector \(x=(0,22-4\sqrt{30})\) is a potential transformation providing the canonical form for this example. We have K^1=K^2=L^1=L^2={1,2}, and the strategies \(\alpha ^{1}=\beta^{1}=(\frac{-4+\sqrt{30}}{2},\frac{6-\sqrt{30}}{2})\) and \(\alpha^{2}=\beta^{2}=(\frac{-9+2\sqrt{30}}{3},\frac{12-2\sqrt {30}}{3})\) are the uniformly optimal stationary strategies.
Appendix E: Summary of Implications
Figure 4 summarizes the implications between game properties considered in the paper.
Cite this article
Boros, E., Elbassioni, K., Gurvich, V. et al. On Canonical Forms for Zero-Sum Stochastic Mean Payoff Games. Dyn Games Appl 3, 128–161 (2013). https://doi.org/10.1007/s13235-013-0075-x