Near-optimal no-regret algorithms for zero-sum games
Introduction
Von Neumann's minimax theorem (von Neumann, 1928) lies at the origins of the fields of both algorithms and game theory. Indeed, it was the first example of a static game-theoretic solution concept: If the players of a zero-sum game arrive at a min-max pair of strategies, then no player can improve his payoff by unilaterally deviating, resulting in an equilibrium state of the game. The min-max equilibrium played a central role in von Neumann and Morgenstern's foundations of Game Theory (von Neumann and Morgenstern, 1944), and inspired the discovery of the Nash equilibrium (Nash, 1951) and the foundations of modern economic thought (Myerson, 1999).
At the same time, the minimax theorem is tightly connected to the development of mathematical programming, as linear programming itself reduces to the computation of a min-max equilibrium, while strong linear programming duality is equivalent to the minimax theorem. Given the further developments in linear programming in the past century (Karmarkar, 1984, Khachiyan, 1979), we now have efficient algorithms for computing equilibria in zero-sum games, even in very large games such as poker (Gilpin et al., 2008, Gilpin et al., 2007).
On the other hand, the min-max equilibrium is a static notion of stability, leaving open the possibility that there are no simple distributed dynamics via which stability comes about. This turns out not to be the case, as many distributed protocols for this purpose have been discovered. One of the first protocols suggested for this purpose is fictitious play, whereby the players take turns playing the pure strategy that optimizes their payoff against the historical play of their opponent (viewed as a distribution over strategies). This simple scheme, suggested by Brown in 1949 (Brown, 1951), was shown to converge to the min-max value of the game by Robinson (1951). However, its convergence rate has recently been shown to be exponentially slow in the number of strategies (Brandt et al., 2010). Such poor convergence guarantees do not offer much by way of justifying the plausibility of the min-max equilibrium in a distributed setting, making the following questions rather important: Are there efficient and natural distributed dynamics converging to min-max equilibrium/value? And what is the optimal rate of convergence?
The answer to the first question is, by now, very well understood. A typical source of efficient dynamics converging to min-max equilibria is online optimization. The results here are very general: If both players of a game use a no-regret learning algorithm to adapt their strategies to their opponent's strategies, then the average payoffs of the players converge to their min-max value, and their average strategies constitute an approximate min-max equilibrium, with the approximation converging to 0 (Cesa-Bianchi and Lugosi, 2006). In particular, if a no-regret learning algorithm guarantees average external regret r(T, n, u), as a function of the number T of rounds, the number n of "experts," and the magnitude u of the maximum in absolute value payoff of an expert at each round, we can readily use this algorithm in a game setting to approximate the min-max value of the game to within an additive O(r(T, n, u)) in T rounds, where u is now the magnitude of the maximum in absolute value payoff in the game, and n an upper bound on the number of the players' strategies.
For instance, if we use the multiplicative weights update algorithm (Freund and Schapire, 1999, Littlestone and Warmuth, 1994), we would achieve an O(u·√(log n / T)) approximation to the value of the game in T rounds. Given that the dependence of Θ(√(log n / T)) on the number n of experts and the number T of rounds is optimal for the regret bound of any no-regret learning algorithm (Cesa-Bianchi and Lugosi, 2006), the convergence rate to the value of the game achieved by the multiplicative weights update algorithm is the optimal rate that can be achieved by a black-box reduction of a regret bound to a convergence rate in a zero-sum game.
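To make the black-box reduction concrete, here is a minimal sketch (not the paper's algorithm): both players of a small zero-sum game run multiplicative weights updates against each other, and the row player's average payoff approaches the value of the game. The game matrix, step size, and function name are illustrative choices of ours.

```python
import numpy as np

def mwu_selfplay(A, T, eta):
    """Both players of the zero-sum game (A, -A) run multiplicative
    weights updates; returns the row player's average payoff per round."""
    n, m = A.shape
    wx, wy = np.ones(n), np.ones(m)
    avg = 0.0
    for t in range(1, T + 1):
        x, y = wx / wx.sum(), wy / wy.sum()
        avg += (x @ A @ y - avg) / t       # running average of payoffs
        wx *= np.exp(eta * (A @ y))        # row player maximizes x^T A y
        wy *= np.exp(-eta * (A.T @ x))     # column player minimizes it
    return avg

A = np.array([[0.7, 0.2], [0.3, 0.6]])     # a 2x2 game with value 0.45
for T in (100, 10000):
    print(T, abs(mwu_selfplay(A, T, eta=np.sqrt(np.log(2) / T)) - 0.45))
```

The step size eta = √(log n / T) is the standard tuning that yields the O(√(log n / T)) average-regret guarantee, and hence the same order of convergence to the value.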
Nevertheless, a black-box reduction from the learning-with-expert-advice setting to the game-theoretic setting may be lossy in terms of approximation. Indeed, no-regret bounds apply even when playing against an adversary; it may be that, when two players of a zero-sum game update their strategies following a no-regret learning algorithm, faster convergence to the min-max value of the game is possible. As concrete evidence of this possibility, take fictitious play (a.k.a. the "follow-the-leader" algorithm in online optimization): against an adversary, it can fail to converge to zero average regret; but if both players of a zero-sum game use fictitious play, their average payoffs do converge to the min-max value of the game, by Robinson's proof.
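A minimal sketch of fictitious play in self-play, on the same illustrative 2×2 game with value 0.45 (the helper name and tie-breaking are our own choices; Robinson's theorem guarantees convergence regardless of how ties are broken):

```python
import numpy as np

def fictitious_play(A, T):
    """Discrete fictitious play in the zero-sum game (A, -A): each round,
    both players play a pure best response to the opponent's empirical
    distribution of past play; returns x_bar^T A y_bar for the averages."""
    n, m = A.shape
    cx, cy = np.zeros(n), np.zeros(m)      # play counts for each action
    cx[0] += 1                             # arbitrary first-round choices
    cy[0] += 1
    for _ in range(T - 1):
        i = int(np.argmax(A @ (cy / cy.sum())))   # row best response
        j = int(np.argmin((cx / cx.sum()) @ A))   # column best response
        cx[i] += 1
        cy[j] += 1
    x_bar, y_bar = cx / T, cy / T          # historical (empirical) mixes
    return x_bar @ A @ y_bar

A = np.array([[0.7, 0.2], [0.3, 0.6]])     # a 2x2 game with value 0.45
print(abs(fictitious_play(A, 5000) - 0.45))
```

Against an adversary the same update has linear worst-case regret, which is exactly the gap between adversarial regret bounds and self-play convergence that motivates this paper.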
Motivated by this observation, we investigate the following: Is there a no-regret learning algorithm that, when used by both players of a zero-sum game, converges to the min-max value of the game at a rate faster than O(√(log n / T)) in the number T of rounds? We answer this question in the affirmative, by providing a no-regret learning algorithm, called NoRegretEgt, with asymptotically optimal regret behavior of O(√(log n / T)), and a convergence rate of Õ(1/T) to the min-max value of a game, where n is an upper bound on the number of the players' strategies and Õ(·) hides factors polylogarithmic in n and T. In particular,
Theorem 1 Let x_1, x_2, … be a sequence of randomized strategies over a set of n experts produced by the NoRegretEgt algorithm under a sequence of payoff vectors ℓ_1, ℓ_2, … ∈ [−u, u]^n observed for these experts, where ℓ_t is observed after x_t is chosen. Then for all T: (1/T)·Σ_{t=1}^T x_t·ℓ_t ≥ max_i (1/T)·Σ_{t=1}^T e_i·ℓ_t − O(u·√(log n / T)), where e_i is the i-th unit basis vector. Moreover, let x_1, x_2, … be a sequence of randomized strategies over m_r strategies and y_1, y_2, … a sequence of randomized strategies over m_c strategies, and suppose that these sequences are produced when both players of a zero-sum game (A, −A), A ∈ [−u, u]^{m_r × m_c}, use the NoRegretEgt algorithm to update their strategies under observation of the sequences of payoff vectors (A·y_t)_t and (−A^T·x_t)_t, respectively. Then for all T: |(1/T)·Σ_{t=1}^T x_t^T·A·y_t − v| ≤ Õ(u/T), where v is the row player's value in the game, n = max(m_r, m_c), and Õ(·) hides factors polylogarithmic in n and T. Moreover, for all T, the pair ((1/T)·Σ_{t=1}^T x_t, (1/T)·Σ_{t=1}^T y_t) is an (additive) Õ(u/T)-approximate min-max equilibrium of the game.
In addition, our algorithm provides the first (to the best of our knowledge) example of a strongly-uncoupled distributed protocol converging to the value of a zero-sum game at a rate faster than O(√(log n / T)). Strong-uncoupledness is the property of a distributed game-playing protocol under which the players can observe the payoff vectors of their own strategies at every round (A·y_t and −A^T·x_t for the row and column players, respectively, where A is the row player's payoff matrix), but:
- they do not know the payoff tables of the game, or even the number of strategies available to the other player;
- they can only use private storage to keep track of a constant number of observed payoff vectors (or cumulative payoff vectors), a constant number of mixed strategies (or possibly cumulative information thereof), and a constant number of state variables, such as the round number.
And what could be the optimal convergence rate of distributed protocols for zero-sum games? We show that, insofar as convergence of the average payoffs of the players to their values in the game is concerned, the convergence rate achieved by our protocol is essentially optimal. Namely, we show the following:
Theorem 2 Assuming that the players of a zero-sum game do not know their payoff matrices at the beginning of time, any distributed protocol producing sequences of strategies x_1, x_2, … and y_1, y_2, … such that the average payoffs of the players, (1/T)·Σ_{t=1}^T x_t^T·A·y_t and −(1/T)·Σ_{t=1}^T x_t^T·A·y_t, converge to their corresponding values in the game, cannot do so at a convergence rate faster than an additive Ω(1/T) in the number T of rounds of the protocol. The same is true of any distributed protocol whose average strategies converge to a min-max equilibrium.
Our no-regret learning algorithm provides, to the best of our knowledge, the first example of a strongly-uncoupled distributed protocol converging to the min-max equilibrium of a zero-sum game at a rate faster than O(√(log n / T)), and in fact at a nearly-optimal rate. The strong-uncoupledness arguably adds to the naturalness of our protocol, since no funny bit arithmetic, private computation of the min-max equilibrium, or anything of a similar flavor is allowed. Moreover, the strategies that the players use over the course of the dynamics are fairly natural, in that they constitute smoothened best responses to their opponent's previous strategies. Nevertheless, there is a certain degree of careful choreography and interleaving of these strategies, making our protocol less simple than, say, the multiplicative weights update algorithm. So we view our contribution mostly as an existence proof, leaving the following as an interesting future research direction: Is there a simple variant of the multiplicative weights update method or Zinkevich's algorithm (Zinkevich, 2003) which, when used by the players of a zero-sum game, converges to the min-max equilibrium of the game at the near-optimal rate of Õ(1/T)? Another direction worth exploring is to shift away from our model, which allows players to play mixed strategies x_t and y_t and observe whole payoff vectors A·y_t and −A^T·x_t in every round, and prove analogous results for the more restrictive multi-armed bandit setting, which only allows players to play pure strategies and observe realized payoffs in every round. Finally, it would be interesting to prove formal lower bounds on the convergence rate of standard learning algorithms, such as the multiplicative weights update method, when both players use the same algorithm.
In Section 2 we provide more detail on the settings of online learning from expert advice and uncoupled dynamics in games, and proceed to an outline of our approach. Sections 3 (Nesterov's minimization scheme), 4 (Honest game dynamics), and 5 (No-regret game dynamics) present the high-level proof of Theorem 1, while Sections 6 (Detailed description of Nesterov's EGT algorithm), 7 (The Honest EGT Dynamics protocol), 8 (The BoundedEgtDynamics(b) protocol), and 9 (The NoRegretEgt protocol) present the technical details of the proof. Finally, Section 10 presents the proof of Theorem 2.
Section snippets
Learning from expert advice
In the setting of learning from expert advice, a learner has a set of n experts to choose from at each round t = 1, 2, …. After committing to a distribution x_t over the experts, a payoff vector ℓ_t ∈ [−u, u]^n is revealed to the learner, specifying the payoff achieved by each expert at round t. He can then update his distribution over the experts for the next round, and so forth. The goal of the learner is to minimize his average (external) regret.
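The experts setting can be sketched as follows, with multiplicative weights as the learner (an illustrative choice of ours; the payoff sequence here is synthetic, with payoffs in [0, 1]):

```python
import numpy as np

def mwu_regret(payoffs, eta):
    """Run multiplicative weights over a T x n array of expert payoffs in
    [0, 1]; return the learner's average external regret after T rounds."""
    T, n = payoffs.shape
    w = np.ones(n)
    earned = 0.0
    for l in payoffs:                  # l_t is revealed after x_t is chosen
        x = w / w.sum()
        earned += x @ l
        w *= np.exp(eta * l)
    best = payoffs.sum(axis=0).max()   # best single expert in hindsight
    return (best - earned) / T

rng = np.random.default_rng(0)
T, n = 10000, 16
payoffs = rng.random((T, n))           # synthetic payoff sequence
print(mwu_regret(payoffs, eta=np.sqrt(np.log(n) / T)))
```

With the tuning eta = √(log n / T), the average external regret is O(√(log n / T)) no matter how the payoff sequence is generated, which is the adversarial guarantee discussed in the introduction.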
Nesterov's minimization scheme
In this section, we introduce Nesterov's Excessive Gap Technique (EGT) algorithm and state the necessary convergence result. The EGT algorithm is a gradient-descent approach for approximating the minimum of a convex function. In this paper, we apply the EGT algorithm to appropriate best-response functions of a zero-sum game. For a more detailed description of this algorithm, see Section 6. Let us define the functions f and φ by f(y) = max_{x ∈ Δ_{m_r}} x^T·A·y and φ(x) = min_{y ∈ Δ_{m_c}} x^T·A·y, where Δ_{m_r} and Δ_{m_c} denote the players' strategy simplices. By the minimax theorem, min_y f(y) = max_x φ(x) = v.
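A minimal numerical sketch of such best-response functions, under our own notation and sign conventions (the game and equilibrium strategies below are an illustrative example, not from the paper):

```python
import numpy as np

# Best-response functions for the zero-sum game (A, -A):
#   f(y)   = max_x x^T A y = max entry of A y     (value of row's best reply)
#   phi(x) = min_y x^T A y = min entry of x^T A   (value of column's best reply)
# By the minimax theorem, min_y f(y) = max_x phi(x) = v, and the duality
# gap f(y) - phi(x) >= 0 measures how far the pair (x, y) is from equilibrium.

def f(A, y):
    return (A @ y).max()

def phi(A, x):
    return (x @ A).min()

A = np.array([[0.7, 0.2], [0.3, 0.6]])     # a 2x2 game with value 0.45
x_eq = np.array([0.375, 0.625])            # row equilibrium strategy
y_eq = np.array([0.5, 0.5])                # column equilibrium strategy
print(f(A, y_eq) - phi(A, x_eq))           # duality gap is 0 at equilibrium
```

Driving this gap below ε is exactly what makes a pair of strategies an ε-approximate min-max equilibrium, which is the quantity the EGT algorithm controls.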
Honest game dynamics
In this section we use game dynamics to simulate the EGT algorithm, by “decoupling” the operations of the algorithm, obtaining the HonestEgtDynamics protocol. Basically, the players help each other perform computations necessary in the EGT algorithm by playing appropriate strategies at appropriate times. In this section, we assume that both players are “honest,” meaning that they do not deviate from their prescribed protocols.
We recall that when the row and column players play mixed strategies x and y, respectively, the row player receives expected payoff x^T·A·y and the column player receives −x^T·A·y.
No-regret game dynamics
We use the HonestEgtDynamics protocol as a starting block to design a no-regret protocol.
Detailed description of Nesterov's EGT algorithm
In this section, we explain the ideas behind the Excessive Gap Technique (EGT) algorithm and we show how this algorithm can be used to compute approximate Nash equilibria in two-player zero-sum games. Before we discuss the algorithm itself, we introduce some necessary background terminology.
The Honest EGT Dynamics protocol
In this section, we present the entirety of the HonestEgtDynamics protocol, introduced in Section 4, and compute convergence bounds for the average payoffs. Note that throughout the paper, we present the HonestEgtDynamics protocol, and the protocols that follow, as a single block of pseudocode containing instructions for both the row and column players. However, this presentation is purely for notational convenience, and our pseudocode can clearly be rewritten as a protocol for the row player and a separate protocol for the column player.
The BoundedEgtDynamics(b) protocol
In this section, we describe and analyze the BoundedEgtDynamics protocol in detail. For clarity, we break the algorithm apart into subroutines. The overall structure is very similar to the HonestEgtDynamics protocol, but the players continually check for evidence that the opponent might have deviated from his instructions. We emphasize that if a YIELD failure occurs during an honest execution of BoundedEgtDynamics, both players detect the YIELD failure in the same step.
The NoRegretEgt protocol
Our final NoRegretEgt protocol is presented as Algorithm 4. Note that the state variables for MWU are kept completely separate from the state variables used by BoundedEgtDynamics. Whenever instructed to run additional rounds of the MWU algorithm, the players work with these MWU-only state variables. We proceed to analyze the protocol's performance, establishing Theorems 9 and 10.
Lower bounds on optimal convergence rate
In this section, we prove Theorem 2. The main idea is that, since the players do not know the payoff matrix A of the zero-sum game, it is unlikely that their historical average strategies will converge to a Nash equilibrium very fast. In particular, the players are unlikely to play a Nash equilibrium in the first round, and the error from that round can only be eliminated at a rate of 1/T, forcing the convergence rate of the average payoffs and average strategies to the min-max value and equilibrium, respectively, to be Ω(1/T).
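The averaging bottleneck behind this argument can be seen with toy numbers: even if the players happened to play an exact equilibrium in every round after the first, the constant error contributed by round 1 is removed by the running average only at rate 1/T. (The value v = 0.45 and the payoff sequence are hypothetical.)

```python
# Round 1 pays 1.0 instead of the value v; every later round is exact.
v = 0.45
payoffs = [1.0] + [v] * 9999
for T in (10, 100, 1000, 10000):
    avg = sum(payoffs[:T]) / T
    print(T, avg - v)            # error equals (1.0 - v) / T
```

The proof of Theorem 2 turns this observation into a lower bound by showing the players cannot avoid such a constant first-round error without knowing A.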
References

- Y. Freund, R.E. Schapire, Adaptive game playing using multiplicative weights, Games Econ. Behav. (1999)
- C. Harris, On the rate of convergence of continuous-time fictitious play, Games Econ. Behav. (1998)
- N. Littlestone, M.K. Warmuth, The weighted majority algorithm, Inf. Comput. (1994)
- I. Adler, The equivalence of linear programs and zero-sum games, Int. J. Game Theory (2013)
- Y. Babichenko, Completely Uncoupled Dynamics and Nash Equilibria (2010)
- A. Blum, Y. Mansour, From external to internal regret, J. Mach. Learn. Res. (2007)
- F. Brandt, F. Fischer, P. Harrenstein, On the rate of convergence of fictitious play (2010)
- G.W. Brown, Iterative solution of games by fictitious play, Act. Anal. Product. Alloc. (1951)
- N. Cesa-Bianchi, G. Lugosi, Prediction, Learning, and Games (2006)
- D.P. Foster, R.V. Vohra, Calibrated learning and correlated equilibrium, Biometrika (1998)
- D.P. Foster, H.P. Young, Regret testing: learning to play Nash equilibrium without knowing you have an opponent, Theoretical Econ. (2006)
- A. Gilpin, S. Hoda, J. Peña, T. Sandholm, Gradient-based algorithms for finding Nash equilibria in extensive form games (2007)
- S. Hoda, A. Gilpin, J. Peña, T. Sandholm, Smoothing techniques for computing Nash equilibria of sequential games, Math. Operations Res. (2010)
1. Supported by a Sloan Foundation Fellowship, a Microsoft Research Fellowship, and NSF Awards CCF-0953960 (CAREER) and CCF-1101491.
2. Supported by the Fannie and John Hertz Foundation, Daniel Stroock Fellowship.
3. Work done while the author was a student at MIT. Supported in part by an NSF Graduate Research Fellowship.