
Games and Economic Behavior

Volume 92, July 2015, Pages 327-348

Near-optimal no-regret algorithms for zero-sum games

https://doi.org/10.1016/j.geb.2014.01.003

Abstract

We propose a new no-regret learning algorithm. When used against an adversary, our algorithm achieves average regret that scales optimally as $O\!\left(\frac{1}{\sqrt{T}}\right)$ with the number $T$ of rounds. However, when our algorithm is used by both players of a zero-sum game, their average regret scales as $O\!\left(\frac{\ln T}{T}\right)$, guaranteeing a near-linear rate of convergence to the value of the game. This represents an almost-quadratic improvement on the rate of convergence to the value of a zero-sum game known to be achievable by any no-regret learning algorithm. Moreover, it is essentially optimal as we also show a lower bound of $\Omega\!\left(\frac{1}{T}\right)$ for all distributed dynamics, as long as the players do not know their payoff matrices in the beginning of the dynamics. (If they did, they could privately compute minimax strategies and play them ad infinitum.)

Introduction

Von Neumann's minimax theorem (von Neumann, 1928) lies at the origins of the fields of both algorithms and game theory. Indeed, it was the first example of a static game-theoretic solution concept: If the players of a zero-sum game arrive at a min-max pair of strategies, then no player can improve his payoff by unilaterally deviating, resulting in an equilibrium state of the game. The min-max equilibrium played a central role in von Neumann and Morgenstern's foundations of Game Theory (von Neumann and Morgenstern, 1944), and inspired the discovery of the Nash equilibrium (Nash, 1951) and the foundations of modern economic thought (Myerson, 1999).

At the same time, the minimax theorem is tightly connected to the development of mathematical programming, as linear programming itself reduces to the computation of a min-max equilibrium, while strong linear programming duality is equivalent to the minimax theorem.4 Given the further developments in linear programming in the past century (Karmarkar, 1984, Khachiyan, 1979), we now have efficient algorithms for computing equilibria in zero-sum games, even in very large ones such as poker (Gilpin et al., 2008, Gilpin et al., 2007).

On the other hand, the min-max equilibrium is a static notion of stability, leaving open the possibility that there are no simple distributed dynamics via which stability comes about. This turns out not to be the case, as many distributed protocols for this purpose have been discovered. One of the first protocols suggested for this purpose is fictitious play, whereby the players take turns, each playing the pure strategy that optimizes his payoff against the historical play of his opponent (viewed as a distribution over strategies). This simple scheme, suggested by Brown in 1949 (Brown, 1951), was shown to converge to the min-max value of the game by Robinson (1951). However, its convergence rate has recently been shown to be exponentially slow in the number of strategies (Brandt et al., 2010).5 Such poor convergence guarantees do not offer much by way of justifying the plausibility of the min-max equilibrium in a distributed setting, making the following questions rather important: Are there efficient and natural distributed dynamics converging to the min-max equilibrium/value? And what is the optimal rate of convergence?

The answer to the first question is, by now, very well understood. A typical source of efficient dynamics converging to min-max equilibria is online optimization. The results here are very general: If both players of a game use a no-regret learning algorithm to adapt their strategies to their opponent's strategies, then the average payoffs of the players converge to their min-max value, and their average strategies constitute an approximate min-max equilibrium, with the approximation converging to 0 (Cesa-Bianchi and Lugosi, 2006). In particular, if a no-regret learning algorithm guarantees average external regret $g(T,n,u)$, as a function of the number $T$ of rounds, the number $n$ of “experts,” and the magnitude $u$ of the maximum in absolute value payoff of an expert at each round, we can readily use this algorithm in a game setting to approximate the min-max value of the game to within an additive $O(g(T,n,u))$ in $T$ rounds, where $u$ is now the magnitude of the maximum in absolute value payoff in the game, and $n$ an upper bound on the number of strategies available to either player.

For instance, if we use the multiplicative weights update algorithm (Freund and Schapire, 1999, Littlestone and Warmuth, 1994), we would achieve approximation $O\!\left(u\sqrt{\frac{\log n}{T}}\right)$ to the value of the game in $T$ rounds. Given that the dependence $O\!\left(\sqrt{\frac{\log n}{T}}\right)$ on the number $n$ of experts and the number $T$ of rounds is optimal for the regret bound of any no-regret learning algorithm (Cesa-Bianchi and Lugosi, 2006), the convergence rate to the value of the game achieved by the multiplicative weights update algorithm is the optimal rate that can be achieved by a black-box reduction of a regret bound to a convergence rate in a zero-sum game.
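To illustrate the black-box reduction just described, the following is a minimal sketch (our own illustration in Python/NumPy, not code from the paper) of both players running the multiplicative weights update algorithm against each other in the game $(-A, A)$; the function name and the fixed-horizon step size are assumptions made for the example.

```python
import numpy as np

def mwu_self_play(A, T, u=1.0):
    """Both players of the zero-sum game (-A, A) run multiplicative weights.

    A is the n x m matrix with entries in [-u, u]; the row player receives -A.
    Returns the average strategies (x_bar, y_bar), which approximate a min-max
    equilibrium to within an additive O(u * sqrt(log(max(n, m)) / T)).
    """
    n, m = A.shape
    eta = np.sqrt(np.log(max(n, m)) / T) / u      # fixed-horizon step size
    wx, wy = np.ones(n), np.ones(m)               # expert weights
    x_sum, y_sum = np.zeros(n), np.zeros(m)

    for _ in range(T):
        x = wx / wx.sum()                         # row player's mixed strategy
        y = wy / wy.sum()                         # column player's mixed strategy
        x_sum += x
        y_sum += y
        # Each player observes only the payoff vector of his own strategies.
        wx *= np.exp(eta * (-A @ y))              # row player's update
        wy *= np.exp(eta * (A.T @ x))             # column player's update

    return x_sum / T, y_sum / T

# Example usage: matching pennies; both averages approach (0.5, 0.5)
# and the row player's average payoff approaches the value 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x_bar, y_bar = mwu_self_play(A, T=10000)
print(x_bar, y_bar, x_bar @ (-A) @ y_bar)
```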

Nevertheless, a black-box reduction from the learning-with-expert-advice setting to the game-theoretic setting may be lossy in terms of approximation. Indeed, no-regret bounds apply even when playing against an adversary; it may be that, when two players of a zero-sum game update their strategies following a no-regret learning algorithm, faster convergence to the min-max value of the game is possible. As concrete evidence of this possibility, take fictitious play (a.k.a. the “follow-the-leader” algorithm in online optimization): against an adversary, it may fail to converge to zero average regret; but if both players of a zero-sum game use fictitious play, their average payoffs do converge to the min-max value of the game, by Robinson's proof.
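For concreteness, here is a minimal sketch (our own illustration, and the simultaneous-update variant rather than the alternating one analyzed by Robinson) of fictitious play in the game $(-A, A)$; the function name and the arbitrary first moves are assumptions made for the example.

```python
import numpy as np

def fictitious_play(A, T):
    """Fictitious play in the zero-sum game (-A, A): in every round each player
    plays a pure best response to the empirical mixture of the opponent's past
    pure strategies. Returns the empirical (historical average) strategies."""
    n, m = A.shape
    row_counts, col_counts = np.zeros(n), np.zeros(m)
    row_counts[0] += 1                      # arbitrary first moves
    col_counts[0] += 1
    for _ in range(T - 1):
        x_hist = row_counts / row_counts.sum()
        y_hist = col_counts / col_counts.sum()
        i = np.argmax(-A @ y_hist)          # row player's best response to history
        j = np.argmax(A.T @ x_hist)         # column player's best response to history
        row_counts[i] += 1
        col_counts[j] += 1
    return row_counts / T, col_counts / T
```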

Motivated by this observation, we investigate the following: Is there a no-regret learning algorithm that, when used by both players of a zero-sum game, converges to the min-max value of the game at a rate faster than $O\!\left(\frac{1}{\sqrt{T}}\right)$ with the number $T$ of rounds? We answer this question in the affirmative, by providing a no-regret learning algorithm, called NoRegretEgt, with asymptotically optimal regret behavior of $O\!\left(u\sqrt{\frac{\log n}{T}}\right)$, and convergence rate of $O\!\left(\frac{u\log n\,(\log T + (\log n)^{3/2})}{T}\right)$ to the min-max value of a game, where $n$ is an upper bound on the number of the players' strategies. In particular,

Theorem 1

Let $x_1, x_2, \ldots, x_t, \ldots$ be a sequence of randomized strategies over a set of experts $[n] := \{1, 2, \ldots, n\}$ produced by the NoRegretEgt algorithm under a sequence of payoff vectors $\ell_1, \ell_2, \ldots, \ell_t, \ldots \in [-u, u]^n$ observed for these experts, where $\ell_t$ is observed after $x_t$ is chosen. Then for all $T$:
\[
\frac{1}{T}\sum_{t=1}^{T} (x_t)^{\mathsf T}\ell_t \;\ge\; \max_{i \in [n]} \frac{1}{T}\sum_{t=1}^{T} (e_i)^{\mathsf T}\ell_t \;-\; O\!\left(u\sqrt{\frac{\log n}{T}}\right),
\]
where $e_i$ is the $i$-th unit basis vector.

Moreover, let $x_1, x_2, \ldots, x_t, \ldots$ be a sequence of randomized strategies over $[n]$ and $y_1, y_2, \ldots, y_t, \ldots$ a sequence of randomized strategies over $[m]$, and suppose that these sequences are produced when both players of a zero-sum game $(-A, A)$, $A \in [-u, u]^{n \times m}$, use the NoRegretEgt algorithm to update their strategies under observation of the sequences of payoff vectors $(-A y_t)_t$ and $(A^{\mathsf T} x_t)_t$, respectively. Then for all $T$:
\[
\left|\frac{1}{T}\sum_{t=1}^{T} (x_t)^{\mathsf T}(-A)\,y_t \;-\; v\right| \;\le\; O\!\left(\frac{u \log k\,(\log T + (\log k)^{3/2})}{T}\right),
\]
where $v$ is the row player's value in the game and $k = \max\{m, n\}$. Moreover, for all $T$, the pair $\left(\frac{1}{T}\sum_{t=1}^{T} x_t,\; \frac{1}{T}\sum_{t=1}^{T} y_t\right)$ is an (additive) $O\!\left(\frac{u \log k\,(\log T + (\log k)^{3/2})}{T}\right)$-approximate min-max equilibrium of the game.

In addition, our algorithm provides the first (to the best of our knowledge) example of a strongly-uncoupled distributed protocol converging to the value of a zero-sum game at a rate faster than $O\!\left(\frac{1}{\sqrt{T}}\right)$. Strong-uncoupledness is the property of a distributed game-playing protocol under which the players can observe the payoff vectors of their own strategies at every round ($(-A y_t)_t$ and $(A^{\mathsf T} x_t)_t$ for the row and column players, respectively), but (see the interface sketch after the list below):

  • they do not know the payoff tables of the game, or even the number of strategies available to the other player;6

  • they can only use private storage to keep track of a constant number of observed payoff vectors (or cumulative payoff vectors), a constant number of mixed strategies (or possibly cumulative information thereof), and a constant number of state variables such as the round number.
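The following sketch makes this interaction model concrete. The player objects and their next_strategy / observe_payoffs methods are hypothetical names introduced only for illustration; the point is that the driver reveals to each player nothing beyond the payoff vector of his own strategies in each round.

```python
import numpy as np

def play_uncoupled(row_player, col_player, A, T):
    """Drive T rounds of strongly-uncoupled play of the zero-sum game (-A, A).

    Each player object is assumed to expose (hypothetical API):
      next_strategy()      -> its mixed strategy for the current round
      observe_payoffs(vec) -> receives only the payoff vector of its own
                              strategies; neither player ever sees A, the
                              opponent's strategy, or the opponent's number
                              of strategies.
    Returns the players' average strategies after T rounds.
    """
    n, m = A.shape
    x_sum, y_sum = np.zeros(n), np.zeros(m)
    for _ in range(T):
        x = row_player.next_strategy()
        y = col_player.next_strategy()
        x_sum += x
        y_sum += y
        row_player.observe_payoffs(-A @ y)     # row player's payoff vector
        col_player.observe_payoffs(A.T @ x)    # column player's payoff vector
    return x_sum / T, y_sum / T
```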

The precise details of our model and comparison to other models in the literature are given in Section 2.2. Notice that, without the assumption of strong-uncoupledness, there can be trivial solutions to the problem. Indeed, if the payoff tables of the game were known to the players in advance, they could just privately compute their min-max strategies and use these strategies ad infinitum. If the payoff tables were unknown but the type of information the players could privately store were unconstrained, they could engage in a protocol for recovering their payoff tables, followed by the computation of their min-max strategies. Even if they also didn't know each other's number of strategies, they could interleave phases in which they either recover pieces of their payoff matrices, or they compute min-max solutions of recovered square sub-matrices of the game until convergence to an exact equilibrium is detected. Arguably, such protocols are of limited interest in highly distributed game-playing settings.

And what could be the optimal convergence rate of distributed protocols for zero-sum games? We show that, insofar as convergence of the average payoffs of the players to their values in the game is concerned, the convergence rate achieved by our protocol is essentially optimal. Namely, we show the following:7

Theorem 2

Assuming that the players of a zero-sum game $(-A, A)$ do not know their payoff matrices at the beginning of time, any distributed protocol producing sequences of strategies $(x_t)_t$ and $(y_t)_t$ such that the average payoffs of the players, $\frac{1}{T}\sum_t (x_t)^{\mathsf T}(-A)\,y_t$ and $\frac{1}{T}\sum_t (x_t)^{\mathsf T} A\, y_t$, converge to their corresponding value in the game, cannot do so at a convergence rate faster than an additive $\Omega(1/T)$ in the number $T$ of rounds of the protocol. The same is true of any distributed protocol whose average strategies converge to a min-max equilibrium.

Our no-regret learning algorithm provides, to the best of our knowledge, the first example of a strongly-uncoupled distributed protocol converging to the min-max equilibrium of a zero-sum game at a rate faster than $1/\sqrt{T}$, and in fact at a nearly-optimal rate. The strong-uncoupledness arguably adds to the naturalness of our protocol, since no funny bit arithmetic, private computation of the min-max equilibrium, or anything of a similar flavor is allowed. Moreover, the strategies that the players use along the course of the dynamics are fairly natural in that they constitute smoothened best responses to their opponent's previous strategies. Nevertheless, there is a certain degree of careful choreography and interleaving of these strategies, making our protocol less simple than, say, the multiplicative weights update algorithm. So we view our contribution mostly as an existence proof, leaving the following as an interesting future research direction: Is there a simple variant of the multiplicative weights update method or Zinkevich's algorithm (Zinkevich, 2003) which, when used by the players of a zero-sum game, converges to the min-max equilibrium of the game at the optimal rate of $1/T$? Another direction worth exploring is to shift away from our model, which allows players to play mixed strategies $x_t$ and $y_t$ and observe whole payoff vectors $(-A)y_t$ and $A^{\mathsf T}x_t$ in every round, and prove analogous results for the more restrictive multi-armed bandit setting that only allows players to play pure strategies and observe realized payoffs in every round. Finally, it would be interesting to prove formal lower bounds on the convergence rate of standard learning algorithms, such as the multiplicative weights update method, when both players use the same algorithm.

In Section 2 we provide more detail on the settings of online learning from expert advice and uncoupled dynamics in games, and proceed to the outline of our approach. Sections 3 (Nesterov's minimization scheme), 4 (Honest game dynamics), and 5 (No-regret game dynamics) present the high-level proof of Theorem 1, while Sections 6 (Detailed description of Nesterov's EGT algorithm), 7 (The Honest EGT Dynamics protocol), 8 (The BoundedEgtDynamics(b) protocol), and 9 (The NoRegretEgt protocol) present the technical details of the proof. Finally, Section 10 presents the proof of Theorem 2.

Section snippets

Learning from expert advice

In the setting of learning from expert advice, a learner has a set $[n] := \{1, \ldots, n\}$ of experts to choose from at each round $t = 1, 2, \ldots$. After committing to a distribution $x_t \in \Delta_n$ over the experts,8 a vector $\ell_t \in [-u, u]^n$ is revealed to the learner with the payoff achieved by each expert at round $t$. He can then update his distribution over the experts for the next round, and so forth. The goal of the learner is to minimize his average (external) regret.
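As a concrete rendering of this definition, the sketch below (our own illustration, not code from the paper) computes the average external regret of a strategy sequence against a payoff sequence; the function name is hypothetical.

```python
import numpy as np

def average_external_regret(strategies, payoffs):
    """Average external regret of a learner over T rounds.

    strategies: T x n array; row t is the learner's distribution x_t over [n].
    payoffs:    T x n array; row t is the payoff vector ell_t in [-u, u]^n,
                revealed after x_t is chosen.
    Returns max_i (1/T) sum_t ell_t[i]  -  (1/T) sum_t <x_t, ell_t>,
    the gap to the best fixed expert in hindsight.
    """
    strategies, payoffs = np.asarray(strategies), np.asarray(payoffs)
    learner_avg = np.mean(np.sum(strategies * payoffs, axis=1))
    best_expert_avg = np.max(np.mean(payoffs, axis=0))
    return best_expert_avg - learner_avg
```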

Nesterov's minimization scheme

In this section, we introduce Nesterov's Excessive Gap Technique (EGT) algorithm and state the necessary convergence result. The EGT algorithm is a gradient-descent approach for approximating the minimum of a convex function. In this paper, we apply the EGT algorithm to appropriate best-response functions of a zero-sum game. For a more detailed description of this algorithm, see Section 6. Let us define the functions $f : \Delta_n \to \mathbb{R}$ and $\phi : \Delta_m \to \mathbb{R}$ by
\[
f(x) = \max_{v \in \Delta_m} x^{\mathsf T} A v \quad \text{and} \quad \phi(y) = \min_{u \in \Delta_n} u^{\mathsf T} A y.
\]
In the above …
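A minimal sketch (our own illustration; the function name is hypothetical) of evaluating these best-response functions and the gap $f(x) - \phi(y)$, which the EGT algorithm drives toward zero:

```python
import numpy as np

def best_response_values(A, x, y):
    """Evaluate f(x) = max_{v in Delta_m} x^T A v and
    phi(y) = min_{u in Delta_n} u^T A y for the n x m matrix A.

    Both optima are attained at vertices of the simplex, so they reduce to a
    max over the entries of A^T x and a min over the entries of A y. If
    f(x) - phi(y) <= eps, then (x, y) is an eps-approximate min-max
    equilibrium.
    """
    A = np.asarray(A)
    f_x = np.max(A.T @ x)
    phi_y = np.min(A @ y)
    return f_x, phi_y, f_x - phi_y
```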

Honest game dynamics

In this section we use game dynamics to simulate the EGT algorithm, by “decoupling” the operations of the algorithm, obtaining the HonestEgtDynamics protocol. Basically, the players help each other perform computations necessary in the EGT algorithm by playing appropriate strategies at appropriate times. In this section, we assume that both players are “honest,” meaning that they do not deviate from their prescribed protocols.

We recall that when the row and column players play $x$ and $y$ …

No-regret game dynamics

We use the HonestEgtDynamics protocol as a starting point for designing a no-regret protocol.

Detailed description of Nesterov's EGT algorithm

In this section, we explain the ideas behind the Excessive Gap Technique (EGT) algorithm and we show how this algorithm can be used to compute approximate Nash equilibria in two-player zero-sum games. Before we discuss the algorithm itself, we introduce some necessary background terminology.

The Honest EGT Dynamics protocol

In this section, we present the entirety of the HonestEgtDynamics protocol, introduced in Section 4, and compute convergence bounds for the average payoffs. Note that throughout the paper, we present the HonestEgtDynamics protocol, and the protocols which follow, as a single block of pseudocode containing instructions for both the row and column players. However, this presentation is purely for notational convenience, and our pseudocode can clearly be written as a protocol for the row player and a separate protocol for the column player.

The BoundedEgtDynamics(b) protocol

In this section, we describe and analyze the BoundedEgtDynamics protocol in detail. For clarity, we break the algorithm apart into subroutines. The overall structure is very similar to the HonestEgtDynamics protocol, but the players continually check for evidence that the opponent might have deviated from his instructions. We emphasize that if a YIELD failure occurs during an honest execution of BoundedEgtDynamics, both players detect the YIELD failure in the same step.

The NoRegretEgt protocol

Our final NoRegretEgt protocol is presented as Algorithm 4. Note that the state variables for MWU are completely separate from the BoundedEgtDynamics state variables $k$, $x_k$, and $y_k$. Whenever instructed to run additional rounds of the MWU algorithm, the players work with these MWU-only state variables. We proceed to analyze its performance, establishing Theorems 9 and 10.

Lower bounds on optimal convergence rate

In this section, we prove Theorem 2. The main idea is that since the players do not know the payoff matrix $A$ of the zero-sum game, it is unlikely that their historical average strategies will converge to a Nash equilibrium very fast. In particular, the players are unlikely to play a Nash equilibrium in the first round, and the error from that round can only be eliminated at a rate of $\Omega(1/T)$, forcing the $\Omega(1/T)$ convergence rate for the average payoffs and average strategies to the min-max equilibrium.
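To spell out the key step under a simplifying assumption (ours, for illustration): suppose the first-round strategies are off by a constant $\epsilon > 0$ in payoff, i.e. $(x_1)^{\mathsf T}(-A)\,y_1 = v - \epsilon$, while every subsequent round achieves the value exactly. Then
\[
\left|\frac{1}{T}\sum_{t=1}^{T} (x_t)^{\mathsf T}(-A)\,y_t - v\right| = \frac{1}{T}\left|(x_1)^{\mathsf T}(-A)\,y_1 - v\right| = \frac{\epsilon}{T},
\]
so the error of a single bad round is averaged out no faster than $1/T$; since players who do not know $A$ must mis-play the first round by a constant with constant probability, no protocol can converge faster than $\Omega(1/T)$.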

References (26)

  • Y. Freund et al., Adaptive game playing using multiplicative weights, Games Econ. Behav. (1999)

  • C. Harris, On the rate of convergence of continuous-time fictitious play, Games Econ. Behav. (1998)

  • N. Littlestone et al., The weighted majority algorithm, Inf. Comput. (1994)

  • I. Adler, The equivalence of linear programs and zero-sum games, Int. J. Game Theory (2013)

  • Y. Babichenko, Completely Uncoupled Dynamics and Nash Equilibria (2010)

  • A. Blum et al., From external to internal regret, J. Mach. Learn. Res. (2007)

  • F. Brandt et al., On the rate of convergence of fictitious play

  • G.W. Brown, Iterative solution of games by fictitious play, Act. Anal. Product. Alloc. (1951)

  • N. Cesa-Bianchi et al., Prediction, Learning, and Games (2006)

  • D.P. Foster et al., Calibrated learning and correlated equilibrium, Biometrika (1998)

  • D. Foster et al., Regret testing: learning to play Nash equilibrium without knowing you have an opponent, Theoretical Econ. (2006)

  • A. Gilpin et al., Gradient-based algorithms for finding Nash equilibria in extensive form games

  • A. Gilpin et al., Smoothing techniques for computing Nash equilibria of sequential games, Math. Operations Res. (2010)
1. Supported by a Sloan Foundation Fellowship, a Microsoft Research Fellowship, and NSF Awards CCF-0953960 (CAREER) and CCF-1101491.

2. Supported by the Fannie and John Hertz Foundation (Daniel Stroock Fellowship).

3. Work done while the author was a student at MIT. Supported in part by an NSF Graduate Research Fellowship.
