Learning to Play Efficient Coarse Correlated Equilibria

Abstract

The majority of the distributed learning literature focuses on convergence to Nash equilibria. Coarse correlated equilibria, on the other hand, can often characterize more efficient collective behavior than even the best Nash equilibrium. However, there are no existing distributed learning algorithms that converge to specific coarse correlated equilibria. In this paper, we provide one such algorithm, which guarantees that the agents’ collective joint strategy will constitute an efficient coarse correlated equilibrium with high probability. The key to attaining efficient correlated behavior through distributed learning involves incorporating a common random signal into the learning environment.

Notes

  1. We will express an action profile \(a \in \mathcal{A}\) as \(a=(a_i, a_{-i})\).

  2. For the proof of Theorem 1, we require \(\delta = \varepsilon \). However, in practice, fixing \(\delta >\varepsilon \) in order to shorten the period length, \(\bar{p},\) often yields similar results, as we demonstrate in Example 1.

References

  1. Alós-Ferrer C, Netzer N (2010) The logit-response dynamics. Games Econ Behav 68:413–427

  2. Alpcan T, Basar T (2010) Network security: a decision and game-theoretic approach, 1st edn. Cambridge University Press, Cambridge

  3. Altman E, Bonneau N, Debbah M (2006) Correlated equilibrium in access control for wireless communications. In: 5th international conference on networking

  4. Arieli I, Babichenko Y (2012) Average testing and the efficient boundary. J Econ Theory 147:2376–2398

  5. Aumann R (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55(1):1–18

  6. Barman S, Ligett K (2015) Finding any nontrivial coarse correlated equilibrium is hard. SIGecom Exch 14:76–79

  7. Blume L (1993) The statistical mechanics of strategic interaction. Games Econ Behav 5:387–424

  8. Boussaton O, Cohen J (2012) On the distributed learning of Nash equilibria with minimal information. In: 6th international conference on network games, control, and optimization

  9. Foster DP, Vohra R (1997) Calibrated learning and correlated equilibrium. Games Econ Behav 21:40–55

  10. Foster DP, Young HP (1990) Stochastic evolutionary game dynamics. Theor Popul Biol 38:219–232

  11. Foster DP, Young HP (2006) Regret testing: learning to play Nash equilibrium without knowing you have an opponent. Theor Econ 1:341–367

  12. Frihauf P, Krstic M, Basar T (2012) Nash equilibrium seeking in noncooperative games. IEEE Trans Autom Control 57(5):1192–1207

  13. Germano F, Lugosi G (2007) Global Nash convergence of Foster and Young’s regret testing. Games Econ Behav 60:135–154

  14. Gharesifard B, Cortes J (2012) Distributed convergence to Nash equilibria by adversarial networks with directed topologies. In: 51st IEEE conference on decision and control

  15. Han Z, Niyato D, Saad W, Başar T, Hjørungnes A (2012) Game theory in wireless and communication networks: theory, models, and applications, 1st edn. Cambridge University Press, Cambridge

  16. Hart S (2005) Adaptive heuristics. Econometrica 73(5):1401–1430

  17. Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68:1127–1150

  18. Hart S, Mas-Colell A (2003) Uncoupled dynamics do not lead to Nash equilibrium. Am Econ Rev 93(5):1830–1836

  19. Ho YC, Sun FK (1974) Value of information in two-team zero-sum problems. J Optim Theory Appl 14(5):557–571

  20. Jiang A, Leyton-Brown K (2011) Polynomial-time computation of exact correlated equilibrium in compact games. In: Proceedings of the 12th ACM electronic commerce conference (ACM-EC)

  21. Lasaulce S, Tembine H (2011) Game theory and learning for wireless networks, 1st edn. Elsevier, Amsterdam

  22. MacKenzie A, DaSilva L (2006) Game theory for wireless engineers, 1st edn. Morgan & Claypool Publishers, San Rafael

  23. Marden JR (2017) Selecting efficient correlated equilibria through distributed learning. Games Econ Behav 106:114–133

  24. Marden JR, Shamma JS (2012) Revisiting log-linear learning: asynchrony, completeness and payoff-based implementation. Games Econ Behav 75(2):788–808

  25. Marden JR, Shamma JS (2014) Game theory and distributed control. In: Young HP, Zamir S (eds) Handbook of game theory, vol 4. Elsevier, Amsterdam

  26. Marden JR, Young HP, Arslan G, Shamma JS (2009) Payoff based dynamics for multi-player weakly acyclic games. SIAM J Control Optim 48(1):373–396

  27. Marden JR, Young HP, Pao LY (2014) Achieving Pareto optimality through distributed learning. SIAM J Control Optim 52:2753–2770

  28. Menache I, Ozdaglar A (2011) Network games: theory, models, and dynamics, 1st edn. Morgan & Claypool Publishers, San Rafael

  29. Papadimitriou C (2005) Computing correlated equilibria in multiplayer games. In: Proceedings of the annual ACM symposium on theory of computing

  30. Papadimitriou C, Roughgarden T (2008) Computing correlated equilibria in multi-player games. J ACM 55:1–29

  31. Poveda J, Quijano N (2013) Distributed extremum seeking for real-time resource allocation. In: American control conference

  32. Pradelski B, Young HP (2012) Learning efficient Nash equilibria in distributed systems. Games Econ Behav 75:882–897

  33. Wang B, Han Z, Liu K (2009) Peer-to-peer file sharing game using correlated equilibrium. In: 43rd annual conference on information sciences and systems (CISS), pp 729–734

  34. Young HP (1993) The evolution of conventions. Econometrica 61(1):57–84

  35. Young HP (2009) Learning by trial and error. Games Econ Behav 65:626–643

Author information

Corresponding author

Correspondence to Jason R. Marden.

Additional information

This research was supported by ONR grant #N00014-17-1-2060, NSF grant #ECCS-1638214, the NASA Aeronautics scholarship program, the Philanthropic Educational Organization, the Zonta International Amelia Earhart fellowship program, and funding from King Abdullah University of Science and Technology (KAUST).

Appendix

Appendix

The formulation of the decision-making process defined in Sect. 3 ensures that the evolution of the agents’ states over the periods \(\{0, 1, 2, \dots \}\) can be represented as a finite ergodic Markov chain over the state space

$$\begin{aligned} X = X_1 \times \dots \times X_n \end{aligned}$$
(23)

where \(X_i = S_i\times \{C,D\}\) denotes the set of possible states of agent i. Let \(P^\varepsilon \) denote this Markov chain for some \(\varepsilon > 0\), with \(\delta = \varepsilon \). Proving Theorem 1 requires characterizing the stationary distribution of the family of Markov chains \(\{P^\varepsilon \}_{\varepsilon > 0}\) for all sufficiently small \(\varepsilon \). We employ the theory of resistance trees for regular perturbed processes, introduced in [34], to accomplish this task. We begin by reviewing this theory and then proceed with the proof of Theorem 1.
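
To make the product structure of (23) concrete, the following sketch enumerates the joint state space for a toy two-agent instance. The strategy sets here are hypothetical placeholders; in the paper, each \(S_i\) is built from signal-indexed action sequences.

```python
# A minimal sketch (toy example, not the paper's construction): enumerating
# the joint state space X = X_1 x ... x X_n with X_i = S_i x {C, D}.
from itertools import product

S = {1: ["s_a", "s_b"], 2: ["s_c", "s_d"]}   # hypothetical strategy sets
MOODS = ["C", "D"]                           # content / discontent

X_i = {i: list(product(S[i], MOODS)) for i in S}   # X_i = S_i x {C, D}
X = list(product(*X_i.values()))                   # X = X_1 x X_2

print(len(X))  # 4 * 4 = 16 joint states in this toy instance
```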

1.1 Background: Resistance Trees

Define \(P^0\) as the transition matrix for some nominal Markov process, and let \(P^{\varepsilon }\) be a perturbed version of this nominal process where the size of the perturbation is \(\varepsilon > 0\). Throughout this paper, we focus on the following class of Markov chains.

Definition 2

A family of Markov chains defined over a finite state space X, whose transition matrices are denoted by \(\{P^\varepsilon \}_{\varepsilon > 0}\), is called a regular perturbed process of a nominal process \(P^0\) if the following conditions are satisfied for all \(x,x^\prime \in X\):

  1. (1)

    There exists a constant \(c>0\) such that \(P^\varepsilon \) is aperiodic and irreducible for all \(\varepsilon \in (0,c]\).

  2. (2)

    \(\lim _{\varepsilon \rightarrow 0} P^{\varepsilon }_{x \rightarrow x'}= P^0_{x \rightarrow x'}\).

  3. (3)

    If \(P^\varepsilon _{x \rightarrow x'} > 0\) for some \(\varepsilon >0\), then there exists a constant \(r(x \rightarrow x') \ge 0\) such that

    $$\begin{aligned} 0<\lim _{\varepsilon \rightarrow 0} \frac{P^\varepsilon _{x \rightarrow x'}}{\varepsilon ^{r(x \rightarrow x')}}<\infty . \end{aligned}$$
    (24)

    The constant \(r(x \rightarrow x')\) is referred to as the resistance of the transition \(x \rightarrow x'.\)
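
Condition (3) says that \(P^\varepsilon _{x \rightarrow x'}\) scales as \(\varepsilon ^{r(x \rightarrow x')}\) for small \(\varepsilon \), so the resistance can be read off as the limit of \(\log P^\varepsilon _{x \rightarrow x'} / \log \varepsilon \). A minimal numerical sketch, with a hypothetical transition kernel standing in for the actual process:

```python
# Hedged sketch: recovering the resistance of Definition 2(3) numerically.
# P_eps below is a made-up perturbed transition probability with
# resistance 2; it is not the transition kernel of the paper's process.
import math

def P_eps(eps: float) -> float:
    return 0.5 * eps**2 + 0.1 * eps**3

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(eps, math.log(P_eps(eps)) / math.log(eps))
# the printed ratios approach 2, the resistance of this transition
```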

For any \(\varepsilon > 0\), let \(\mu ^{\varepsilon } = \{\mu ^{\varepsilon }_x \}_{x \in X} \in \Delta (X)\) denote the unique stationary distribution associated with \(P^{\varepsilon }\). The theory of resistance trees presented in [34] provides efficient mechanisms for computing the support of the limiting stationary distribution, i.e., \(\lim _{\varepsilon \rightarrow 0^+} \mu ^{\varepsilon }\), commonly referred to as the stochastically stable states.

Definition 3

A state \(x \in X\) is stochastically stable [10] if \(\lim _{\varepsilon \rightarrow 0^+}\mu _x^\varepsilon >0\), where \(\mu ^\varepsilon \) is the stationary distribution corresponding to \(P^\varepsilon .\)

In this paper, we adopt the technique provided in [34] for identifying the stochastically stable states through a graph theoretic analysis over the recurrent classes of the unperturbed process \(P^0\). To that end, let \(Y_0, Y_1, \dots , Y_m\) denote the recurrent classes of \(P^0\). Define \(\mathcal{P}_{i j}\) to be the set of all paths connecting \(Y_i\) to \(Y_j\), i.e., a path \(p \in \mathcal{P}_{i j}\) is of the form \(p=\{(x_1, x_2), (x_2, x_3), \dots , (x_{k-1}, x_k)\}\) where \(x_1 \in Y_i\) and \(x_k \in Y_j\). The resistance associated with transitioning from \(Y_i\) to \(Y_j\) is defined as

$$\begin{aligned} r(Y_i , Y_j) = \min _{p \in \mathcal{P}_{i j}} \sum _{(x,x') \in p} r(x,x'). \end{aligned}$$
(25)

The recurrent classes \(Y_0,Y_1,\ldots ,Y_m\) satisfy the following properties: (i) there is a zero resistance path, i.e., a sequence of transitions each with zero resistance, from any state \(x \in X\) to at least one state y in one of the recurrent classes; (ii) for any recurrent class \(Y_i\) and any states \(y_i,y_i' \in Y_i\), there is a zero resistance path from \(y_i\) to \(y_i'\); and (iii) for any state \(y_i \in Y_i\) and \(y_j \in Y_j\), \(Y_i \ne Y_j\), any path from \(y_i\) to \(y_j\) has strictly positive resistance.
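
Since per-transition resistances are nonnegative, the minimization in (25) is a shortest-path computation over the state graph. Below is a sketch using Dijkstra's algorithm; the states and edge resistances are invented for illustration.

```python
# Hedged sketch of (25): r(Y_i, Y_j) as the minimum total resistance over
# paths from Y_i to Y_j. Edge resistances here are made up.
import heapq

def min_resistance(edges, sources, targets):
    """Dijkstra over per-transition resistances r(x, x') >= 0."""
    dist = {x: 0.0 for x in sources}
    heap = [(0.0, x) for x in sources]
    while heap:
        d, x = heapq.heappop(heap)
        if x in targets:
            return d
        if d > dist.get(x, float("inf")):
            continue
        for y, r in edges.get(x, []):
            if d + r < dist.get(y, float("inf")):
                dist[y] = d + r
                heapq.heappush(heap, (d + r, y))
    return float("inf")

# x1 in Y_i; y1 in Y_j; x2 is an intermediate state
edges = {"x1": [("x2", 1.0)], "x2": [("y1", 0.5)], "y1": []}
print(min_resistance(edges, sources={"x1"}, targets={"y1"}))  # 1.5
```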

The first step in identifying the stochastically stable states is to identify the resistance between the various recurrent classes. The second step focuses on analyzing spanning trees of the weighted, directed graph \(\mathcal {G}\) whose vertices are the recurrent classes of the process \(P^0\) and whose edge weights are given by the resistances between classes in (25). Let \(\mathcal {T}_{i}\) denote the set of all spanning trees of \(\mathcal {G}\) rooted at recurrent class \(Y_i\). Next, we compute the stochastic potential of each recurrent class, which is defined as follows:

Definition 4

The stochastic potential of recurrent class \(Y_i\) is

$$\begin{aligned} \gamma (Y_i) = \min _{\mathcal{T} \in \mathcal {T}_{i}} \sum _{(Y, Y')\in \mathcal{T}} r(Y,Y') \end{aligned}$$

The following theorem characterizes the recurrent classes that are stochastically stable.

Theorem 3

([34]) Let \(P^0\) be the transition matrix for a stationary Markov process over the finite state space X with recurrent communication classes \(Y_1,\ldots ,Y_m\). For each \(\varepsilon > 0\), let \(P^\varepsilon \) be a regular perturbation of \(P^0\) with a unique stationary distribution \(\mu ^\varepsilon \). Then:

  1. (1)

    As \(\varepsilon \rightarrow 0\), \(\mu ^\varepsilon \) converges to a stationary distribution \(\mu ^0\) of \(P^0.\)

  2. (2)

    A state \(x \in X\) is stochastically stable if and only if x is contained in a recurrent class \(Y_j\) that minimizes \(\gamma (Y_j).\)
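
For a small number of recurrent classes, Definition 4 and Theorem 3 can be applied by brute force: enumerate the spanning trees rooted at each class, take the minimum total resistance, and select the classes with minimal stochastic potential. A sketch with invented inter-class resistances:

```python
# Hedged sketch of Definition 4 and Theorem 3: brute-force stochastic
# potentials over three recurrent classes. The resistances r[i][j] are
# made up for illustration.
from itertools import product

r = {0: {1: 1.0, 2: 3.0}, 1: {0: 2.0, 2: 1.0}, 2: {0: 2.5, 1: 1.5}}
classes = list(r)

def gamma(root):
    best = float("inf")
    others = [i for i in classes if i != root]
    # a spanning tree rooted at `root` gives every other class one out-edge
    for choice in product(*[[(i, j) for j in r[i]] for i in others]):
        succ = dict(choice)
        def reaches_root(i, seen=()):
            if i == root: return True
            if i in seen: return False        # cycle: not a tree
            return reaches_root(succ[i], seen + (i,))
        if all(reaches_root(i) for i in others):
            best = min(best, sum(r[i][j] for i, j in choice))
    return best

potentials = {i: gamma(i) for i in classes}
print(potentials)                              # {0: 3.5, 1: 2.5, 2: 2.0}
print("stochastically stable:", min(potentials, key=potentials.get))  # 2
```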

1.2 Proof of Theorem 1

We begin by restating the main results associated with Theorem 1 (setting \(\delta = \varepsilon \)) using the terminology defined in the previous section.

  • If \(q(S) \cap \mathrm{CCE} \ne \emptyset \), then a state \(x=\{x_i = [s_i, m_i] \}_{i \in N}\) is stochastically stable if and only if (i) \(m_i = C\) for all \(i \in N\) and (ii) the strategy profile \(s = (s_1, \dots , s_n)\) constitutes an efficient coarse correlated equilibrium, i.e.,

    $$\begin{aligned} q(s) \in \underset{ q \in q(S) \cap \mathrm{CCE}}{\arg \max } \ \sum _{i \in N} \sum _{a \in \mathcal{A}} \ U_i(a) q^a. \end{aligned}$$
    (26)
  • If \(q(S) \cap \mathrm{CCE} = \emptyset \), then a state \(x=\{x_i = [s_i, m_i] \}_{i \in N}\) is stochastically stable if and only if (i) \(m_i = C\) for all \(i \in N\) and (ii) the strategy profile \(s = (s_1, \dots , s_n)\) constitutes an efficient action profile, i.e.,

    $$\begin{aligned} q(s) \in \underset{ q \in q(S)}{\arg \max } \ \sum _{i \in N} \sum _{a \in \mathcal{A}} \ U_i(a) q^a. \end{aligned}$$
    (27)

For convenience, and with an abuse of notation, define

$$\begin{aligned} U_i(s) := \sum _{a\in \mathcal {A}}U_i(a)q^a(s) \end{aligned}$$
(28)

to be agent i’s expected utility with respect to distribution q(s), where \(s\in S.\)

The proof of Theorem 1 will consist of the following steps:

  1. (i)

    Define the unperturbed process, \(P^0\).

  2. (ii)

    Determine the recurrent classes of process \(P^0\).

  3. (iii)

    Establish transition probabilities of process \(P^\varepsilon \).

  4. (iv)

    Determine the stochastically stable states of \(P^\varepsilon \) using Theorem 3.

Part 1: Defining the unperturbed process

The unperturbed process \(P^0\) is effectively the process identified in Sect. 3 with \(\varepsilon = 0\). Rather than restate the entire process, here we highlight the main attributes of the unperturbed process that may not be obvious upon initial inspection.

  • If agent i is content, i.e., \(x_i = [s_i^b, C]\), the trial action is \(s_i^t = s_i^b\) with probability 1. Otherwise, if agent i is discontent, the trial action is selected according to (22).

  • The baseline utility \(u_i^b\) in (8) associated with joint baseline strategy \(s^b\) is now of the form

    $$\begin{aligned} u_i^b = U_i(s^b) . \end{aligned}$$
    (29)

    This follows from the law of large numbers, since the period length \(\bar{p}= \lceil 1/\varepsilon ^{nc+2}\rceil \) grows without bound as \(\varepsilon \rightarrow 0\). The trial utility \(u_i^t\) and acceptance utility \(u_i^a\) take the same form; a small numerical illustration of (29) appears after this list.

  • A content player becomes discontent only if \(u_i^a < u_i^b\), where the associated payoffs are computed according to (29).
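
The following sketch illustrates the expectation in (28)-(29) for a hypothetical distribution q(s^b) over joint actions and made-up payoffs; it is only a numerical reading of the formula, not part of the algorithm.

```python
# Hedged sketch of (28)-(29): in the unperturbed process, agent i's baseline
# utility is the exact expectation U_i(s^b) = sum_a U_i(a) q^a(s^b).
# Both q_sb and U_1 below are invented for illustration.
q_sb = {("a1", "b1"): 0.5, ("a1", "b2"): 0.25, ("a2", "b2"): 0.25}
U_1 = {("a1", "b1"): 1.0, ("a1", "b2"): 0.2, ("a2", "b2"): 0.6}

u_1_baseline = sum(U_1[a] * q_a for a, q_a in q_sb.items())
print(u_1_baseline)  # 0.5*1.0 + 0.25*0.2 + 0.25*0.6 = 0.7
```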

Part 2: Recurrent classes of the unperturbed process

The second part of the proof analyzes the recurrent classes of the unperturbed process \(P^0\) defined above; the following lemma identifies them.

Lemma 1

A state \(x = (x_1,x_2,\ldots ,x_n)\in X\) belongs to a recurrent class of the unperturbed process \(P^0\) if and only if the state x fits one of the following two forms:

  • Form #1: The state for each agent \(i \in N\) is of the form \(x_i = \left[ s_i^b,C\right] \) where \(s_i^b \in S_i\). Each state of this form constitutes a distinct recurrent class. We represent the set of states of this form by \(C^0\).

  • Form #2: The state for each agent \(i \in N\) is of the form \(x_i = \left[ s_i^b,D\right] \) where \(s_i^b \in S_i\). All states of this form comprise a single recurrent class, represented by \(D^0\).

Proof:

We begin by showing that any state \(x \in C^0\) is a recurrent class of the unperturbed process. According to \(P^0\), if the system reaches state x, then it remains at x with certainty for all future time. Hence, each \(x\in C^0\) is a recurrent class of \(P^0.\) Next, we show that \(D^0\) constitutes a single recurrent class. Consider any two states \(x,y\in D^0\). According to the unperturbed process, \(P^0\), the probability of transitioning from x to y is strictly positive \(\left( \ge \prod _{i\in N}1/|S_i|\right) \); hence, the resistance of the transition \(x \rightarrow y\) is 0. Further note that the probability of transitioning to any state not in \(D^0\) is zero. Hence, \(D^0\) forms a single recurrent class of \(P^0\).

The last part of the proof shows that any state \( x = \{[s_i^b, m_i]\}_{i \in N} \notin C^0 \cup D^0\) is not recurrent in \(P^0\). Since \(x\notin {C^0\cup D^0}\), the state contains both content and discontent players. Denote the set of discontent players by \(J= \{i\in N\,:\,m_i = D\} \ne \emptyset \). We will show that, with positive probability, the discontent players J play a sequence of strategies that drives at least one content player to become discontent. Repeating this argument at most n times shows that any state x of the above form will eventually transition to the all-discontent state, proving that x is not recurrent.

To that end, let \(x(1) = x\) be the state at the beginning of the first period. According to the unperturbed process \(P^0\), each discontent player randomly selects a strategy \(s_i \in S_i\), which becomes part of the player's state at the ensuing stage. Suppose each discontent agent selects a trial strategy \(s_i = (a_i^1, \dots , a_i^w) \in \mathcal{A}_i^w \subset S_i\) during the first period, i.e., the discontent players select strategies of the finest granularity. Note that each agent selects such a strategy with probability at least \({1\mathop {/}|S_i|}.\) Here, the trial payoff for each player \(i \in N\) associated with the joint strategies \(s = (\{s_i^b\}_{i \notin J}, \{s_i\}_{i \in J})\) is

$$\begin{aligned} u_i^t(s)= & {} \int _{0}^{1} U_i(s(z)) \mathrm{d}z \end{aligned}$$
(30)
$$\begin{aligned}= & {} \frac{1}{w} U_i({a}) + \int _{1/w}^{1} U_i(s(z)) \mathrm{d}z, \end{aligned}$$
(31)

for some \({a} \in \mathcal{A}\), since \(s_i(z) = s_i(z')\) for all \(z,z' \in [0,1/w]\) and all agents \(i \in N\). If \(u_i^t < u_i^b\) for some agent \(i \notin J\), that agent becomes discontent in the next stage and we are done.

For the remainder of the proof, suppose \(u_i^t(s) \ge u_i^b(s^b)\) for all agents \(i \notin J\). This implies that all agents in \(N {\setminus } J\) will be content at the beginning of the second stage. By interdependence, there exists a collective action \(\tilde{a}_J \in \prod _{j \in J} \mathcal{A}_j\) and an agent \(i \notin J\) such that \(U_i(a) \ne U_i(\tilde{a}_J, a_{N{\setminus } J})\). Suppose each discontent agent selects the trial strategy \(s_i' = (\tilde{a}_i^1, a_i^2, \dots , a_i^w) \in \mathcal{A}_i^w \subset S_i\) during the second period, i.e., only the first component of the strategy changes. The trial payoff for each player \(i \in N\) associated with the joint strategies \(s' = (\{s_i^b\}_{i \notin J}, \{s_i'\}_{i \in J})\) is

$$\begin{aligned} u_i^t(s')= & {} \int _{0}^{1} U_i(s'(z)) \mathrm{d}z \\= & {} \frac{1}{w} U_i(\tilde{a}_J, a_{N{\setminus } J}) + \int _{1/w}^{1} U_i(s'(z)) \mathrm{d}z \\\ne & {} u_i^t(s) \end{aligned}$$

If \(u_i^t(s') < u_i^t(s)\), agent i will become discontent at the ensuing stage and we are done. Otherwise, agent i will stay content at the ensuing stage. However, if each discontent agent selects the trial strategy \(s_i'' = (a_i^1, a_i^2, \dots , a_i^w) \in \mathcal{A}_i^w \subset S_i\) during the third period, then \(u_i^t(s'') < u_i^t(s')\), since \(u_i^t(s'') = u_i^t(s)\) and, in this case, \(u_i^t(s') > u_i^t(s)\), where \(s'' = (\{s_i^b\}_{i \notin J}, \{s_i''\}_{i \in J})\). Hence, agent i will become discontent at the beginning of period 4. This argument can be repeated at most n times, completing the proof. \(\square \)

Part 3: Transition probabilities of process \(P^\varepsilon \)

Here, we establish the transition probability \(P^\varepsilon _{x\rightarrow x^+}\) for a pair of arbitrary states \(x,x^+\in X.\) Let \(x_i = [s_i,m_i]\) and \(x_i ^+= [s_i^+,m_i^+]\) for \(i\in N\), \(s = (s_1,s_2,\ldots ,s_n)\), and \(s^+ = (s_1^+,s_2^+,\ldots ,s_n^+).\) Then,

$$\begin{aligned} P^\varepsilon _{x\rightarrow x^+} = \sum _{\tilde{s}^t\in S}\sum _{\tilde{s}^a\in S} {\text { Pr }}[x^+\,|\,s^t = \tilde{s}^t,\, s^a = \tilde{s}^a]\, {\text { Pr }}[s^a = \tilde{s}^a\,|\,s^t = \tilde{s}^t]\, {\text { Pr }}[s^t = \tilde{s}^t] \end{aligned}$$
(32)

Note that the strategy selections and state transitions are conditioned on state x; for notational brevity we do not explicitly write this dependence. Here, \(s^t\) and \(s^a\) represent the joint trial and acceptance strategies during the period before the transition to \(x^+\). The double summation in (32) is over all possible trial strategies, \(\tilde{s}^t\in S\), and acceptance strategies, \(\tilde{s}^a\in S\). However, recall from (14) to (17) that, when transitioning from x to \(x^+\), not all strategies can serve as intermediate trial and acceptance strategies. In particular, transitioning to state \(x^+\) requires that \(s^a = s^+\); hence, if \(\tilde{s}^a\ne s^+,\) then \({\text { Pr }}[x^+\,|\,s^t = \tilde{s}^t,\, s^a = \tilde{s}^a]=0,\) so we can rewrite (32) as:

$$\begin{aligned} P^\varepsilon _{x\rightarrow x^+} = \sum _{\tilde{s}^t\in S} {\text { Pr }}[x^+\,|\,s^t = \tilde{s}^t,\, s^a = s^+]\, {\text { Pr }}[s^a = s^+\,|\,s^t = \tilde{s}^t]\, {\text { Pr }}[s^t = \tilde{s}^t]. \end{aligned}$$
(33)

There are three cases for the transition probabilities in (33). Before proceeding, we make the following observations. The last term in (33), \({\text { Pr }}[ s^t = \tilde{s}^t]\), is defined in Sect. 3; we will not repeat the definition here. For the first two terms, the agents' state transitions and strategy selections are independent when conditioned on state x and on the joint trial and acceptance strategies. Hence, we can write the first term as:

$$\begin{aligned} {\text { Pr }}[x^+\,|\,s^t = \tilde{s}^t, s^a = s^+]= \prod _{i\in N}{\text { Pr }}[x_i^+\,|\,s^t = \tilde{s}^t, s^a = s^+] \end{aligned}$$
(34)

and the second term as:

$$\begin{aligned} {\text { Pr }}[ s^a = s^+\,|\,s^t = \tilde{s}^t] = \prod _{i\in N}{\text { Pr }}[ s_i^a = s_i^+\,|\,s^t = \tilde{s}^t]. \end{aligned}$$
(35)

The following three cases specify individual agents’ probability of choosing the acceptance strategy \(s_i^a\) in (35) and transitioning to state \(x_i^+\) in (34).

Case (i): agent i is content in state x, i.e., \(m_i = C\), and did not experiment, \(s_i^t = s_i\):

For (35), since \(s_i^a\in \{s_i^t,s_i\}\) we know that

$$\begin{aligned} {\text { Pr }}[ s_i^a = s_i^+\,|\,s^t = \tilde{s}^t] =\left\{ \begin{array}{ll} 1 &{}\quad \text {if } s_i^+ = s_i\\ 0 &{} \quad \text {otherwise} \end{array} \right. . \end{aligned}$$

In (34), for any trial strategy \(s^t = \tilde{s}^t\), the probability of transitioning to a state \(x_i^+\) depends on realized average payoffs \(u_i^b\) and \(u_i^a\). In particular, if \(x_i^+ = [s_i^+,C]\), then we must have that \(u_i^a\ge u_i^b - \varepsilon \), so

$$\begin{aligned}&{\text { Pr }}\biggl [x_i^+ = [s_i^+,C]\,|\,s^a = s^+, s^t = \tilde{s}^t\biggr ]\\&\quad = \int _0^1 {\text { Pr }}[u_i^b = \eta ] \int _{\eta -\varepsilon }^1 {\text { Pr }}[u_i^a = \nu \,|\,s^t = \tilde{s}^t, s^a = s^+] \mathrm{d}\nu \mathrm{d}\eta . \end{aligned}$$

Then, the probability that \(x_i^+ = [s_i^+,D]\) is

$$\begin{aligned} 1 - {\text { Pr }}\biggl [x_i^+ = [s_i^+,C]\,|\,s^a = s^+, s^t = \tilde{s}^t\biggr ]. \end{aligned}$$

Case (ii): agent i is content and experimented, \(s_i^t\ne s_i\):

For (35), agent i's acceptance strategy depends on its average baseline and trial payoffs, \(u_i^b\) and \(u_i^t\). Recall that if \(u_i^t\ge u_i^b+\varepsilon ,\) then \(s_i^a = s_i^t\), i.e., agent i adopts its trial strategy; otherwise, \(s_i^a = s_i\), its baseline strategy from state x. Utilities \(u_i^b\) and \(u_i^t\) depend on the joint strategies s and \(s^t\) and on the common random signals sent during the corresponding phases. Therefore,

$$\begin{aligned}&{\text { Pr }}[ s_i^a = s_i^+\,|\,s^t = \tilde{s}^t\ne s]\nonumber \\&\quad =\int _0^1\int _0^1 {\text { Pr }}[ s_i^a = s_i^+\,|\,u_i^b = \eta , u_i^t = \nu , s_i^t = \tilde{s}_i^t]\times \, {\text { Pr }}[u_i^b = \eta ]\,{\text { Pr }}[u_i^t = \nu \,|\,s^t = \tilde{s}^t]\,\mathrm{d}\eta \,\mathrm{d}\nu . \end{aligned}$$

In (34), since agent i remains content and sticks with its acceptance strategy from the previous period,

$$\begin{aligned}&{\text { Pr }}[x_i^+\,|\,s^a = s^+, s^t = \tilde{s}^t]= \left\{ \begin{array}{ll} 1 &{} \quad \text {if }s_i^+ = s_i^a\\ 0 &{} \quad \text {otherwise} \end{array} \right. . \end{aligned}$$

Case (iii): agent i is discontent:

For (35),

$$\begin{aligned} {\text { Pr }}[ s_i^a = s_i^+\,|\,s^t = \tilde{s}^t] = \left\{ \begin{array}{ll} 1 &{}\quad \text {if } s_i^+ = s_i^t\\ 0 &{}\quad \text {otherwise} \end{array} \right. . \end{aligned}$$

In (34), agent i’s probability of becoming content depends only on its received payoff during the acceptance phase; it becomes content with probability \(\varepsilon ^{1 - u_i^a}\) and remains discontent with probability \(1 - \varepsilon ^{1 - u_i^a}\). Hence, if \(x_i^+ = [s_i^+,C]\),

$$\begin{aligned}&{\text { Pr }}\biggl [x_i^+ = [s_i^+,C] \,|\,s^a = s^+, s^t = \tilde{s}^t\biggr ]= \int _0^1 \varepsilon ^{1 - \eta }{\text { Pr }}[u_i^a = \eta \,|\,s^a = s^+, s^t = \tilde{s}^t]\mathrm{d}\eta . \end{aligned}$$

Then,

$$\begin{aligned}&{\text { Pr }}\biggl [x_i^+ = [s_i^+,D] \,|\,s^a = s^+, s^t = \tilde{s}^t\biggr ]= 1 - {\text { Pr }}\biggl [x_i^+ = [s_i^+,C] \,|\,s^a = s^+, s^t = \tilde{s}^t\biggr ] \end{aligned}$$

Now that we have established transition probabilities for process \(P^\varepsilon \), we may state the following lemma.

Lemma 2

The process \(P^\varepsilon \) is a regular perturbation of \(P^0.\)

It is straightforward to see that \(P^\varepsilon \) satisfies the first two conditions of Definition 2 with respect to \(P^0\). The fact that the transition probabilities satisfy the third condition, Eq. (24), follows from the fact that the dominant terms in \(P^\varepsilon _{x\rightarrow y}\) are polynomial in \(\varepsilon \). This is immediately clear in all but the incorporation of realized utilities into the transition probabilities, as in (33). However, for any joint strategy s and associated average payoff \(u_i\), we have

$$\begin{aligned} \mathbb {E}[u_i] = \mathbb {E}\left[ {1\over \bar{p}}\sum _{\tau = \ell }^{\ell +\bar{p}-1} U_i(s(z(\tau )))\right] = U_i(s) \end{aligned}$$

for any time period of length \(\bar{p}\) in which joint strategy s is played throughout the entire period. Moreover, \({\text { Var }}\bigl [U_i(s(z(\tau )))\bigr ] \le 1.\) Therefore, we may use Chebyshev's inequality and the fact that \(\bar{p} = \lceil 1\mathbin {/} \varepsilon ^{nc+2} \rceil \) to see that

$$\begin{aligned} {\text { Pr }}\Bigl [\bigl | u_i - U_i(s)\bigr |\ge \varepsilon \Bigr ] \le { {\text { Var }}\bigl [U_i(s(z(\tau )))\bigr ]\over \bar{p} \varepsilon ^2}\le \varepsilon ^{nc}. \end{aligned}$$
(36)

Note that this applies for all average utilities, \(u_i^b, u_i^t,\) and \(u_i^a\) in the aforementioned state transition probabilities.
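
The bound (36) can be checked numerically. The sketch below draws \(\bar{p}\) i.i.d. payoffs in [0, 1] and estimates how often the empirical average deviates from its mean by at least \(\varepsilon \); the uniform payoff distribution is a made-up stand-in for \(U_i(s(z(\tau )))\).

```python
# Hedged sketch of (36): with p_bar = ceil(1 / eps^(n*c + 2)), the average
# of bounded i.i.d. payoffs leaves the eps-window around its mean with
# probability at most eps^(n*c). Payoffs here are Uniform[0, 1], an
# invented stand-in for the realized utilities.
import math, random

eps, n, c = 0.2, 2, 2                     # small example parameters
p_bar = math.ceil(1 / eps**(n * c + 2))   # period length from the proof

def deviates() -> bool:
    u = sum(random.random() for _ in range(p_bar)) / p_bar
    return abs(u - 0.5) >= eps            # mean payoff is 0.5

rate = sum(deviates() for _ in range(200)) / 200
print(rate, "<=", eps**(n * c))           # Chebyshev guarantees the bound
```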

Part 4: Determining the stochastically stable states

We begin by defining

$$\begin{aligned} C^{\star } := \{x = \{[s_i,m_i]\}_{i\in N}\,:\,q(s)\in \text {CCE} \text { and } m_i = C,\,\forall i\in N\}\subseteq C^0 \end{aligned}$$

Here, we show that, if \(C^{\star }\) is non-empty, then a state x is stochastically stable if and only if q(s) satisfies (26). The fact that q(s) must satisfy (27) when \(C^{\star } = \emptyset \) follows in a similar manner. To accomplish this task, we (1) establish resistances between recurrent classes and (2) compute stochastic potentials of each recurrent class.

1.3 Resistances Between Recurrent Classes

We summarize resistances between recurrent classes in the following claim.

Claim 1

Resistances between recurrent classes satisfy:

For \(x \in C^0\) with corresponding joint strategy s,

$$\begin{aligned} r(D^0\rightarrow x) = \sum _{i\in N}(1 - U_i(s)). \end{aligned}$$
(37)

For a transition of the form \(x\rightarrow y\), where \(x\in C^{\star }\) and \(y \in (C^0\cup D^0){\setminus } \{x\},\)

$$\begin{aligned} r(x\rightarrow y)\ge 2c. \end{aligned}$$
(38)

For a transition of the form \(x\rightarrow y\) where \(x\in C^0{\setminus } C^{\star }\) and \(y\in (C^0\cup D^0 ){\setminus } \{x\}\),

$$\begin{aligned} r(x\rightarrow y)\ge c. \end{aligned}$$
(39)

For every \(x\in C^0{\setminus } C^{\star }\), there exists a path \(x = x^0\rightarrow x^1\rightarrow \cdots \rightarrow x^m\in C^{\star }\cup D^0\) with resistance

$$\begin{aligned} r(x^j\rightarrow x^{j+1}) = c,\;\forall j\in \{0,1,\ldots ,m-1\}. \end{aligned}$$
(40)

These resistances are computed in a similar manner to the proof establishing resistances in [23]; however, care must be taken because there is a small probability that average received utilities fall outside the window \(U_i(s)\pm \varepsilon \) during a phase in which joint strategy s is played. We illustrate this by proving (37) in detail; the proofs for the other types of transitions are omitted for brevity.

Proof:

Let \(x\in D^0\) and \(x^+\in C^0\), with \(x_i = [s_i,D]\) and \(x_i^+ = [s_i^+,C]\) for each \(i\in N.\) Again, for notational brevity, we drop the dependence on state x in the following probabilities. Note that each agent i must select \(s_i^t = s_i^+\) in order to transition to state \(x_i^+ = [s_i^+,C]\); otherwise, the transition probability is 0. We have

$$\begin{aligned} P^\varepsilon _{x\rightarrow x^+}&{\mathop {=}\limits ^{(a)}} {\text { Pr }}[x^+\,|\,s^a = s^+, s^t = s^+]\, {\text { Pr }}[ s^a = s^+\,|\,s^t = s^+]\,{\text { Pr }}[ s^t = s^+]\\&{\mathop {=}\limits ^{(b)}} {\text { Pr }}[x^+\,|\,s^a = s^+, s^t = s^+]\,{\text { Pr }}[ s^t = s^+]\\&{\mathop {=}\limits ^{(c)}} {\text { Pr }}[x^+\,|\,s^a = s^+, s^t = s^+]\prod _{i\in N}1\mathop {/}|S_i|\\&= \prod _{i\in N}{1\over |S_i|}{\text { Pr }}[x_i^+\,|\,s^a = s^+,s^t = s^+] \end{aligned}$$

where: (a) follows from the fact that \(s_i^a = s_i^t\) since \(m_i = D\) in state x for all \(i\in N\), (b) \({\text { Pr }}[ s^a = s^+\,|\,s^t = s^+] = 1\) since all agents are discontent and hence commit to their trial strategies during the acceptance period, and (c) \({\text { Pr }}[ s^t = s^+] = \prod _{i\in N}1\mathop {/}|S_i|\) since each discontent agent selects its trial strategy uniformly at random from \(S_i\).

We now show that

$$\begin{aligned} 0< \lim _{\varepsilon \rightarrow 0^+} \frac{P^\varepsilon _{x\rightarrow x^+}}{\varepsilon ^{\sum _{i\in N}1 - U_i(s^+)}}<\infty \end{aligned}$$
(41)

satisfying (24). For notational simplicity, we define

$$\begin{aligned} U_i^+&:= U_i(s^+)+\varepsilon ,\nonumber \\ U_i^-&:=U_i(s^+) - \varepsilon . \end{aligned}$$
(42)

We first lower bound \(P^\varepsilon _{x\rightarrow x^+} :\)

$$\begin{aligned} P^\varepsilon _{x\rightarrow x^+}&= \prod _{i\in N}{1\over |S_i|}{\text { Pr }}[x_i^+ \,|\,s^a = s^+, s^t =s^+]\nonumber \\&= \prod _{i\in N}{1\over |S_i|}\int _0^1 {\text { Pr }}[u_i^a = \eta \,|\,s^a = s^+, s^t = s^+]\varepsilon ^{1-\eta }\mathrm{d}\eta \nonumber \\&\ge \prod _{i\in N}{1\over |S_i|}\int _{U_i^-}^{U_i^+}{\text { Pr }}[u_i^a = \eta \,|\,s^a = s^+, s^t = s^+ ]\varepsilon ^{1-\eta }\mathrm{d}\eta \nonumber \\&{\mathop {\ge }\limits ^{(a)}}\prod _{i\in N}{\varepsilon ^{1 - U_i^-}\over |S_i|}\int _{U_i^-}^{U_i^+} {\text { Pr }}[u_i^a = \eta \,|\,s^a = s^+, s^t = s^+] \mathrm{d}\eta \nonumber \\&{\mathop {\ge }\limits ^{(b)}}\prod _{i\in N}{\varepsilon ^{1 - U_i^-}\over |S_i|}(1-\varepsilon ^{nc})\nonumber \\&={\varepsilon ^{\sum _{i\in N} (1 - U_i^-)} + O(\varepsilon ^{nc})\over \prod _{i\in N} |S_i|} \end{aligned}$$
(43)

where (a) follows from the fact that \(\varepsilon ^{1-\eta }\) is continuous and increasing in \(\eta \) for \(\varepsilon \in (0,1),\) and (b) follows from (36). Continuing in a similar fashion, it is straightforward to show

$$\begin{aligned} P^\varepsilon _{x\rightarrow x^+}\le \varepsilon ^{\sum _{i\in N}(1-U_i^+)} + O(\varepsilon ^{nc}). \end{aligned}$$
(44)

Given (43) and (44), and the fact that \(U_i^+\) and \(U_i^-\) satisfy (42), we have that \(P_{x\rightarrow x^+}^\varepsilon \) satisfies (24) with resistance \(\sum _{i\in N}\left( 1 - U_i(s^+)\right) \) as desired. \(\square \)

1.4 Stochastic Potentials

The following lemma specifies stochastic potentials of each recurrent class. Using resistances from Claim 1, the stochastic potentials follow from the same arguments as in [23]. The proof is repeated below for completeness.

Lemma 3

Let \(x\in C^0{\setminus } C^{\star }\) with corresponding joint strategy s, and let \(x^{\star }\in C^{\star }\) with corresponding joint strategy \(s^\star .\) The stochastic potentials of each recurrent class are:

$$\begin{aligned} \gamma (D^0)&= c|C^0{\setminus } C^\star | + 2c|C^\star |,\\ \gamma (x)&= \left( |C^0{\setminus } C^\star | - 1\right) c + 2c|C^\star | + \sum _{i\in N}(1 - U_i(s)),\\ \gamma (x^\star )&= |C^0{\setminus } C^\star |c + 2c\left( |C^\star |-1\right) + \sum _{i\in N}(1 - U_i(s^\star )), \end{aligned}$$

Proof:

In order to establish the stochastic potentials for each recurrent class, we will lower and upper bound them.

Lower bounding the stochastic potentials: To lower bound the stochastic potentials of each recurrent class, we determine the lowest possible resistance that a tree rooted at each of these classes may have.

(1) Lower bounding \(\gamma (D^0)\):

$$\begin{aligned} \gamma (D^0) \ge c|C^0{\setminus } C^\star | + 2c|C^\star | \end{aligned}$$

In a tree rooted at \(D^0\), each state in \(C^0\) must have an exiting edge. To exit a state in \(C^0{\setminus } C^\star \), it suffices for a single agent to experiment, contributing resistance c. To exit a state in \(C^\star \), at least two agents must experiment, contributing resistance 2c.

(2) Lower bounding \(\gamma (x)\), \(x\in C^0{\setminus } C^\star \):

$$\begin{aligned} \gamma (x) \ge \left( |C^0{\setminus } C^\star | - 1\right) c + 2c|C^\star | + \sum _{i\in N}(1 - U_i(s)) \end{aligned}$$

Here, each state in \(C^0{\setminus } \{x\}\) must have an exiting edge, which contributes resistance \(\left( |C^0{\setminus } C^\star | - 1\right) c + 2c|C^\star |.\) The recurrent class \(D^0\) must also have an exiting edge, contributing at least resistance \(\sum _{i\in N}(1 - U_i(s)).\)

(3) Lower bounding \(\gamma (x^\star )\), \(x^\star \in C^\star \):

$$\begin{aligned} \gamma (x^\star ) \ge |C^0{\setminus } C^\star |c + 2c\left( |C^\star |-1\right) + \sum _{i\in N}(1 - U_i(s^\star )) \end{aligned}$$

Again, each state in \(C^0{\setminus } \{x^\star \}\) must have an exiting edge, which contributes resistance \(|C^0{\setminus } C^\star |c + 2c\left( |C^\star |-1\right) .\) The recurrent class \(D^0\) must also have an exiting edge, contributing resistance at least \(\sum _{i\in N}(1 - U_i(s^\star )).\)

Upper bounding the stochastic potentials: In order to upper bound the stochastic potentials, we construct trees rooted at each recurrent class which have precisely the resistances established above.

(1) Upper bounding \(\gamma (D^0)\):

$$\begin{aligned} \gamma (D^0) \le c|C^0{\setminus } C^\star | + 2c|C^\star | \end{aligned}$$

Begin with an empty graph with vertices X. For each state \(x\in C^0{\setminus } C^\star \), add a path ending in \(C^\star \cup D^0\) so that each edge has resistance c; this is possible due to Claim 1. Now eliminate redundant edges; what remains contributes resistance at most \(c|C^0{\setminus } C^\star |\) since each state in \(C^0{\setminus } C^\star \) has exactly one outgoing edge. Finally, add an edge \(x^\star \rightarrow D^0\) for each \(x^\star \in C^\star \); this contributes resistance \(2c|C^\star |\).

(2) Upper bounding \(\gamma (x)\), \(x\in C^0{\setminus } C^\star \):

$$\begin{aligned} \gamma (x) \le \left( |C^0{\setminus } C^\star | - 1\right) c + 2c|C^\star | + \sum _{i\in N}(1 - U_i(s)), \end{aligned}$$

This follows by a similar argument to the previous upper bound, except here we add an edge \(D^0 \rightarrow x\) which contributes resistance \(\sum _{i\in N}(1 - U_i(s))\).

(3) Upper bounding \(\gamma (x^\star )\), \(x^\star \in C^\star :\)

$$\begin{aligned} \gamma (x^\star ) \le |C^0{\setminus } C^\star |c + 2c\left( |C^\star |-1\right) + \sum _{i\in N}(1 - U_i(s^\star )), \end{aligned}$$

This follows from an identical argument to the previous bound. \(\square \)
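
Before completing the proof, a toy evaluation of the formulas in Lemma 3 illustrates why the welfare-maximizing CCE state has the smallest potential; all numbers (c, the class sizes, and the expected utility sums) are invented.

```python
# Hedged sketch: evaluating the stochastic potentials of Lemma 3 on made-up
# numbers to illustrate why the welfare-maximizing CCE state minimizes gamma.
n, c = 2, 2                          # utilities in [0, 1]; c >= n
k_other, k_star = 3, 2               # |C^0 \ C^*| and |C^*|

gamma_D0 = c * k_other + 2 * c * k_star

def gamma_x(sum_U):                  # x in C^0 \ C^*, sum_U = sum_i U_i(s)
    return (k_other - 1) * c + 2 * c * k_star + (n - sum_U)

def gamma_xstar(sum_U):              # x* in C^*, sum_U = sum_i U_i(s*)
    return k_other * c + 2 * c * (k_star - 1) + (n - sum_U)

print(gamma_D0)                      # 14
print(gamma_x(sum_U=1.2))            # 12.8  (some x in C^0 \ C^*)
print(gamma_xstar(sum_U=1.8))        # 10.2  <- efficient CCE state wins
```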

We now use Lemma 3 to complete the proof of Theorem 1. For the first part, suppose \(C^{\star }\) is non-empty, and let

$$\begin{aligned} x^\star \in {{\mathrm{arg\,max}}}_{x\in C^\star } \sum _{i\in N} U_i(s), \end{aligned}$$

where joint strategy s corresponds to state x. Then,

$$\begin{aligned} \gamma (x^\star )&= |C^0{\setminus } C^\star |c + 2c\left( |C^\star |-1\right) + \sum _{i\in N}(1 - U_i(s^\star ))\\&< |C^0{\setminus } C^\star |c + 2c|C^\star | \quad \text {(since }c\ge n) \\&= \gamma (D^0). \end{aligned}$$

For \(x \in C^0{\setminus } C^\star \) with corresponding joint strategy s,

$$\begin{aligned} \gamma (x^\star )&= |C^0{\setminus } C^\star |c + 2c\left( |C^\star |-1\right) + \sum _{i\in N}(1 - U_i(s^\star ))\\&< \left( |C^0{\setminus } C^\star | - 1\right) c + 2c|C^\star | + \sum _{i\in N}(1 - U_i(s)) \\&=\gamma (x). \end{aligned}$$

For \(x\in C^\star \) with \(x\notin {{\mathrm{arg\,max}}}_{x\in C^\star } \sum _{i\in N} U_i(s)\),

$$\begin{aligned} \gamma (x^\star )&= |C^0{\setminus } C^\star |c + 2c\left( |C^\star |-1\right) + \sum _{i\in N}(1 - U_i(s^\star ))\\&< |C^0{\setminus } C^\star |c + 2c\left( |C^\star |-1\right) + \sum _{i\in N}(1 - U_i(s))\\&=\gamma (x). \end{aligned}$$

Applying Theorem 3, \(x^\star \) is stochastically stable. Since all other recurrent classes have strictly larger stochastic potential, the stochastically stable states are exactly those \(x^\star \in C^\star \) with \(x^\star \in {{\mathrm{arg\,max}}}_{x\in C^\star } \sum _{i\in N} U_i(s)\). From state \(x^\star \), if each agent plays according to its baseline strategy, then the probability that joint action \(a'\in \mathcal {A}\) is played at any given time is \({\text { Pr }}(a = a^\prime ) = q^{a^\prime }(s^\star ).\) This implies that a CCE maximizing the sum of the agents' payoffs is played with high probability as \(\varepsilon \rightarrow 0\), after sufficient time has passed.

The second part of the theorem follows similarly by considering the case when \(C^\star = \emptyset .\)

\(\square \)

Cite this article

Borowski, H.P., Marden, J.R. & Shamma, J.S. Learning to Play Efficient Coarse Correlated Equilibria. Dyn Games Appl 9, 24–46 (2019). https://doi.org/10.1007/s13235-018-0244-z
