Reinforcement Learning for Non-stationary Discrete-Time Linear–Quadratic Mean-Field Games in Multiple Populations

uz Zaman, Muhammad Aneeq; Miehling, Erik; Başar, Tamer

doi:10.1007/s13235-022-00448-w

Published: 10 May 2022

Dynamic Games and Applications, Volume 13, pages 118–164 (2023)

Abstract

Scalability of reinforcement learning algorithms to multi-agent systems is a significant bottleneck to their practical use. In this paper, we approach multi-agent reinforcement learning from a mean-field game perspective, where the number of agents tends to infinity. Our analysis focuses on the structured setting of systems with linear dynamics and quadratic costs, named linear–quadratic mean-field games, evolving over a discrete-time infinite horizon where agents are assumed to be partitioned into finitely many populations connected by a network of known structure. The functional forms of the agents’ costs and dynamics are assumed to be the same within populations, but differ between populations. We first characterize the equilibrium of the mean-field game which further prescribes an $\epsilon $-Nash equilibrium for the finite population game. Our main focus is on the design of a learning algorithm, based on zero-order stochastic optimization, for computing mean-field equilibria. The algorithm exploits the affine structure of both the equilibrium controller and equilibrium mean-field trajectory by decomposing the learning task into first learning the linear terms and then learning the affine terms. We present a convergence proof and a finite-sample bound quantifying the estimation error as a function of the number of samples.

Data Availability Statement:

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Notes

See Moon and Başar, [23] for a justification of this assumption.
In other words, equilibrium mean-field trajectories are allowed to vary in time; see definition (A3) in Subramanian and Mahajan, [26].
The notation for $\pi $ has been overwritten from Sect. 2 as the network case will be the topic of the remainder of the paper.
If populations l and k are disconnected then $C_Z^{lk}=0$.
The zero-order stochastic optimization algorithm requires access to a finite length mean-field trajectory. This rollout length (also called the truncation length) is tied to the accuracy of the ZSO algorithm. For a stable system (which is the case in this paper) the rollout length is $\mathcal {O}(\log (1/{\bar{\delta }}))$, where ${\bar{\delta }}$ is the accuracy of cost estimate $J_l$. The reader is referred to [22], Section 2.2.2, for a detailed explanation of this ZSO-based truncation scheme.
This is in contrast to earlier work [13] where only the affine terms needed update due to the assumed stationarity of the MFE.

References

1. Achdou Y, Dao M-K, Ley O, Tchou N (2020) Finite horizon mean field games on networks. Calc Var Partial Differ Equ 59(5):1–34
2. Anahtarcı B, Karıksız CD, Saldi N (2019) Fitted Q-learning in mean-field games. arXiv preprint arXiv:1912.13309
3. Bauso D (2017) Consensus via multi-population robust mean-field games. Syst Control Lett 107:76–83
4. Bauso D, Tembine H, Başar T (2016) Opinion dynamics in social networks through mean-field games. SIAM J Control Optim 54(6):3225–3257
5. Bensoussan A, Sung K, Yam SCP, Yung S-P (2016) Linear-quadratic mean field games. J Optim Theory Appl 169(2):496–529
6. Bryson AE, Ho Y-C (1975) Applied optimal control, revised printing. Hemisphere, New York
7. Caines PE, Huang M (2019) Graphon mean field games and the GMFG equations: $\varepsilon $-Nash equilibria. In: 2019 IEEE 58th conference on decision and control (CDC), pp 286–292. IEEE
8. Camilli F, Marchi C (2016) Stationary mean field games systems defined on networks. SIAM J Control Optim 54(2):1085–1103
9. Carmona R, Delarue F (2018) Probabilistic theory of mean field games with applications I. Springer, Cham
10. Delarue F (2017) Mean field games: a toy model on an Erdös–Renyi graph. ESAIM Proc Surv 60:1–26
11. Elie R, Pérolat J, Laurière M, Geist M, Pietquin O (2019) Approximate fictitious play for mean field games. arXiv preprint arXiv:1907.02633
12. Fazel M, Ge R, Kakade SM, Mesbahi M (2018) Global convergence of policy gradient methods for the linear quadratic regulator. In: International conference on machine learning, pp 1467–1476
13. Fu Z, Yang Z, Chen Y, Wang Z (2020) Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games. In: International conference on learning representations
14. Gao S, Caines PE, Huang M (2020) LQG graphon mean field games. arXiv preprint arXiv:2004.00679
15. Gu D (2007) A differential game approach to formation control. IEEE Trans Control Syst Technol 16(1):85–93
16. Guo X, Hu A, Xu R, Zhang J (2019) Learning mean-field games. In: Advances in neural information processing systems
17. Huang M, Zhou M (2018) Linear quadratic mean field games–Part I: the asymptotic solvability problem. arXiv preprint arXiv:1811.00522
18. Huang M, Malhamé RP, Caines PE et al (2006) Large population stochastic dynamic games: Closed-loop Mckean-Vlasov systems and the Nash certainty equivalence principle. Commun Inf Syst 6(3):221–252
19. Huang M, Caines PE, Malhamé RP (2007) Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized $\varepsilon $-Nash equilibria. IEEE Trans Autom Control 52(9):1560–1571
20. Lasry J-M, Lions P-L (2007) Mean field games. Jpn J Math 2(1):229–260
21. Lewis FL, Zhang H, Hengster-Movric K, Das A (2013) Cooperative control of multi-agent systems: optimal and adaptive design approaches. Springer, Berlin
22. Malik D, Pananjady A, Bhatia K, Khamaru K, Bartlett P, Wainwright M (2019) Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In: The 22nd international conference on artificial intelligence and statistics, pp 2916–2925. PMLR
23. Moon J, Başar T (2014) Discrete-time LQG mean field games with unreliable communication. In: 53rd IEEE conference on decision and control, pp 2697–2702. IEEE
24. Saldi N, Başar T, Raginsky M (2018) Markov-Nash equilibria in mean-field games with discounted cost. SIAM J Control Optim 56(6):4256–4287
25. Spall JC (2005) Introduction to stochastic search and optimization: estimation, simulation, and control, vol 65. Wiley, Hoboken
26. Subramanian J, Mahajan A (2019) Reinforcement learning in stationary mean-field games. In: International conference on autonomous agents and multiagent systems, pp 251–259
27. Yang Z, Chen Y, Hong M, Wang Z (2019) Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In: Advances in neural information processing systems, pp 8351–8363
28. Zaman MAu, Zhang K, Miehling E, Başar T (2020a) Approximate equilibrium computation for discrete-time linear-quadratic mean-field games. In: 2020 American control conference (ACC), pp 333–339. IEEE
29. Zaman MAu, Zhang K, Miehling E, Başar T (2020b) Reinforcement learning in non-stationary discrete-time linear-quadratic mean-field games. In: 2020 59th IEEE conference on decision and control (CDC), pp 2278–2284. IEEE
30. Zeng Y, Wu Q, Zhang R (2019) Accessing from the sky: a tutorial on UAV communications for 5G and beyond. Proc IEEE 107(12):2327–2375
31. Zhu Q, Başar T (2011) A multi-resolution large population game framework for smart grid demand response management. In: International conference on network games, control and optimization (NetGCooP 2011), pp 1–8. IEEE

Acknowledgements

We thank the anonymous reviewers for their useful suggestions. We also thank Dr. Kaiqing Zhang for his technical expertise and useful discussions.

Author information

Authors and Affiliations

Coordinated Science Laboratory, University of Illinois at Urbana–Champaign, Urbana, IL, 61801, USA
Muhammad Aneeq uz Zaman, Erik Miehling & Tamer Başar

Corresponding author

Correspondence to Muhammad Aneeq uz Zaman.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Research leading to this work was supported in part by AFOSR Grant FA9550-19-1-0353. This article is part of the topical collection “Multi-agent Dynamic Decision Making and Learning” edited by Konstantin Avrachenkov, Vivek S. Borkar and U. Jayakrishnan Nair.

Appendices

Appendix

Proof of Proposition 1

Proof

We organize the proof of Proposition 1 in two parts. In the first part we prove the existence and uniqueness of the equilibrium mean-field $\bar{\mathsf{Z}}$. In the second part we prove the form of the equilibrium controller.

Part I: The Hamiltonian function for a generic agent in population l, given a mean-field trajectory $\bar{\mathsf{Z}}= (\bar{\mathsf{Z}}_0,\bar{\mathsf{Z}}_1,\ldots )$, is

$$\begin{aligned}&H^l_t(Z^l_t,U^l_t,\bar{\mathsf{Z}}_t,\zeta ^l_t) = \frac{1}{2} \big (\big \Vert Z^l_t \big \Vert ^2_{Q^l} + \big \Vert U^l_t \big \Vert ^2_{C_U^l} +\big \Vert Z^l_t-\bar{Z}^l_t \big \Vert ^2_{{C_Z^{ll}}} \nonumber \\&+ \sum _{k \in \mathcal {L}_l} \big \Vert Z^l_t-\bar{Z}^k_t - \beta ^{lk} \big \Vert ^2_{{C_Z^{lk}}} \big ) + (\zeta ^l_{t+1})^\top ({A^l} Z^l_t +{B^l} U^l_t + W^l_t) \end{aligned}$$

(29)

for $l \in \mathcal {L}$ where $\zeta _t^l$ is the co-state variable. Notice that we have scaled the cost function by 1/2 to simplify expressions. Recall the dynamics of a generic agent in population l,

$$\begin{aligned} Z^l_{t+1} = {A^l} Z^l_t +{B^l} U^l_t +W^l_t. \end{aligned}$$

(30)

The co-state dynamics and the optimal control are obtained using the Hamiltonian and the conditions for optimality,

$$\begin{aligned} \frac{\partial H^l_t}{\partial Z^l_t}&= {A^l}^\top \zeta ^l_{t+1} + Q^l Z^l_t + \sum _{k \in \mathcal {L}_l} {C_Z^{lk}} (Z^l_t - \bar{Z}^k_t - \beta ^{lk}) - \varDelta M^l_t= \zeta ^l_t, \nonumber \\ \frac{\partial H^l_t}{\partial U^l_t}&= {C_U^l} U^l_t + {B^l}^\top \zeta ^l_{t+1} = 0 \implies U^l_t = - (C_U^l)^{-1} {B^l}^\top \zeta ^l_{t+1}, \end{aligned}$$

(31)

where $\varDelta M^l_t$ is a Martingale difference sequence,

$$\begin{aligned} \varDelta M^l_t = {A^l}^\top \zeta ^l_{t+1} - {A^l}^\top \mathbb {E}[ \zeta ^l_{t+1} \mid \mathcal {F}^l_t] \end{aligned}$$

where $\mathcal {F}^l_t$ is the $\sigma $-algebra generated by the process $Z^l_t$ up to time t and $\beta ^{ll} = 0$. Aggregating the dynamics of the generic agent and the co-state (with the assumption that the tracking signal $\bar{Z}^l$ is also the expected trajectory of $Z^l_t$), we obtain the dynamics of the equilibrium mean-field of population l, denoted by ${\bar{Z}}^{l*}$, and of the aggregated co-state $\bar{\upzeta }^{l*}$,

$$\begin{aligned} \bar{Z}^{l*}_{t+1}&= {A^l} {\bar{Z}}^{l*}_t - {B^l} (C_U^l)^{-1}{B^l}^\top \bar{\upzeta }^{l*}_{t+1}, \nonumber \\ \bar{\upzeta }^{l*}_t&= {A^l}^\top \bar{\upzeta }^{l*}_{t+1} + Q^l {\bar{Z}}^{l*}_t + \sum _{k \in \mathcal {L}_l}{C_Z^{lk}} \big ( \bar{Z}_t^{l*} - \bar{Z}_t^{k*} - \beta ^{lk} \big ). \end{aligned}$$

(32)

Defining the mean-field trajectory of the system as $\bar{\mathsf{Z}}^*_t = (\bar{Z}^{1*}_t,\ldots ,\bar{Z}^{{L}*}_t) \in \mathbb {R}^{mL}$ and the co-state trajectory of the system as $\bar{\upzeta }^*_t = (\bar{\upzeta }^{1*}_t,\ldots ,\bar{\upzeta }^{L*}_t) \in \mathbb {R}^{mL}$, their dynamics are

$$\begin{aligned} \bar{\mathsf{Z}}^*_{t+1}&= \mathsf{A} \bar{\mathsf{Z}}^*_t - \mathsf{B} \mathsf{R}^{-1} \mathsf{B}^\top \bar{\upzeta }^*_{t+1}, \end{aligned}$$

(33)

$$\begin{aligned} \bar{\upzeta }^*_t&= \mathsf{A}^\top \bar{\upzeta }^*_{t+1} + \mathsf{Q}\bar{\mathsf{Z}}^*_t - {\hat{\upbeta }}, \end{aligned}$$

(34)

where $\mathsf{A}$ and $\mathsf{B}$ are defined in (23) and

$$\begin{aligned}&\mathsf{R}= {{\,\mathrm{diag}\,}}(C^1_U,\ldots ,C^L_U), \nonumber \\&{\hat{\upbeta }} = \begin{bmatrix} \sum _{k \in \mathcal {L}_1} C^{1k}_Z \beta ^{1k} \\ \vdots \\ \sum _{k \in \mathcal {L}_L} C^{Lk}_Z \beta ^{Lk} \end{bmatrix}, \quad \mathsf{Q}= \begin{bmatrix} Q^1 + \sum _{k \in \mathcal {L}_1 } C^{1k}_Z & \cdots & -C^{1L}_Z \\ -C^{21}_Z & \cdots & -C^{2L}_Z \\ \vdots & \ddots & \vdots \\ -C^{L1}_Z & \cdots & Q^L + \sum _{k \in \mathcal {L}_L}C^{Lk}_Z \end{bmatrix} \end{aligned}$$

(35)

To solve the set of Eqs. (33)-(34), we use the sweep method [6] and assume $\bar{\upzeta }^*_t$ has the form $\bar{\upzeta }^*_t = \mathsf{S}_t \bar{\mathsf{Z}}^*_t + \mathsf{L}_t$. Under this assumption Eq. (33) yields,

$$\begin{aligned} \bar{\mathsf{Z}}^*_{t+1}&= \mathsf{A}\bar{\mathsf{Z}}^*_t - \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top (\mathsf{S}_{t+1} \bar{\mathsf{Z}}^*_{t+1} + \mathsf{L}_{t+1}) \nonumber \\ \bar{\mathsf{Z}}^*_{t+1}&= (I + \mathsf{B} \mathsf{R}^{-1} \mathsf{B}^\top \mathsf{S}_{t+1})^{-1} \big (\mathsf{A} \bar{\mathsf{Z}}^*_t - \mathsf{B} \mathsf{R}^{-1} \mathsf{B}^\top \mathsf{L}_{t+1}\big ), \end{aligned}$$

(36)

and Eq. (34) results in

$$\begin{aligned} \mathsf{S}_t \bar{\mathsf{Z}}^*_t + \mathsf{L}_t = \mathsf{A}^\top (\mathsf{S}_{t+1} \bar{\mathsf{Z}}^*_{t+1} + \mathsf{L}_{t+1}) + \mathsf{Q}\bar{\mathsf{Z}}^*_t - {\hat{\upbeta }}. \end{aligned}$$

(37)

Substituting Eq. (36) into (37) yields

$$\begin{aligned}&\mathsf{S}_t \bar{\mathsf{Z}}^*_t + \mathsf{L}_t = \mathsf{A}^\top \mathsf{L}_{t+1} + \mathsf{Q}\bar{\mathsf{Z}}^*_t - {\hat{\upbeta }} + \nonumber \\& \mathsf{A}^\top \mathsf{S}_{t+1} (I + \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{S}_{t+1})^{-1} \big (\mathsf{A}\bar{\mathsf{Z}}^*_t - \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{L}_{t+1} \big ). \end{aligned}$$

(38)

Comparing coefficients of $\bar{\mathsf{Z}}^*_t$ yields

$$\begin{aligned} \mathsf{S}_t&= \mathsf{Q}+ \mathsf{A}^\top \mathsf{S}_{t+1} (I + \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{S}_{t+1})^{-1} \mathsf{A}\nonumber \\ \mathsf{S}_t&= \mathsf{A}^\top \mathsf{S}_{t+1} \mathsf{A}+ \mathsf{Q}- \mathsf{A}^\top \mathsf{S}_{t+1} \mathsf{B}(\mathsf{R}+ \mathsf{B}^\top \mathsf{S}_{t+1} \mathsf{B})^{-1} \mathsf{B}^\top \mathsf{S}_{t+1} \mathsf{A}\end{aligned}$$

(39)

where the last equation is obtained using the Woodbury Matrix Identity. Comparing the remaining terms, we obtain the independent backwards process

$$\begin{aligned} \mathsf{L}_t&= \mathsf{A}^\top (I - \mathsf{S}_{t+1} (I + \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{S}_{t+1})^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top ) \mathsf{L}_{t+1} - {\hat{\upbeta }} \nonumber \\&= \mathsf{A}^\top (I + \mathsf{S}_{t+1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top )^{-1} \mathsf{L}_{t+1} - {\hat{\upbeta }}. \end{aligned}$$

(40)
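The two forms of the Riccati recursion (39) and of the backward map in (40) can be checked numerically. The following is a minimal sketch, assuming randomly generated stand-ins for $\mathsf{A}$, $\mathsf{B}$, $\mathsf{R}$, $\mathsf{Q}$ and $\mathsf{S}_{t+1}$; the dimensions and variable names are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 2                                   # illustrative state and control dimensions
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, p))
G = rng.standard_normal((n, n))
S_next = G @ G.T + np.eye(n)                  # positive definite stand-in for S_{t+1}
W = rng.standard_normal((p, p))
R = W @ W.T + np.eye(p)                       # positive definite stand-in for R
Q = np.eye(n)
M = B @ np.linalg.solve(R, B.T)               # B R^{-1} B^T
I = np.eye(n)

# First and second lines of (39): they should agree by the Woodbury identity.
S_form1 = Q + A.T @ S_next @ np.linalg.solve(I + M @ S_next, A)
S_form2 = (A.T @ S_next @ A + Q
           - A.T @ S_next @ B @ np.linalg.solve(R + B.T @ S_next @ B, B.T @ S_next @ A))

# First and second lines of (40): the matrices multiplying L_{t+1}.
L_map1 = A.T @ (I - S_next @ np.linalg.solve(I + M @ S_next, M))
L_map2 = A.T @ np.linalg.inv(I + S_next @ M)

riccati_gap = np.max(np.abs(S_form1 - S_form2))
offset_gap = np.max(np.abs(L_map1 - L_map2))
print(riccati_gap, offset_gap)
```

Both gaps should be at the level of floating-point round-off, which is what the Woodbury Matrix Identity predicts.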

For the infinite horizon case, consider the limiting Algebraic Riccati Equation,

$$\begin{aligned} \mathsf{S}=&\mathsf{A}^\top \mathsf{S}\mathsf{A}- \mathsf{A}^\top \mathsf{S}\mathsf{B}(\mathsf{R}+ \mathsf{B}^\top \mathsf{S}\mathsf{B})^{-1} \mathsf{B}^\top \mathsf{S}\mathsf{A}+ \mathsf{Q}. \end{aligned}$$

(41)

If a unique solution to the above ARE exists and the matrix $\mathsf{A}^\top (I + \mathsf{S}\mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top )^{-1}$ is stable, then the limiting backwards Eq. (40) becomes

$$\begin{aligned} \mathsf{L}=&\mathsf{A}^\top (I + \mathsf{S}\mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top )^{-1} \mathsf{L}- {\hat{\upbeta }}. \end{aligned}$$

(42)

If the Riccati Eq. (41) admits a unique positive definite solution $\mathsf{S}$, then the MFE is unique and follows the affine dynamics

$$\begin{aligned} \bar{\mathsf{Z}}^*_{t+1}&= (I + \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{S})^{-1} \big (\mathsf{A}\bar{\mathsf{Z}}^*_t - \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{L}\big ) \nonumber \\&= \mathsf{F}^* \bar{\mathsf{Z}}^*_t + \mathsf{C}^* \end{aligned}$$

(43)

where $ \mathsf{F}^* = (I + \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{S})^{-1} \mathsf{A}$ and $\mathsf{C}^* = - (I + \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{S})^{-1}\mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \mathsf{L}$.
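As a rough illustration of how (41)–(43) fit together, the sketch below iterates the Riccati recursion (39) to a fixed point, solves the limiting linear equation for $\mathsf{L}$, and forms $\mathsf{F}^*$ and $\mathsf{C}^*$. The function name `mean_field_equilibrium`, the iteration count, and all numerical values are assumptions made for this example only; it is not the learning algorithm of the paper, which does not assume knowledge of the model.

```python
import numpy as np

def mean_field_equilibrium(A, B, R, Q, beta_hat, iters=2000):
    """Fixed-point sketch for (41)-(43); name and iteration count are illustrative."""
    n = A.shape[0]
    I = np.eye(n)
    M = B @ np.linalg.solve(R, B.T)           # B R^{-1} B^T
    S = Q.copy()
    for _ in range(iters):                    # iterate the Riccati recursion (39)
        S = Q + A.T @ S @ np.linalg.solve(I + M @ S, A)
    T = A.T @ np.linalg.inv(I + S @ M)        # matrix multiplying L in the limiting equation
    L = np.linalg.solve(I - T, -beta_hat)     # solves L = T L - beta_hat
    E_inv = np.linalg.inv(I + M @ S)
    F = E_inv @ A                             # F^* in (43)
    C = -E_inv @ M @ L                        # C^* in (43)
    return S, L, F, C

# Toy two-dimensional system; every number below is made up.
A = np.array([[1.1, 0.2], [0.0, 0.9]])
B = np.eye(2)
R = np.eye(2)
Q = np.array([[2.0, -0.5], [-0.5, 2.0]])
beta_hat = np.array([0.3, -0.1])

S, L, F, C = mean_field_equilibrium(A, B, R, Q, beta_hat)
riccati_residual = np.max(np.abs(
    S - (A.T @ S @ A + Q - A.T @ S @ B @ np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A))))
spectral_radius = np.max(np.abs(np.linalg.eigvals(F)))
z_inf = np.linalg.solve(np.eye(2) - F, C)     # limit point of z_{t+1} = F z_t + C
```

In this toy instance the spectral radius of `F` is below one, so the trajectory generated by $\mathsf{F}^*$ and $\mathsf{C}^*$ converges to `z_inf`.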

Now we prove that the ARE (41) has a unique positive definite solution. We split up the matrix $\mathsf{Q}= \mathsf{Q}_1 + \mathsf{Q}_2$ such that,

$$\begin{aligned} \mathsf{Q}_1&= {{\,\mathrm{diag}\,}}(Q^1,\ldots ,Q^L), \end{aligned}$$

(44)

$$\begin{aligned} \mathsf{Q}_2&= \begin{bmatrix} \sum _{k \in \mathcal {L}_1 } C^{1k}_Z &{} \cdots &{} -C^{1L}_Z \\ \vdots &{} \ddots &{} \vdots \\ -C^{L1}_Z &{} \cdots &{} \sum _{k \in \mathcal {L}_L}C^{Lk}_Z \end{bmatrix}. \end{aligned}$$

(45)

Both matrices $\mathsf{Q}_1$ and $\mathsf{Q}_2$ are symmetric positive semi-definite, since $Q^l$ and ${C_Z^{lk}} = C^{kl}_Z$ are symmetric positive semi-definite matrices.

As a consequence of the observability of the pairs $({A^l},(Q^l)^{1/2})$, the pair $(\mathsf{A},\mathsf{Q}^{1/2}_1)$ is also observable. Hence, for any eigenvector x of $\mathsf{A}$, $x^\top \mathsf{Q}_1 x > 0$. For such a vector,

$$\begin{aligned} x^\top (\mathsf{Q}_1 + \mathsf{Q}_2)x > 0 \end{aligned}$$

(46)

since $\mathsf{Q}_2 \ge 0$. This in turn implies that pair

$$\begin{aligned} \big (\mathsf{A},(\mathsf{Q}_1 + \mathsf{Q}_2)^{1/2} \big ) = \big (\mathsf{A},{\mathsf{Q}}^{1/2} \big ) \end{aligned}$$

(47)

is also observable. Finally, ${C_U^l}> 0$ for all $l \in [L]$ implies $\mathsf{R}> 0$, and $({A^l},{B^l})$ being controllable for all $l \in [L]$ implies that $(\mathsf{A},\mathsf{B})$ is controllable. These conditions are sufficient for the existence and uniqueness of the solution to the Riccati Eq. (41). Moreover, $\mathsf{F}^*$ is stable, so the matrix in Eq. (42), $\mathsf{A}^\top (I + \mathsf{S}\mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top )^{-1} = \mathsf{F}^{*\top }$, is also stable; hence, a unique $\mathsf{L}$ satisfies the limiting Eq. (42). Therefore, by Theorem 3.34 in Carmona and Delarue [9], there exists a unique equilibrium mean-field trajectory given by (43).
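The observability step above can be illustrated with the standard Kalman rank test. The sketch below is only an illustration under made-up matrices: `Q1` plays the role of $\mathsf{Q}_1$, `Q2` the role of the positive semi-definite coupling term $\mathsf{Q}_2$, and the helper names are hypothetical.

```python
import numpy as np

def observability_rank(A, C):
    """Rank of the Kalman observability matrix [C; CA; ...; CA^{n-1}]."""
    n = A.shape[0]
    blocks = [C]
    for _ in range(n - 1):
        blocks.append(blocks[-1] @ A)
    return np.linalg.matrix_rank(np.vstack(blocks))

def psd_sqrt(Q):
    """Symmetric square root of a positive semi-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(Q)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

A = np.array([[1.0, 1.0], [0.0, 1.0]])
Q1 = np.diag([1.0, 0.0])                      # singular, yet (A, Q1^{1/2}) is observable
Q2 = np.array([[1.0, -1.0], [-1.0, 1.0]])     # positive semi-definite coupling block

rank_Q1 = observability_rank(A, psd_sqrt(Q1))
rank_Q = observability_rank(A, psd_sqrt(Q1 + Q2))
```

Both ranks equal the state dimension in this example, consistent with the claim that adding a positive semi-definite term does not destroy observability.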

Part II: To prove existence and uniqueness (and characterization) of the equilibrium controller, we formulate the problem (12)-(13) as a tracking control problem with the mean-field trajectory $\bar{\mathsf{Z}}^*$ given. Restating the conditions of optimality (31),

$$\begin{aligned} \frac{\partial H^l_t}{\partial Z^l_t}&= {A^l}^\top \zeta ^l_{t+1} + Q^l Z^l_t + \sum _{k \in \mathcal {L}_l} {C_Z^{lk}} (Z^l_t - \bar{Z}^k_t - \beta ^{lk}) - \varDelta M^l_t= \zeta ^l_t, \nonumber \\ \frac{\partial H^l_t}{\partial U^l_t}&= {C_U^l} U^l_t + {B^l}^\top \zeta ^l_{t+1} = 0 \implies U^l_t = - (C_U^l)^{-1} {B^l}^\top \zeta ^l_{t+1} \end{aligned}$$

(48)

where $\varDelta M^l_t$ is a Martingale difference sequence,

$$\begin{aligned} \varDelta M^l_t = {A^l}^\top \zeta ^l_{t+1} - \mathbb {E}[{A^l}^\top \zeta ^l_{t+1} \mid \mathcal {F}^l_t] \end{aligned}$$

where $\mathcal {F}^l_t$ is the $\sigma $-algebra generated by process $Z^l_t$ up to time t and $\beta ^{ll} = 0$. Assuming the form of the co-state $\zeta ^l_t = P^l_t Z^l_t + s^l_t$ and substituting into the equations we obtain

$$\begin{aligned}&Z^l_{t+1} = (I + {B^l} (C_U^l)^{-1} {B^l}^\top P^l_{t+1})^{-1} ({A^l} Z^l_t - {B^l} (C_U^l)^{-1} {B^l}^\top s^l_{t+1}), \\&P^l_t Z^l_t + s^l_t = {A^l}^\top (P^l_{t+1} \mathbb {E}[Z^l_{t+1} \mid \mathcal {F}^l_t] + s^l_{t+1}) + Q^l Z^l_t + \sum _{k \in \mathcal {L}_l} {C_Z^{lk}} (Z^l_t - \bar{Z}^{k*}_t - \beta ^{lk}). \end{aligned}$$

Substituting the first equation into the second yields

$$\begin{aligned} P^l_t Z^l_t + s^l_t =&{A^l}^\top (P^l_{t+1} (I + {B^l} (C_U^l)^{-1} {B^l}^\top P^l_{t+1})^{-1} ({A^l} Z^l_t - {B^l} (C_U^l)^{-1} {B^l}^\top s^l_{t+1}) \\ {}&+ s^l_{t+1}) + Q^l Z^l_t + \sum _{k \in \mathcal {L}_l} {C_Z^{lk}} (Z^l_t - \bar{Z}^{k*}_t - \beta ^{lk}). \end{aligned}$$

Comparing coefficients of $Z^l_t$ yields the Riccati equation

$$\begin{aligned} P^l_t = {A^l}^\top P^l_{t+1} (I + {B^l} (C_U^l)^{-1} {B^l}^\top P^l_{t+1})^{-1} {A^l} + Q^l + \sum _{k \in \mathcal {L}_l} {C_Z^{lk}} \end{aligned}$$

(49)

and comparing the remaining terms yields a backwards recursive expression for $s^l_t$,

$$\begin{aligned} s^l_t =&{A^l}^\top \big (I - P^l_{t+1} (I + {B^l} (C_U^l)^{-1} {B^l}^\top P^l_{t+1})^{-1}{B^l} (C_U^l)^{-1} {B^l}^\top \big )s^l_{t+1} \nonumber \\&- \sum _{k \in \mathcal {L}_l} {C_Z^{lk}} (\bar{Z}^{k*}_t + \beta ^{lk}). \end{aligned}$$

(50)

The infinite horizon Riccati equation is

$$\begin{aligned} P^l = {A^l}^\top P^l (I + {B^l} (C_U^l)^{-1} {B^l}^\top P^l)^{-1} {A^l} + Q^l + \sum _{k \in \mathcal {L}_l} {C_Z^{lk}}. \end{aligned}$$

(51)

Given the above, if the pair $({A^l}, (Q^l + \sum _{k \in \mathcal {L}_l} {C_Z^{lk}})^{1/2})$ is observable, then the Riccati equation has a unique positive definite solution $P^l$. This observability follows from that of the pair $({A^l}, (Q^l)^{1/2})$: for any eigenvector x of ${A^l}$, $x^\top Q^l x > 0$, and therefore $x^\top (Q^l + \sum _{k \in \mathcal {L}_l} {C_Z^{lk}}) x > 0$ since ${C_Z^{lk}} \ge 0$. Hence, there exists a unique positive definite $P^l$ that satisfies the Riccati equation.
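The per-population Riccati equation (51) can be explored numerically in the same way. The sketch below iterates (49) until it settles and then forms the matrix $(I + B^l (C_U^l)^{-1} {B^l}^\top P^l)^{-1} A^l$ and its transpose. The function name `tracking_riccati`, the symbol `Q_total` (standing for $Q^l + \sum_k C_Z^{lk}$), and all numbers are assumptions for illustration.

```python
import numpy as np

def tracking_riccati(A, B, C_U, Q_total, iters=2000):
    """Iterate (49) toward a fixed point of (51); Q_total stands for Q^l + sum_k C_Z^{lk}."""
    n = A.shape[0]
    I = np.eye(n)
    M = B @ np.linalg.solve(C_U, B.T)         # B (C_U)^{-1} B^T
    P = Q_total.copy()
    for _ in range(iters):
        P = A.T @ P @ np.linalg.solve(I + M @ P, A) + Q_total
    E = I + M @ P
    H = np.linalg.solve(E, A).T               # transpose of the closed-loop matrix E^{-1} A
    return P, H

# Made-up controllable pair with an unstable open-loop mode.
A = np.array([[1.2, 0.5], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C_U = np.array([[1.0]])
Q_total = np.eye(2)

P, H = tracking_riccati(A, B, C_U, Q_total)
M = B @ np.linalg.solve(C_U, B.T)
residual = np.max(np.abs(P - (A.T @ P @ np.linalg.solve(np.eye(2) + M @ P, A) + Q_total)))
min_eig_P = np.min(np.linalg.eigvalsh((P + P.T) / 2))
rho_H = np.max(np.abs(np.linalg.eigvals(H)))
```

Here `min_eig_P` is positive and `rho_H` is below one, matching the positive definiteness and stability properties used in the proof.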

Now we characterize the form of the equilibrium control law. Using (50) we obtain

$$\begin{aligned} s^l_t = - \sum _{k \in [L]} {C_Z^{lk}} (\bar{Z}^{k*}_t + \beta ^{lk}) + {H^l} s^l_{t+1} \end{aligned}$$

(52)

where

$$\begin{aligned} {H^l} = ((E^l)^{-1} {A^l})^\top \text { and } E^l = I + {B^l} (C_U^l)^{-1} {B^l}^\top {P^l} \end{aligned}$$

(53)

where ${P^l}$ is the solution to the Riccati Eq. (51) and $\bar{Z}^{k*}$ represents the k-th population’s mean-field trajectory from $\bar{\mathsf{Z}}^{*}$. The boundedness of the sequence $s^l_t$ depends on the matrix $H^l$ being stable. The matrix $(H^l)^\top = (E^l)^{-1} {A^l}$ is the closed-loop system matrix of the LQR system $(A^l,B^l,Q^l + \sum _k C_Z^{lk}, C_U^l)$. It is stable because the Riccati equation (51) of this LQR system has a unique positive definite solution. As a result, the matrix $H^l$ is also stable and hence the sequence $s^l_t$ is bounded. This yields the existence and uniqueness of the equilibrium controller. The closed-loop dynamics of a generic agent in population l are thus

$$\begin{aligned} Z^l_{t+1}&= {H^l}^\top Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top s^l_{t+1}, \nonumber \\&= {H^l}^\top Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \sum _{k \in [L]} {C_Z^{lk}} (\bar{Z}^{k*}_{t+i+1} + \beta ^{lk}), \nonumber \\&= {H^l}^\top Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\bar{\mathsf{Z}}^{*}_{t+i+1} + \upbeta ^l) \end{aligned}$$

where $\mathsf{C}^l_Z = (C^{l1}_Z,\ldots ,C^{lL}_Z)$ and $\upbeta ^l = (\beta ^{l1},\ldots , \beta ^{lL}) \in \mathbb {R}^{mL}$. Since $\bar{\mathsf{Z}}^{*}$ follows the affine dynamics $\bar{\mathsf{Z}}^{*}_{t+1} = \mathsf{F}^* \bar{\mathsf{Z}}^{*}_t + \mathsf{C}^{*}$, the above can be simplified further to

$$\begin{aligned}&Z^l_{t+1}= {H^l}^\top Z^l_t \nonumber \\&- (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z ((\mathsf{F}^*)^{i+1}\bar{\mathsf{Z}}^{*}_{t} + (I - (\mathsf{F}^*)^{i+1})(I-\mathsf{F}^*)^{-1} \mathsf{C}^{*} + \upbeta ^l). \end{aligned}$$
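The simplification relies on the closed form of an affine recursion: applying $z \mapsto \mathsf{F} z + \mathsf{C}$ exactly n times gives $\mathsf{F}^n z + (I - \mathsf{F}^n)(I - \mathsf{F})^{-1} \mathsf{C}$ whenever $I - \mathsf{F}$ is invertible. A small numerical check, with made-up stand-ins `F`, `C` and `z0` for $\mathsf{F}^*$, $\mathsf{C}^*$ and $\bar{\mathsf{Z}}^*_t$, might look like this:

```python
import numpy as np

F = np.array([[0.5, 0.1], [0.0, 0.3]])        # stable stand-in for F^*
C = np.array([1.0, -2.0])                     # stand-in for C^*
z0 = np.array([0.7, 0.2])                     # stand-in for the mean field at time t
I = np.eye(2)

def iterate(z, n):
    """Apply z -> F z + C exactly n times."""
    for _ in range(n):
        z = F @ z + C
    return z

def closed_form(z, n):
    """F^n z + (I - F^n)(I - F)^{-1} C, the geometric-sum form of n affine steps."""
    Fn = np.linalg.matrix_power(F, n)
    return Fn @ z + (I - Fn) @ np.linalg.solve(I - F, C)

max_gap = max(np.max(np.abs(iterate(z0, n) - closed_form(z0, n))) for n in range(8))
```

The mean field at time $t+i+1$ corresponds to $n = i+1$ steps from time t.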

Rewriting in terms of the controller $(K^*_{l,1},K^*_{l,2})$,

$$\begin{aligned} Z^l_{t+1} = {A^l} Z^l_t - {B^l} K^*_{l,1} \begin{bmatrix} Z^l_t \\ \bar{\mathsf{Z}}^*_t \end{bmatrix} - {B^l} K^*_{l,2} \end{aligned}$$

(54)

where

$$\begin{aligned} K^*_{l,1} = \begin{bmatrix} G^l {A^l}& (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\mathsf{F}^*)^{i+1} \end{bmatrix} \end{aligned}$$

(55)

where $G^l = ({C_U^l} + {B^l}^\top {P^l} {B^l})^{-1} {B^l}^\top {P^l}$ and

$$\begin{aligned} K^*_{l,2} = (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^{i} \mathsf{C}^l_Z ((I - (\mathsf{F}^*)^{i+1})(I - \mathsf{F}^*)^{-1} \mathsf{C}^{*} + \upbeta ^l) \end{aligned}$$

(56)

which completes the proof. $\square $

Proof of Theorem 2

Proof

The following investigates the dependence of the $\epsilon $-Nash bound $J^\mathsf{(N)}_{n,l}({\tilde{\phi }}) - \inf _{\pi ^n \in \varPi ^n} J^\mathsf{(N)}_{n,l}((\pi ^{n,l},{\tilde{\phi }}^{-n,l}),{\tilde{\phi }}^{-l})$ on $\mathsf{N} = (N_l)_{l \in [L]}$. We begin by writing the above quantity as

$$\begin{aligned}&J^\mathsf{(N)}_{n,l}({\tilde{\phi }}) - \inf _{\pi ^n \in \varPi ^n} J^\mathsf{(N)}_{n,l}((\pi ^{n,l},{\tilde{\phi }}^{-n,l}),{\tilde{\phi }}^{-l}) = \nonumber \\&\quad J^\mathsf{(N)}_{n,l}({\tilde{\phi }}) - J_l(\phi ^{l*}, \bar{\mathsf{Z}}^*) + J_l(\phi ^{l*}, \bar{\mathsf{Z}}^*) \nonumber \\&\quad - \inf _{\pi ^n \in \varPi ^n} J^\mathsf{(N)}_{n,l}((\pi ^{n,l},{\tilde{\phi }}^{-n,l}),{\tilde{\phi }}^{-l}). \end{aligned}$$

(57)

The first expression on the RHS of equation (57) can be bounded as follows

$$\begin{aligned} J^\mathsf{(N)}_{n,l}({\tilde{\phi }})&\le \limsup _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}\bigg [ \Vert Z^{l*}_t \Vert _{Q^l}^2 + \Vert U^{l*}_t \Vert _{C_U^l}^2 + \sum _{k \in \mathcal {L}_l} \big ( \Vert Z^{l*}_t - \bar{Z}^{k*}_t - \beta ^{lk} \Vert _{{C_Z^{lk}}}^2 \nonumber \\& + \Vert \bar{Z}^{k*}_t - Y^{k*}_t \Vert _{{C_Z^{lk}}}^2 + 2 ( Z^{l*}_t - \bar{Z}^{k*}_t - \beta ^{lk} )^\top {C_Z^{lk}} ( \bar{Z}^{k*}_t - Y^{k*}_t ) \big )\bigg ] \nonumber \\&= J_l(\phi ^{l*}, \bar{\mathsf{Z}}^*) + \limsup _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \sum _{k \in \mathcal {L}_l} \mathbb {E}\big [ \Vert \bar{Z}^{k*}_t - Y^{k*}_t \Vert _{{C_Z^{lk}}}^2 \nonumber \\& + 2 ( Z^{l*}_t - \bar{Z}^{k*}_t - \beta ^{lk} )^\top {C_Z^{lk}} ( \bar{Z}^{k*}_t - Y^{k*}_t ) \big ] \nonumber \\&\le J_l(\phi ^{l*}, \bar{\mathsf{Z}}^*) + \mathcal {O}\Big ( \sum _{k \in \mathcal {L}_l} \sqrt{\limsup _{T \rightarrow \infty } \varepsilon ^k_T} \Big ) \end{aligned}$$

(58)

where $\beta ^{ll} = 0$ and $Y^{l*}_t$, $Y_t^{k*}$ are the empirical mean-field trajectories

$$\begin{aligned} Y^{l*}_t = \frac{1}{N_l - 1} \sum _{\begin{array}{c} n'\in [N_l]\\ n'\ne n \end{array}} Z^{n',l}_t, Y^{k*}_t = \frac{1}{N_k} \sum _{n' \in [N_k]} Z^{n',k}_t \end{aligned}$$

(59)

of populations l and $k \in \mathcal {L}_l \setminus \{l\}$, respectively, under equilibrium controller ${\tilde{\phi }}$, and

$$\begin{aligned} \varepsilon ^k_T = \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}\Vert \bar{Z}^{k*}_t - Y^{k*}_t \Vert _{{C_Z^{lk}}}^2 \end{aligned}$$

(60)

for $k \in \mathcal {L}_l$. The last inequality in (58) is due to the fact that $(\phi ^{l*})_{l \in [L]}$ are stabilizing control laws and $\bar{\mathsf{Z}}^*$ is also stable. Using techniques similar to the proof of Theorem 2 in Zaman et al. [29], we now bound $\sum _{k \in \mathcal {L}_l} \sqrt{\limsup _{T \rightarrow \infty } \varepsilon ^k_T}$. The dynamics of the empirical mean-field trajectory $Y^{k*}_t$ can be expressed using (59), (9), and the form of the equilibrium control law from Proposition 1,

$$\begin{aligned} Y^{k*}_{t+1} = (A^k - B^k K^{1*}_{k,1}) Y^{k*}_t - B^k K^{2*}_{k,1} \bar{\mathsf{Z}}^*_t - B^k K^*_{k,2} + {\hat{\omega ^k_t}} \end{aligned}$$

where ${\hat{\omega ^k_t}} := \sum _{n' \in [N_k]} W^{n',k}_t/N_k$ is a Gaussian random variable with zero mean and covariance $\varSigma ^k_w/N_k$. The covariance matrix for the stationary distribution of $Y^{k*}_t$, denoted by ${\hat{\sigma ^k}}$, is the solution to the Lyapunov equation

$$\begin{aligned} {\hat{\sigma ^k}} = \varSigma ^k_w/N_k + (A^k - B^k K^{1*}_{k,1}) {\hat{\sigma ^k}} (A^k - B^k K^{1*}_{k,1})^\top \end{aligned}$$

hence

$$\begin{aligned} {{\,\mathrm{Tr}\,}}({\hat{\sigma ^k}}) = \mathcal {O}(1/N_k). \end{aligned}$$

(61)
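The $\mathcal {O}(1/N_k)$ scaling in (61) follows from the linearity of the Lyapunov equation in its constant term. The sketch below solves the equation by vectorization for several population sizes; `A_cl` is a made-up stable stand-in for $A^k - B^k K^{1*}_{k,1}$, `Sigma_w` a made-up noise covariance, and the function name is hypothetical.

```python
import numpy as np

def stationary_covariance(A_cl, Sigma_w, N):
    """Solve sigma = Sigma_w / N + A_cl sigma A_cl^T via row-major vectorization."""
    n = A_cl.shape[0]
    lhs = np.eye(n * n) - np.kron(A_cl, A_cl)
    rhs = (Sigma_w / N).reshape(-1)
    return np.linalg.solve(lhs, rhs).reshape(n, n)

A_cl = np.array([[0.6, 0.2], [0.0, 0.5]])     # stable stand-in for the closed-loop matrix
Sigma_w = np.array([[1.0, 0.3], [0.3, 2.0]])  # stand-in for the noise covariance

traces = {N: np.trace(stationary_covariance(A_cl, Sigma_w, N)) for N in (10, 100, 1000)}
scaled = {N: N * tr for N, tr in traces.items()}  # should be the same for every N
```

The products `N * trace` coincide across population sizes, so the trace decays exactly like $1/N_k$ in this linear setting.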

Next, we define $\mathsf{Y}^*_t := (Y^{1*}_t,\ldots ,Y^{L*}_t) \in \mathbb {R}^{mL}$ as the joint empirical mean-field trajectory under the equilibrium controller, with dynamics

$$\begin{aligned} \mathsf{Y}^*_{t+1} = (\mathsf{A}- \mathsf{B}\mathsf{K}^{1*}_1) \mathsf{Y}^*_t - \mathsf{B}\mathsf{K}^{2*}_1 \bar{\mathsf{Z}}^*_t - \mathsf{B}\mathsf{K}^*_2 + {\hat{\upomega _t}} \end{aligned}$$

where $\mathsf{A}, \mathsf{B}$ are defined in (23),

$$\begin{aligned} \mathsf{K}^{1*}_1 = {{\,\mathrm{diag}\,}}(K^{1*}_{1,1},\ldots , K^{1*}_{L,1}), \mathsf{K}^{2*}_1 = \begin{bmatrix} K^{2*}_{1,1} \\ \vdots \\ K^{2*}_{L,1} \end{bmatrix}, \mathsf{K}^{*}_{2} = \begin{bmatrix} K^{*}_{1,2} \\ \vdots \\ K^{*}_{L,2} \end{bmatrix} \end{aligned}$$

and ${\hat{\upomega _t}} = ({\hat{\omega ^1_t}}, \ldots , {\hat{\omega ^L_t}}) \in \mathbb {R}^{mL}$. Note that $\mathsf{F}^* = \mathsf{A}- \mathsf{B}(\mathsf{K}^{1*}_1 + \mathsf{K}^{2*}_1)$ and $\mathsf{C}^* = - \mathsf{B}\mathsf{K}^*_2$ since the equilibrium mean-field trajectory is generated by the equilibrium mean-field controller. Consequently, the stationary distribution of $\mathsf{Y}^*_t$ has mean $\bar{\mathsf{Z}}^*_{\infty }$, where $\bar{\mathsf{Z}}^*_{\infty }:=\lim _{t\rightarrow \infty }\bar{\mathsf{Z}}^*_t$. As a result, $\mathbb {E}[\mathsf{Y}^*_t] - \bar{\mathsf{Z}}^*_t \rightarrow 0$ as $t \rightarrow \infty $, which implies

$$\begin{aligned} \mathbb {E}[Y^{k*}_t] - \bar{Z}^{k*}_t \rightarrow 0, \end{aligned}$$

(62)

for all $k\in [L]$. Using (62) and (61),

$$\begin{aligned} \limsup _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}\Vert \bar{Z}^{k*}_t - Y^{k*}_t \Vert _{{C_Z^{lk}}}^2 = \mathbb {E}_{Y^{k*}_t \sim \mathcal {N}(\bar{Z}^{k*}_{\infty },{\hat{\sigma ^k}})} \Vert \bar{Z}^{k*}_t - Y^{k*}_t \Vert _{{C_Z^{lk}}}^2 = \mathcal {O}\Big (\frac{1}{N_k} \Big ) \end{aligned}$$

and so

$$\begin{aligned} \sum _{k \in \mathcal {L}_l} \sqrt{\limsup _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}\Vert \bar{Z}^{k*}_t - Y^{k*}_t \Vert _{{C_Z^{lk}}}^2} = \mathcal {O}\Big (1/\sqrt{\min _{k \in \mathcal {L}_l} N_k} \Big ). \end{aligned}$$

Hence, we have the first inequality,

$$\begin{aligned} J^\mathsf{(N)}_{n,l}({\tilde{\phi }}) - J_l(\phi ^{l*}, \bar{\mathsf{Z}}^*) = \mathcal {O}\bigg (1 / \sqrt{\min _{k \in \mathcal {L}_l} N_k} \bigg ) \end{aligned}$$

Next, for the second term in (57), we denote the trajectory of agent n in population l which minimizes the following cost by $Z^{n,l}_t$,

$$\begin{aligned}&\inf _{\pi ^n \in \varPi ^n} J^\mathsf{(N)}_{n,l} ((\pi ^{n,l},{\tilde{\phi }}^{-n,l}),{\tilde{\phi }}^{-l}) \\&\quad = \limsup _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}\big [ \Vert Z^{n,l}_t \Vert _{Q^l}^2 + \Vert U^{n,l}_t \Vert _{C_U^l}^2 + \sum _{k \in \mathcal {L}_l} \Vert Z^{n,l}_t - Y^k_t - \beta ^{lk} \Vert _{{C_Z^{lk}}}^2 \big ] \\&\quad \ge J_l(\phi ^{l*}, \bar{\mathsf{Z}}^*) + \limsup _{T \rightarrow \infty } \frac{2}{T} \sum _{t=0}^{T-1} \sum _{k \in \mathcal {L}_l} \mathbb {E}\big [ ( Z^{l*}_t - \bar{Z}^{k*}_t - \beta ^{lk} )^\top {C_Z^{lk}} ( \bar{Z}^{k*}_t - Y^{k*}_t ) \big ]. \end{aligned}$$

Using the same process as in (58) we arrive at,

$$\begin{aligned}&J_l(\phi ^{l*}, \bar{\mathsf{Z}}^*) - \inf _{\pi ^n \in \varPi ^n} J^\mathsf{(N)}_{n,l} ((\pi ^{n,l},{\tilde{\phi }}^{-n,l}),{\tilde{\phi }}^{-l}) = \mathcal {O}\Bigg (1 / \sqrt{\min _{k \in \mathcal {L}_l} N_k} \Bigg ) \end{aligned}$$

which concludes the proof. $\square $

Proof of Lemma 2

Proof

Let $J_l^2 (K_{l,2}^{(r)})$ denote the cost of the control offset $K_{l,2}^{(r)}$ in the controller. This is the abridged version of the real cost $J_l^2 ((K_{l,1},K_{l,2}^{(r)}),\bar{\mathsf{Z}})$ but since none of the other parameters are changing we can disregard them for this proof. Let $\bar{K}^*_{l,2}$ denote the control offset which minimizes cost $J_l^2 (K_{l,2})$. First we study some properties of cost $J_l^2$.

Let us define two sublevel sets based on the initial cost,

$$\begin{aligned} \mathcal {G}^0_l&:= \{ K_{l,2} \mid J_l^2(K_{l,2}) \le 4 J_l^2 (K^{(1)}_{l,2}) \}, \\ \mathcal {G}^1_l&:= \{ K_{l,2} \mid J_l^2(K_{l,2}) \le 10 J_l^2 (K^{(1)}_{l,2}) \}. \end{aligned}$$

We will show that $K^{(r)}_{l,2} \in \mathcal {G}^0_l$ and $K^{(r)}_{l,2} + r_2 D \in \mathcal {G}^1_l$ for $r \in [R_2]$, where $r_2$ is the smoothing radius and D is a random $p \times m$ matrix generated on a unit sphere. We start by proving some properties of $J_l^2$ over these sets.

Lemma 3

The cost $J_l^2 (K_{l,2})$ satisfies the following properties.

1.
The cost $J_l^2 (K_{l,2})$ can be written down as
$$\begin{aligned} J_l^2 (K_{l,2}) = (K_{l,2} - {\bar{K}}^*_{l,2})^\top {\mathbf {\mathsf{{A}}}}_l (K_{l,2} - {\bar{K}}^*_{l,2}) - \frac{1}{4} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + {\mathbf {\mathsf{{d}}}}_l \end{aligned}$$
where constants ${\mathbf {\mathsf{{A}}}}_l$, ${\mathbf {\mathsf{{c}}}}_l$ and ${\mathbf {\mathsf{{d}}}}_l$ are given in the proof of the Lemma.
2.
The cost $J_l^2 (K_{l,2})$ is continuously differentiable with respect to $K_{l,2}$. In addition, any non-empty sublevel set $\mathcal {G}_l(c) := \{ K_{l,2} \mid J_l^2 (K_{l,2}) \le c \}$ for $c > 0$ is compact.
3.
The cost $J_l^2 (K_{l,2})$ is smooth and strongly convex, with coefficients $\varphi ^l_2$ and $\nu ^l$.
4.
Given that $K_{l,2} \in \mathcal {G}^0_l$, there exists a $\rho ^l_2 > 0$ and $\lambda ^l_2 > 0$ s.t. for any $K'_{l,2}$ where $\Vert K'_{l,2} - K_{l,2} \Vert \le \rho ^l_2$, we have $K'_{l,2} \in \mathcal {G}^1_l$ and $|J_l^2 (K'_{l,2}) - J_l^2 (K_{l,2}) |\le \lambda ^l_2 \Vert K'_{l,2} - K_{l,2} \Vert $.

Proof

The proof of this Lemma and the associated constants are provided in Sect. 5. $\square $

We denote the exact gradient of $J_l^2$ with respect to $K_{l,2}$ by $\nabla J_l^2$, the smoothed (with radius $r_2$) gradient by $\nabla _{r_2} J_l^2$ and the stochastic gradient by ${\tilde{\nabla }} J_l^2$. Now we prove the counterpart of Lemma 6 in Malik et al. [22].

Lemma 4

For $K_{l,2} \in \mathcal {G}^0_l$ the gradients $\nabla J_l^2$, $\nabla _{r_2} J_l^2$ and ${\tilde{\nabla }} J_l^2$ satisfy

1.
$\mathbb {E}[{\tilde{\nabla }} J_l^2 (K_{l,2}) ] = \nabla _{r_2} J_l^2 (K_{l,2})$
2.
$\Vert \nabla _{r_2} J_l^2 (K_{l,2}) - \nabla J_l^2 (K_{l,2}) \Vert _2 \le \varphi ^l_2 r_2$

Proof

The first part follows from the proof of Lemma 6 in Malik et al. [22]. For the second part, for any $K_{l,2} \in \mathcal {G}^0_l$ and ${\hat{K}}_{l,2}$ sampled uniformly from a unit sphere,

$$\begin{aligned} \Vert \nabla _{r_2} J_l^2 (K_{l,2}) - \nabla J_l^2 (K_{l,2}) \Vert _2&= \Vert \nabla \mathbb {E}[ J_l^2 (K_{l,2} + r_2 {\hat{K}}_{l,2}) ] - \nabla J_l^2 (K_{l,2})\Vert _2 \\&= \Vert \mathbb {E}[ \nabla J_l^2 (K_{l,2} + r_2 {\hat{K}}_{l,2}) - \nabla J_l^2 (K_{l,2}) ]\Vert _2 \\&\le \mathbb {E}[ \Vert \nabla J_l^2 (K_{l,2} + r_2 {\hat{K}}_{l,2}) - \nabla J_l^2 (K_{l,2}) \Vert _2 ] \\&\le \varphi ^l_2 r_2 \end{aligned}$$

The second to last step follows from Jensen’s inequality, and the last step is due to the fact that $r_2 < \rho ^l_2$ and $J_l^2$ is smooth with parameter $\varphi ^l_2$. $\square $

The first part of the Lemma proves that the stochastic gradient is an unbiased estimate of the smoothed gradient and the second part bounds the difference between the exact gradient and the smoothed gradient. We also present a Lemma from Malik et al. [22] which bounds the estimation error between the smoothed gradient and the stochastic gradient with high probability. This Lemma also presents a method to decrease the variance of the stochastic gradient by increasing the minibatch-size $k_2$.

Lemma 5

([22]) For any $r_2 \in (0,\rho ^l_2)$, the $k_2$-sample minibatch gradient estimate satisfies the bound

$$\begin{aligned}&\Vert {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) - \nabla _{r_2} J_l^2 (K^{(r)}_{l,2}) \Vert _2 \\&\quad \le \frac{1}{\sqrt{k_2}} \frac{m(L+1)}{r_2} \bigg (J_l^2(K^{(r)}_{l,2})+ \frac{\lambda ^l_2}{\rho ^l_2} \bigg ) \sqrt{\log \bigg (\frac{2m(L+1)}{\delta _2} \bigg )} \end{aligned}$$

with probability at least $1 - \delta _2$.
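The one-point estimator behind Lemmas 4 and 5 can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a diagonal quadratic cost standing in for $J_l^2$, and the names `cost`, `exact_gradient`, `zero_order_gradient` as well as all numerical values are hypothetical. For a quadratic cost the smoothed gradient coincides with the exact gradient, so the minibatch average should approach the exact gradient as the batch grows.

```python
import math
import random

# Hypothetical toy cost: J(x) = sum_i a_i * (x_i - x_star_i)^2, a stand-in for J_l^2.
A_DIAG = [1.0, 2.0]
X_STAR = [1.0, -1.0]


def cost(x):
    return sum(a * (xi - si) ** 2 for a, xi, si in zip(A_DIAG, x, X_STAR))


def exact_gradient(x):
    return [2.0 * a * (xi - si) for a, xi, si in zip(A_DIAG, x, X_STAR)]


def unit_sphere_sample(dim, rng):
    """Draw a direction uniformly from the unit sphere by normalizing a Gaussian."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]


def zero_order_gradient(x, radius, batch, rng):
    """Minibatch one-point estimate: mean of (dim / radius) * J(x + radius * D) * D."""
    dim = len(x)
    total = [0.0] * dim
    for _ in range(batch):
        d = unit_sphere_sample(dim, rng)
        value = cost([xi + radius * di for xi, di in zip(x, d)])
        for i in range(dim):
            total[i] += (dim / radius) * value * d[i]
    return [t / batch for t in total]


rng = random.Random(0)
x0 = [0.0, 0.0]
estimate = zero_order_gradient(x0, radius=0.5, batch=40000, rng=rng)
print("estimate:", estimate)
print("exact:   ", exact_gradient(x0))
```

The estimator only queries cost values, which is why the deviation bound in Lemma 5 scales with the cost level and with $1/r_2$, and why a larger minibatch is the available lever for reducing it.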

Now we need to ensure that the step size is at most $\rho ^l_2$ so that the local Lipschitz property of Lemma 3 applies. Towards that end, let us first define the optimality gap by $\varDelta _r:=J_l^2(K^{(r)}_{l,2}) - J_l^2({\bar{K}}^*_{l,2})$ and assume that $K^{(r)}_{l,2} \in \mathcal {G}^0_l$. If we use a minibatch size of $k_2 = 1024 \frac{m^2(L+1)^2}{r_2^2} \big (J_l^2(K^{(r)}_{l,2})+ \frac{\lambda ^l_2}{\rho ^l_2} \big )^2 \log \big (\frac{2m(L+1)}{\delta _2} \big ) \max \big (\frac{1}{\nu ^l \epsilon _2},(\frac{\lambda ^l_2}{\nu ^l \epsilon _2})^2\big )$ then using Lemma 5 we get,

$$\begin{aligned} \Vert {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) - \nabla _{r_2} J_l^2 (K^{(r)}_{l,2}) \Vert _2 \le \frac{\sqrt{\nu ^l\epsilon _2}}{32} \end{aligned}$$

with probability at least $1 - \delta _2$. Conditioned on this event

$$\begin{aligned}&\Vert \eta _2 {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) \Vert _2&\\&\quad = \eta _2 \Vert {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) - \nabla _{r_2} J_l^2 (K^{(r)}_{l,2}) + \nabla _{r_2} J_l^2 (K^{(r)}_{l,2}) - \nabla J_l^2 (K^{(r)}_{l,2}) + \nabla J_l^2 (K^{(r)}_{l,2}) \Vert _2 \\&\quad \le \eta _2 \Vert {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) - \nabla _{r_2} J_l^2 (K^{(r)}_{l,2}) \Vert _2 + \eta _2 \Vert \nabla _{r_2} J_l^2 (K^{(r)}_{l,2}) - \nabla J_l^2 (K^{(r)}_{l,2}) \Vert _2 + \eta _2 \Vert \nabla J_l^2 (K^{(r)}_{l,2}) \Vert _2 \\&\quad \le \eta _2 \bigg ( \frac{\sqrt{\nu ^l\epsilon _2}}{32} + \varphi ^l_2 r_2 + \lambda ^l_2 \bigg ) \end{aligned}$$

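where the last inequality is obtained by using Lemmas 4 and 5 together with the Lipschitz bound $\Vert \nabla J_l^2 (K^{(r)}_{l,2}) \Vert _2 \le \lambda ^l_2$ from Lemma 3. Since $ \epsilon _2, r_2 < 1$,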

$$\begin{aligned} \Vert \eta _2 {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) \Vert _2 \le \eta _2 \bigg ( \frac{\sqrt{\nu ^l}}{32} + \varphi ^l_2 + \lambda ^l_2 \bigg ); \end{aligned}$$

hence, by choosing $\eta _2 \le \rho ^l_2 \bigg ( \frac{\sqrt{\nu ^l}}{32} + \varphi ^l_2 + \lambda ^l_2 \bigg )^{-1} $ we ensure,

$$\begin{aligned} \Vert \eta _2 {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) \Vert _2 \le \rho ^l_2 \end{aligned}$$

(63)

with probability at least $1-\delta _2$. Thus, the size of the step $\Vert \eta _2 {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) \Vert _2$ has been shown to be bounded by $\rho ^l_2$; hence, the Lipschitzness properties of $J_l^2$ are satisfied with the corresponding coefficients $\lambda ^l_2$. Notice that using the method shown above $\Vert \eta _2 \nabla J_l^2(K^{(r)}_{l,2}) \Vert _2$ can also be shown to be bounded by $\rho ^l_2$.

Next we will show that with high probability $K^{(r)}_{l,2} \in \mathcal {G}^0_l$ for any $r \in [R_2]$. Let us trivially assume that $\epsilon _2/2 < \varDelta _0$. Now we prove that if for $r \in [R_2]$, $\varDelta _r > \epsilon _2 /2$ then,

$$\begin{aligned} J_l^2(K^{(r+1)}_{l,2}) \le J_l^2(K^{(r)}_{l,2}) . \end{aligned}$$

(64)

Recall that

$$\begin{aligned} K^{(r+1)}_{l,2} = K^{(r)}_{l,2} - \eta _2 {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}); \end{aligned}$$

similarly we define ${\bar{K}}^{(r+1)}_{l,2}$ as one step in the direction of the exact gradient:

$$\begin{aligned} {\bar{K}}^{(r+1)}_{l,2} = K^{(r)}_{l,2} - \eta _2 \nabla J_l^2(K^{(r)}_{l,2}) \end{aligned}$$

Due to the smoothness property of $J_l^2$,

$$\begin{aligned}&J_l^2({\bar{K}}^{(r+1)}_{l,2}) - J_l^2(K^{(r)}_{l,2}) \\&\quad \le -\eta _2 \big \langle \nabla J_l^2(K^{(r)}_{l,2}),\nabla J_l^2(K^{(r)}_{l,2}) \big \rangle + \frac{\varphi ^l_2}{2} \eta _2^2 \Vert \nabla J_l^2(K^{(r)}_{l,2}) \Vert _2^2 \\&\quad = \big ( \frac{\varphi ^l_2}{2} \eta _2^2 - \eta _2 \big ) \Vert \nabla J_l^2(K^{(r)}_{l,2}) \Vert _2^2 \end{aligned}$$

Since $\eta _2 \le 1/\varphi ^l_2$,

$$\begin{aligned} J_l^2({\bar{K}}^{(r+1)}_{l,2}) - J_l^2(K^{(r)}_{l,2})&\le -\frac{\eta _2}{2} \Vert \nabla J_l^2(K^{(r)}_{l,2}) \Vert _2^2 \nonumber \\&\le -\eta _2 \nu ^l \varDelta _r< -\eta _2 \nu ^l \epsilon _2/2 < 0 \end{aligned}$$

(65)
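The second inequality in (65) relies on a standard consequence of strong convexity. As a sketch, assuming $\nu ^l$ is the strong convexity parameter in the usual sense, that is $J_l^2(K') \ge J_l^2(K) + \langle \nabla J_l^2(K), K' - K \rangle + \frac{\nu ^l}{2} \Vert K' - K \Vert _2^2$ for all $K, K'$, minimizing the right-hand side over $K'$ and then setting $K = K^{(r)}_{l,2}$, $K' = {\bar{K}}^*_{l,2}$ gives

$$\begin{aligned} J_l^2({\bar{K}}^*_{l,2})&\ge J_l^2(K^{(r)}_{l,2}) - \frac{1}{2 \nu ^l} \Vert \nabla J_l^2(K^{(r)}_{l,2}) \Vert _2^2, \\ \text {hence} \quad \Vert \nabla J_l^2(K^{(r)}_{l,2}) \Vert _2^2&\ge 2 \nu ^l \varDelta _r \quad \text {and} \quad -\frac{\eta _2}{2} \Vert \nabla J_l^2(K^{(r)}_{l,2}) \Vert _2^2 \le -\eta _2 \nu ^l \varDelta _r . \end{aligned}$$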

The following Lemma upper bounds the cost gap $|J_l^2({\bar{K}}^{(r+1)}_{l,2}) - J_l^2( K^{(r+1)}_{l,2}) |$.

Lemma 6

It holds with probability at least $1 - \delta _2$, that,

$$\begin{aligned} |J_l^2({\bar{K}}^{(r+1)}_{l,2}) - J_l^2( K^{(r+1)}_{l,2}) |\le \eta _2 \nu ^l \epsilon _2/16 \end{aligned}$$

Proof

Due to the cost being Lipschitz we have,

$$\begin{aligned}&|J_l^2({\bar{K}}^{(r+1)}_{l,2}) - J_l^2( K^{(r+1)}_{l,2}) |\\&\quad \le \lambda ^l_2 \Vert {\bar{K}}^{(r+1)}_{l,2} - K^{(r+1)}_{l,2} \Vert _2 \\&\quad = \eta _2 \lambda ^l_2 \Vert {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) - \nabla J_l^2(K^{(r)}_{l,2})\Vert _2 \\&\quad \le \eta _2 \lambda ^l_2 \Vert {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) - \nabla _{r_2} J_l^2(K^{(r)}_{l,2}) \Vert _2 + \eta _2 \lambda ^l_2 \Vert \nabla _{r_2} J_l^2(K^{(r)}_{l,2}) - \nabla J_l^2(K^{(r)}_{l,2}) \Vert _2 \\&\quad \le \eta _2 \lambda ^l_2 \Big ( \frac{\nu ^l \epsilon _2}{32 \lambda ^l_2} + \varphi ^l_2 r_2 \Big ) \end{aligned}$$

The last step is due to the fact that

$$\begin{aligned} k_2 = 1024 \frac{m^2(L+1)^2}{r_2^2} \big (J_l^2(K^{(r)}_{l,2})+ \frac{\lambda ^l_2}{\rho ^l_2} \big )^2 \log \big (\frac{2m(L+1)}{\delta _2} \big ) \max \big (\frac{1}{\nu ^l \epsilon _2},(\frac{\lambda ^l_2}{\nu ^l \epsilon _2})^2\big ) \end{aligned}$$

which can be used along with Lemma 5 to arrive at the inequality $\Vert {\tilde{\nabla }} J_l^2(K^{(r)}_{l,2}) - \nabla _{r_2} J_l^2(K^{(r)}_{l,2}) \Vert _2 \le \nu ^l \epsilon _2/(32 \lambda ^l_2)$ with probability at least $1 - \delta _2$. Furthermore, by having $r_2 \le \frac{\nu ^l \epsilon _2}{32 \varphi ^l_2 \lambda ^l_2}$ we arrive at

$$\begin{aligned} |J_l^2({\bar{K}}^{(r+1)}_{l,2}) - J_l^2( K^{(r+1)}_{l,2}) |\le \eta _2 \nu ^l \epsilon _2 / 16 \end{aligned}$$

(66)

with probability at least $1 - \delta _2$. $\square $
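The role of the minibatch size in the argument above can be checked with a few lines of arithmetic. The following sketch plugs hypothetical constants (not taken from the paper) into the Lemma 5 bound and into the stated choice of $k_2$, writing the smoothing radius as `r2`; the function names are illustrative only. With this $k_2$ the bound drops below both $\sqrt{\nu ^l \epsilon _2}/32$ and $\nu ^l \epsilon _2/(32 \lambda ^l_2)$, which are the two thresholds used in the proof.

```python
import math


def lemma5_bound(k2, m, L, r2, cost, lam, rho, delta):
    """Right-hand side of the Lemma 5 bound for a k2-sample minibatch."""
    scale = m * (L + 1) / r2 * (cost + lam / rho)
    return scale * math.sqrt(math.log(2 * m * (L + 1) / delta)) / math.sqrt(k2)


def minibatch_size(m, L, r2, cost, lam, rho, delta, nu, eps):
    """Minibatch size from the proof, rounded up to an integer."""
    scale = (m * (L + 1) / r2) ** 2 * (cost + lam / rho) ** 2
    log_term = math.log(2 * m * (L + 1) / delta)
    worst = max(1.0 / (nu * eps), (lam / (nu * eps)) ** 2)
    return math.ceil(1024 * scale * log_term * worst)


# Hypothetical values, chosen only to exercise the formulas.
params = dict(m=2, L=3, r2=0.05, cost=4.0, lam=1.5, rho=0.2, delta=0.01)
nu, eps = 0.5, 0.1
k2 = minibatch_size(nu=nu, eps=eps, **params)
bound = lemma5_bound(k2, **params)
print(k2, bound, math.sqrt(nu * eps) / 32, nu * eps / (32 * params["lam"]))
```

The printed minibatch size is very large even for these mild constants, which reflects the $1/r_2^2$ and $1/\epsilon _2^2$ dependence in the formula.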

Combining Eq. (65) and Lemma 6 we get,

$$\begin{aligned} J_l^2( K^{(r+1)}_{l,2}) - J_l^2( K^{(r)}_{l,2})&= J_l^2( K^{(r+1)}_{l,2}) - J_l^2({\bar{K}}^{(r+1)}_{l,2}) + J_l^2 ({\bar{K}}^{(r+1)}_{l,2}) - J_l^2( K^{(r)}_{l,2}) \\&< -7 \eta _2 \nu ^l \epsilon _2 /16 < 0 \end{aligned}$$

Hence, if at any iteration r, $\varDelta _r > \epsilon _2/2$, then $\varDelta _{r+1} < \varDelta _r$ with probability at least $1 - \delta _2$. Now we prove that if $\varDelta _r \le \epsilon _2/2$ then $K^{(r+1)}_{l,2} \in \mathcal {G}^0_l$. Using the expression for $J_l^2$ as in Lemma 3,

$$\begin{aligned} J_l^2 (K^{(r+1)}_{l,2})&= (K^{(r+1)}_{l,2} - {\bar{K}}^*_{l,2})^\top {\mathbf {\mathsf{{A}}}}_l (K^{(r+1)}_{l,2} - {\bar{K}}^*_{l,2}) - \frac{1}{4} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + {\mathbf {\mathsf{{d}}}}_l \\&= (K^{(r+1)}_{l,2} - K^{(r)}_{l,2} + K^{(r)}_{l,2} - {\bar{K}}^*_{l,2})^\top {\mathbf {\mathsf{{A}}}}_l (K^{(r+1)}_{l,2} - K^{(r)}_{l,2} + K^{(r)}_{l,2} - {\bar{K}}^*_{l,2}) \\& - \frac{1}{4} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + {\mathbf {\mathsf{{d}}}}_l \\&\le 2 \Vert K^{(r+1)}_{l,2} - K^{(r)}_{l,2} \Vert ^2_{{\mathbf {\mathsf{{A}}}}_l} + 2 \Vert K^{(r)}_{l,2} - {\bar{K}}^*_{l,2} \Vert _{{\mathbf {\mathsf{{A}}}}_l}^2 - \frac{1}{2} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + 2 {\mathbf {\mathsf{{d}}}}_l \\&\le 2 \Vert K^{(r+1)}_{l,2} - K^{(r)}_{l,2} \Vert ^2_{{\mathbf {\mathsf{{A}}}}_l} + 2 \varDelta _r - \frac{1}{2} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + 2 {\mathbf {\mathsf{{d}}}}_l \\&\le 2 \Vert K^{(r+1)}_{l,2} - K^{(r)}_{l,2} \Vert ^2_{{\mathbf {\mathsf{{A}}}}_l} + \epsilon _2 + 2 J_l^2({\bar{K}}^*_{l,2}) \\&\le 2 \Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2 (\rho ^l_2)^2 + \epsilon _2 + 2 J_l^2({\bar{K}}^*_{l,2}) \\&= 2 J_l^2 (K^{(0)}_{l,2}) + \epsilon _2 + 2 J_l^2({\bar{K}}^*_{l,2}) \\&\le 4 J_l^2 (K^{(0)}_{l,2}) \end{aligned}$$

where the above quantities are defined in (97). Hence, $K^{(r+1)}_{l,2} \in \mathcal {G}^0_l$ with probability at least $1 - \delta _2$. The second inequality is due to the fact that $\varDelta _r = J_l^2(K^{(r)}_{l,2}) - J_l^2({\bar{K}}^*_{l,2}) = \Vert K^{(r)}_{l,2} - {\bar{K}}^*_{l,2} \Vert _{{\mathbf {\mathsf{{A}}}}_l}^2$. The third inequality follows from $J_l^2({\bar{K}}^*_{l,2}) = - \frac{1}{4} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + {\mathbf {\mathsf{{d}}}}_l$ and the fact that $\varDelta _r \le \epsilon _2/2$. The second-to-last inequality follows from the definition of $\rho ^l_2$ (99), and the last one follows from the trivial assumption that $\varDelta _0 = J_l^2(K^{(0)}_{l,2}) - J_l^2({\bar{K}}^*_{l,2}) \ge \epsilon _2/2$.

Now we will show that $J_l^2(K^{(R_2)}_{l,2}) - J_l^2({\bar{K}}^*_{l,2}) \le \epsilon _2/2$ with high probability.

$$\begin{aligned} \varDelta _{r+1} - \varDelta _r&= J_l^2( K^{(r+1)}_{l,2}) - J_l^2( K^{(r)}_{l,2}) \\&= J_l^2( K^{(r+1)}_{l,2}) - J_l^2({\bar{K}}^{(r+1)}_{l,2}) + J_l^2 ({\bar{K}}^{(r+1)}_{l,2}) - J_l^2( K^{(r)}_{l,2}) \\&\le -\eta _2 \nu ^l \varDelta _r + \nu ^l \eta _2 \epsilon _2/16 \end{aligned}$$

with probability at least $1- \delta _2$. The last inequality is due to Eq. (65) and Lemma 6. Hence, we get,

$$\begin{aligned} \varDelta _{r+1} \le (1 - \eta _2 \nu ^l) \varDelta _r + \nu ^l \eta _2 \epsilon _2/16 \end{aligned}$$

with probability at least $1 - \delta _2$. Using a union bound type argument and strong recursion, we get

$$\begin{aligned} \varDelta _{R_2}&\le (1-\eta _2 \nu ^l)^{R_2} \varDelta _0 + \sum _{i=0}^{\infty } (1 - \nu ^l \eta _2)^i \nu ^l \eta _2 \frac{\epsilon _2}{16} \\&= (1-\eta _2 \nu ^l)^{R_2} \varDelta _0 + \frac{\epsilon _2}{16} \end{aligned}$$

with probability at least $1-\delta _2 R_2$. Since $R_2 = \frac{1}{\eta _2 \nu ^l} \log (\frac{4 \varDelta _0}{\epsilon _2})$, $\varDelta _{R_2} \le \frac{\epsilon _2}{2}$ with probability at least $1-\delta _2 R_2$. Furthermore since the cost $J^2_l$ is strongly convex,

$$\begin{aligned} \Vert K^{(R_2)}_{l,2} - {\bar{K}}^*_{l,2} \Vert _2 \le \sqrt{\frac{\epsilon _2}{\nu ^l}} \end{aligned}$$

(67)

with probability at least $1-\delta _2 R_2$. This concludes the proof. $\square $
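The final step of the proof can be sanity-checked by iterating the recursion $\varDelta _{r+1} \le (1 - \eta _2 \nu ^l) \varDelta _r + \nu ^l \eta _2 \epsilon _2/16$ with equality, which is the worst case it allows. The sketch below uses hypothetical constants and illustrative function names; it only confirms that the stated iteration count $R_2$ drives the gap below $\epsilon _2/2$, and says nothing about the probability of the underlying events.

```python
import math


def iterations_needed(eta, nu, delta0, eps):
    """R_2 = (1 / (eta * nu)) * log(4 * delta0 / eps), rounded up."""
    return math.ceil(math.log(4.0 * delta0 / eps) / (eta * nu))


def worst_case_gap(eta, nu, delta0, eps, steps):
    """Iterate gap <- (1 - eta*nu) * gap + eta*nu*eps/16, i.e. the bound with equality."""
    gap = delta0
    for _ in range(steps):
        gap = (1.0 - eta * nu) * gap + eta * nu * eps / 16.0
    return gap


# Hypothetical constants for illustration only.
eta, nu, delta0, eps = 0.2, 0.5, 10.0, 0.01
steps = iterations_needed(eta, nu, delta0, eps)
final_gap = worst_case_gap(eta, nu, delta0, eps, steps)
print(steps, final_gap, eps / 2.0)
```

Since strong convexity gives $\frac{\nu ^l}{2} \Vert K - {\bar{K}}^*_{l,2} \Vert _2^2 \le J_l^2(K) - J_l^2({\bar{K}}^*_{l,2})$, a gap of at most $\epsilon _2/2$ translates directly into the distance bound (67).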

Proof of Theorem 3

Proof

This proof provides finite sample bounds on the estimation error of the MFE computed by the RL algorithm. Due to the stochastic nature of the RL algorithm, the learned policy of each generic agent has some error which causes an asymmetry in the joint learned policy. This results in an error in the mean-field trajectory computed by the centralized simulator. However, since the errors in the learned policy are restricted to be within carefully crafted bounds ($\mathcal {O}(\epsilon _1)$ and $\mathcal {O}(\epsilon _2)$ for the linear and offset terms, respectively), the accumulated error in the mean-field trajectory is shown to be bounded. Using the (corrective) contraction property of the mean-field update operator (by the assumption in Theorem 3), we prove convergence of the RL algorithm to an $\epsilon $-neighborhood of the MFE, taking into account the bounded errors introduced by the stochastic nature of the RL algorithm.

The proof is organized in two parts. The first part deals with providing finite sample bounds for linear terms in the MFE and the second part deals with providing finite sample bounds for the affine terms in the MFE.

Part I: We will start by proving the bound on $\Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _2$ and $\Vert K^{(S_1)}_{l,1} - K^*_{l,1} \Vert _2$. From (22) we know that for $s \in [S_1]$, under controllers $(K^{(s)}_{l,1})_{l \in [L]} = \big ([K^{(1,s)}_{l,1}, K^{(2,s)}_{l,1}] \big )_{l \in [L]}$, the mean-field trajectory $\bar{\mathsf{Z}}^{(s)}$ follows stochastic linear dynamics

$$\begin{aligned} \bar{\mathsf{Z}}^{(s)}_{t+1} = \mathsf{F}^{(s)} \bar{\mathsf{Z}}^{(s)}_t + \omega _t, \mathsf{F}^{(s)} = \mathsf{A}- \mathsf{B}(\mathsf{K}^{(1,s)}_1 + \mathsf{K}^{(2,s)}_1) \end{aligned}$$

(68)

where $\mathsf{A}$ and $\mathsf{B}$ are defined in (35) and

$$\begin{aligned} \mathsf{K}^{(1,s)}_1 = {{\,\mathrm{diag}\,}}(K^{(1,s)}_{1,1},\ldots , K^{(1,s)}_{L,1}), \mathsf{K}^{(2,s)}_1 = \begin{bmatrix} K^{(2,s)}_{1,1} \\ \vdots \\ K^{(2,s)}_{L,1} \end{bmatrix}, \omega _t = \begin{bmatrix} \omega ^1_t \\ \vdots \\ \omega ^L_t \end{bmatrix} \end{aligned}$$

Let us similarly define

$$\begin{aligned} \bar{\mathsf{F}}^{(s)}&:= \mathsf{A}- \mathsf{B}(\bar{\mathsf{K}}^{(1,s)}_1 + \bar{\mathsf{K}}^{(2,s)}_1), \nonumber \\ \bar{\mathsf{K}}^{(1,s)}_1&= {{\,\mathrm{diag}\,}}(\bar{K}^{(1,s)}_{1,1},\ldots , \bar{K}^{(1,s)}_{L,1}), \bar{\mathsf{K}}^{(2,s)}_1 = \begin{bmatrix} \bar{K}^{(2,s)}_{1,1} \\ \vdots \\ \bar{K}^{(2,s)}_{L,1} \end{bmatrix} \end{aligned}$$

(69)

where $\bar{K}^{(s+1)}_{l,1} = {\mathop {\mathrm{argmin}}_{K_{l,1}}} \,J^1_l(K_{l,1},\bar{\mathsf{Z}}^{(s)})$. Essentially $\bar{\mathsf{F}}^{(s)}$ represents the mean-field trajectory dynamics consistent with the set of controllers $(\bar{K}^{(s)}_{l,1})_{l \in [L]}$. The following Lemma characterizes $\bar{K}^{(s+1)}_{l,1}$ and $\bar{\mathsf{F}}^{(s)}$.

Lemma 7

The optimal controller for agent l at iteration $s \in [S_1]$, $\bar{K}^{(s)}_{l,1}$ for the stochastic control problem, with dynamics

$$\begin{aligned} \mathsf{X}^l_{t+1} = \bar{\mathsf{A}}^{l,(s)} \mathsf{X}^l_t + \bar{B}^l U^l_t + {\bar{W}}^l_t, \bar{\mathsf{A}}^{l,(s)} = \begin{bmatrix} {A^l} &{} 0 \\ 0 &{} \mathsf{F}^{(s)} \end{bmatrix}, \bar{B}^l = \begin{bmatrix} {B^l} \\ 0 \end{bmatrix}, {\bar{W}}^l_t = \begin{bmatrix} W^l_t \\ \omega _t \end{bmatrix}, \end{aligned}$$

and cost

$$\begin{aligned} J_l(\phi ^l, \bar{\mathsf{Z}}^{(s)}) := \sum _{t=0}^{\infty }\big [ \big \Vert \mathsf{X}^l_t \big \Vert ^2_{\bar{\mathsf{Q}}_l} + \big \Vert U^l_t \big \Vert ^2_{C_U^l} \big ], \end{aligned}$$

is given by (106), and the mean-field trajectory consistent with $(\bar{K}^{(s)}_{l,1})_{l \in [L]}$ has dynamics matrix $\bar{\mathsf{F}}^{(s+1)}$ where $\bar{\mathsf{F}}^{(s+1)} = \mathbb {T}(\mathsf{F}^{(s)})$ and $\mathbb {T}$ is defined as

$$\begin{aligned} \mathbb {T}(M) = \mathsf{H}^\top + \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i \mathsf{C}_Z M^{i+1}. \end{aligned}$$

(70)

This operator is also called the mean-field dynamics update operator.

Proof

The proof of this Lemma is provided in Sect. 6. $\square $

The following Lemma introduces some properties of the mean-field dynamics update operator $\mathbb {T}$.

Lemma 8

Assume that

$$\begin{aligned} T_1&:= \big \Vert \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^T (I - \mathsf{H}^k)^{-1} \mathsf{C}_Z \big \Vert _2< 1, \\ T_2&:= \Big \Vert \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^T \sum _{k=0}^{\infty } \mathsf{H}^k \mathsf{C}_Z (I - (\mathsf{F}^*)^k)(I - \mathsf{F}^*)^{-1} \Big \Vert _2 < 1. \end{aligned}$$

Then, the operator $\mathbb {T}$ is contractive with coefficient $T_1$, and $\mathsf{F}^*$ is its fixed point.

This Lemma can be proved following the proof of Proposition 1 in Zaman et al. [29]. Having characterized $\bar{K}^{(s)}_{l,1}$ we now prove the bound on $\Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _2$. For $s \in [S_1 - 1]$,

$$\begin{aligned}&\Vert \mathsf{F}^{(s+1)} - \mathsf{F}^* \Vert _F \le \Vert \mathsf{F}^{(s+1)} - \bar{\mathsf{F}}^{(s+1)} \Vert _F + \Vert \bar{\mathsf{F}}^{(s+1)} - \mathsf{F}^* \Vert _F \nonumber \\&\quad \le \Vert \mathsf{B}\Vert _F (\Vert \mathsf{K}^{(1,s+1)}_1 - \bar{\mathsf{K}}^{(1,s+1)}_1 \Vert _F + \Vert \mathsf{K}^{(2,s+1)}_1 - \bar{\mathsf{K}}^{(2,s+1)}_1 \Vert _F) + T_1 \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _F \nonumber \\&\quad \le \Vert \mathsf{B}\Vert _F \sum _{l \in [L]} \sigma ^{-1}_{\min }(\bar{\varSigma }^l) \sigma ^{-1}_{\min }({C_U^l}) \epsilon _1 + T_1 \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _F \end{aligned}$$

(71)

with probability at least $1 - \delta _1 R_1$. The second inequality is due to the definitions of $\mathsf{F}^{(s+1)}$ and $\bar{\mathsf{F}}^{(s+1)}$ ((68)-(69), respectively), the fact that $\bar{\mathsf{F}}^{(s+1)} = \mathbb {T}(\mathsf{F}^{(s)}) $ and the contractive property of $\mathbb {T}$. The third inequality is obtained by using Lemma 1. Using a union bound type argument, we get

$$\begin{aligned} \Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _F \le T^{S_1}_1\Vert \mathsf{F}^{(1)} - \mathsf{F}^* \Vert _F + \sum _{j = 0}^{S_1 - 1} T^j_1 \Vert \mathsf{B}\Vert _F \frac{ \epsilon _1}{2} \sum _{l \in [L]} \sigma ^{-1}_{\min }(\bar{\varSigma }^l) \sigma ^{-1}_{\min }({C_U^l}) \end{aligned}$$

with probability at least $1 - \delta _1 S_1 R_1$. Since,

$$\begin{aligned} \epsilon _1 \le \frac{(1-T_1)\epsilon }{\Vert \mathsf{B}\Vert _F \sum _{l \in [L]} \sigma ^{-1}_{\min }(\bar{\varSigma }^l) \sigma ^{-1}_{\min }({C_U^l})}, l \in [L] \text { and } \delta _1 = \frac{\delta }{S_1 R_1} \end{aligned}$$

(72)

we arrive at

$$\begin{aligned} \Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _F&\le T^{S_1}_1\Vert \mathsf{F}^{(1)} - \mathsf{F}^* \Vert _F + \sum _{j = 0}^{S_1 - 1} T^j_1 (1-T_1) \frac{\epsilon }{2} \\&\le T^{S_1}_1\Vert \mathsf{F}^{(1)} - \mathsf{F}^* \Vert _F + \frac{\epsilon }{2} \end{aligned}$$

with probability at least $1 - \delta $. Since $S_1 = \frac{1}{1-T_1}\log (\frac{2 \Vert \mathsf{F}^{(1)} - \mathsf{F}^* \Vert _F}{\epsilon })$,

$$\begin{aligned} \Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _F&\le \epsilon \end{aligned}$$

(73)

with probability at least $1 - \delta $. Now we prove that $\mathsf{F}^{(s)}$ are stable for $s \in [S_1]$. Using reasoning similar to (71), we arrive at

$$\begin{aligned}&\Vert \mathsf{F}^{(s+1)} \Vert _F \le \Vert \mathsf{B}\Vert _F \sum _{l \in [L]} \sigma ^{-1}_{\min }(\bar{\varSigma }^l) \sigma ^{-1}_{\min }({C_U^l}) \epsilon _1 + T_1 \Vert \mathsf{F}^{(s)} \Vert _F \end{aligned}$$

(74)

We know that $\epsilon _1 \le \frac{(1-T_1)\epsilon }{\Vert \mathsf{B}\Vert _F \sum _{l \in [L]} \sigma ^{-1}_{\min }(\bar{\varSigma }^l) \sigma ^{-1}_{\min }({C_U^l})}$ and $\epsilon < 1$. Hence, if $\Vert \mathsf{F}^{(s)} \Vert _F < 1$, then (74) gives $\Vert \mathsf{F}^{(s+1)} \Vert _F \le (1-T_1)\epsilon + T_1 \Vert \mathsf{F}^{(s)} \Vert _F < 1$. Since $\bar{\mathsf{Z}}^{(1)} = 0$, we have $\Vert \mathsf{F}^{(1)} \Vert _F = 0 < 1$, and by induction $\Vert \mathsf{F}^{(s)} \Vert _F < 1$; hence, $\mathsf{F}^{(s)}$ is stable for all $s \in [S_1]$.

Now we move on to upper bounding $\Vert K^{(S_1+1)}_{l,1} - K^*_{l,1} \Vert _F$. First we consider for $s \in [S_1 - 1]$,

$$\begin{aligned} \Vert K^{(s+1)}_{l,1} - K^*_{l,1} \Vert _F \le \Vert K^{(s+1)}_{l,1} - \bar{K}^{(s+1)}_{l,1} \Vert _F + \Vert \bar{K}^{(s+1)}_{l,1} - K^*_{l,1} \Vert _F \end{aligned}$$

From Lemma 1 we know that

$$\begin{aligned} \Vert K^{(s+1)}_{l,1} - K^*_{l,1} \Vert _F \le \sigma ^{-1}_{\min } (\bar{\varSigma }^l) \sigma ^{-1}_{\min } ({C_U^l}) \frac{\epsilon _1}{2} + \Vert \bar{K}^{(s+1)}_{l,1} - K^*_{l,1} \Vert _F \end{aligned}$$

Recalling the definitions of $\bar{K}^{(s+1)}_{l,1}$ and $K^*_{l,1}$ from Lemma 7,

$$\begin{aligned} \bar{K}^{(s+1)}_{l,1}&= \begin{bmatrix} G^l {A^l}& (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\mathsf{F}^{(s)})^{i+1} \end{bmatrix} \\ K^*_{l,1}&= \begin{bmatrix} G^l {A^l}& (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\mathsf{F}^*)^{i+1} \end{bmatrix}. \end{aligned}$$

Using these expressions $\Vert K^{(s+1)}_{l,1} - K^*_{l,1} \Vert _F$ can be upper bounded by

$$\begin{aligned} \Vert K^{(s+1)}_{l,1} - K^*_{l,1} \Vert _F \le \sigma ^{-1}_{\min } (\bar{\varSigma }^l) \sigma ^{-1}_{\min } ({C_U^l}) \frac{\epsilon _1}{2} + D^1_l \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _F \end{aligned}$$

where

$$\begin{aligned} D^1_l = \Vert (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \Vert _F \Vert \mathsf{C}^l_Z \Vert _F /(1 - \Vert {H^l} \Vert _F)^2 . \end{aligned}$$
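The constant $D^1_l$ arises from a term-by-term bound on the difference of the two series in $\bar{K}^{(s+1)}_{l,1}$ and $K^*_{l,1}$ (their first blocks $G^l A^l$ cancel). A sketch of that step, assuming $\Vert \mathsf{F}^{(s)} \Vert _F < 1$ as shown above, and additionally $\Vert \mathsf{F}^* \Vert _F \le 1$ and $\Vert H^l \Vert _F < 1$ (the last two are assumptions not stated in this section), uses $\Vert M^{i+1} - N^{i+1} \Vert _F \le (i+1) \max (\Vert M \Vert _F, \Vert N \Vert _F)^i \Vert M - N \Vert _F$ and $\sum _{i=0}^{\infty } (i+1) h^i = (1-h)^{-2}$ for $h \in [0,1)$:

$$\begin{aligned} \Big \Vert \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z \big [ (\mathsf{F}^{(s)})^{i+1} - (\mathsf{F}^*)^{i+1} \big ] \Big \Vert _F&\le \sum _{i=0}^{\infty } (i+1) \Vert H^l \Vert _F^i \Vert \mathsf{C}^l_Z \Vert _F \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _F \\&= \frac{\Vert \mathsf{C}^l_Z \Vert _F}{(1 - \Vert H^l \Vert _F)^2} \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _F . \end{aligned}$$

Multiplying by $\Vert (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \Vert _F$ recovers $D^1_l \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _F$.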

Using the value of $\epsilon _1$ from (72) we get

$$\begin{aligned} \Vert K^{(S_1+1)}_{l,1} - K^*_{l,1} \Vert _F \le D^2_l \epsilon \end{aligned}$$

(75)

with probability at least $1 - \delta $, where

$$\begin{aligned} D^2_l = \frac{(1-T_1)\sigma ^{-1}_{\min }(\bar{\varSigma }^l) \sigma ^{-1}_{\min }({C_U^l}) }{2 \Vert \mathsf{B}\Vert _F \sum _{j \in [L]} \sigma ^{-1}_{\min }(\bar{\varSigma }^j) \sigma ^{-1}_{\min }({C_U^j})} \end{aligned}$$

(76)

Part II: Now we focus on the bounds for $\Vert \mathsf{C}^{(S_2)} - \mathsf{C}^* \Vert _2$ and $\Vert K^{(S_2)}_{l,2} - K^*_{l,2} \Vert _F$. We know that in the second part of the algorithm, for $s \in [S_2]$, $\bar{\mathsf{Z}}^{(s)}$ follows stochastic affine dynamics,

$$\begin{aligned} \bar{\mathsf{Z}}^{(s)}_{t+1} = \mathsf{F}^{(S_1)} \bar{\mathsf{Z}}^{(s)}_t + \mathsf{C}^{(s)} + \omega _t, \end{aligned}$$

where

$$\begin{aligned} \mathsf{F}^{(S_1)} = \mathsf{A}- \mathsf{B}(\mathsf{K}^{(1,S_1)}_1 + \mathsf{K}^{(2,S_1)}_1), \mathsf{C}^{(s)} = - \mathsf{B}\mathsf{K}^{(s)}_{2}, \mathsf{K}^{(s)}_{2} = \begin{bmatrix} K^{(s)}_{1,2} \\ \vdots \\ K^{(s)}_{L,2} \end{bmatrix} \end{aligned}$$

(77)

Let us define $\bar{K}^{(s+1)}_{l,2} = {\mathop {\mathrm{argmin}}_{K_{l,2}}} \,J^2_l((K^{(S_1)}_{l,1},K_{l,2}),\bar{\mathsf{Z}}^{(s)})$. Control offset $\bar{K}^{(s+1)}_{l,2}$ can be characterized using the following Lemma.

Lemma 9

The optimal control offset for agent l at iteration $s \in [S_2]$, $\bar{K}^{(s)}_{l,2}$ for the stochastic control problem, with drifted dynamics

$$\begin{aligned} \mathsf{X}^l_{t+1} = \bar{\mathsf{A}}^l \mathsf{X}^l_t + \bar{B}^l U^l_t + \bar{\mathsf{C}}^{(s)} + {\bar{W}}^l_t, \end{aligned}$$

where

$$\begin{aligned} \bar{\mathsf{A}}^l = \begin{bmatrix} {A^l} &{} 0 \\ 0 &{} \mathsf{F}\end{bmatrix}, \bar{B}^l = \begin{bmatrix} {B^l} \\ 0 \end{bmatrix}, \bar{\mathsf{C}}^{(s)} = \begin{bmatrix} 0 \\ \mathsf{C}^{(s)} \end{bmatrix}, {\bar{W}}^l_t = \begin{bmatrix} W^l_t \\ \omega _t \end{bmatrix}, \end{aligned}$$

and cost with constant tracking

$$\begin{aligned} J_l(\phi ^l, \bar{\mathsf{Z}}^{(s)}) := \sum _{t=0}^{\infty }\big [ \big \Vert \mathsf{X}^l_t - \bar{\upbeta }^l\big \Vert ^2_{\bar{\mathsf{Q}}_l} + \big \Vert U^l_t \big \Vert ^2_{C_U^l} \big ], \end{aligned}$$

is given as follows:

$$\begin{aligned} \bar{K}^{(s)}_{l,2} = (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } ({H^l})^{i} \mathsf{C}^l_Z ((I - \mathsf{F}^i)(I - \mathsf{F})^{-1} \mathsf{C}^{(s)} + \upbeta ^l) \end{aligned}$$

(78)

where $\upbeta ^l = (\beta ^{l1},\ldots ,\beta ^{lL}) \in \mathbb {R}^{mL}$. The mean-field trajectory consistent with $(\bar{K}^{(s)}_{l,2})_{l \in [L]}$ has offset $\varLambda (\mathsf{C}^{(s)})$, where the operator $\varLambda $ is defined as

$$\begin{aligned} \varLambda (\mathsf{C}^{(s)}) = \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i [\mathsf{C}_Z(I - \mathsf{F}^i)(I - \mathsf{F})^{-1} \mathsf{C}^{(s)}] + {{\,\mathrm{diag}\,}}(\mathsf{C}^1_Z,\ldots ,\mathsf{C}^L_Z) \upbeta \end{aligned}$$

(79)

where $\upbeta = (\upbeta ^1, \ldots , \upbeta ^L) \in \mathbb {R}^{mLL}$. This operator is also called the mean-field offset update operator.

Proof

The proof of this Lemma is provided in Sect. 7. $\square $

As the operator $\varLambda $ is defined for a fixed matrix $\mathsf{F}$, we define two operators for specific matrices. We define $\bar{\varLambda }$ and $\varLambda ^*$ for the dynamics matrices $\mathsf{F}^{(S_1)}$ and $\mathsf{F}^*$, respectively.

$$\begin{aligned} \bar{\varLambda } (\mathsf{C}^{(s)})&= \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i [\mathsf{C}_Z(I - (\mathsf{F}^{(S_1)})^i)(I - \mathsf{F}^{(S_1)})^{-1} \mathsf{C}^{(s)}] \nonumber \\&\quad - {{\,\mathrm{diag}\,}}(\mathsf{C}^1_Z,\ldots ,\mathsf{C}^L_Z) \upbeta \end{aligned}$$

(80)

$$\begin{aligned} \varLambda ^* (\mathsf{C}^{(s)})&= \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i [\mathsf{C}_Z(I - (\mathsf{F}^*)^i)(I - \mathsf{F}^*)^{-1} \mathsf{C}^{(s)}] \nonumber \\&\quad - {{\,\mathrm{diag}\,}}(\mathsf{C}^1_Z,\ldots ,\mathsf{C}^L_Z) \upbeta \end{aligned}$$

(81)

We require $\varLambda ^*$ to be contractive, and hence, we require the Lipschitz constant $T_2$ of $\varLambda ^*$ to be less than one,

$$\begin{aligned} T_2 := \Big \Vert \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i \mathsf{C}_Z(I - (\mathsf{F}^*)^i)(I - \mathsf{F}^*)^{-1} \Big \Vert _2 < 1 \end{aligned}$$

(82)

Since $T_2 < 1$, $\varLambda ^*$ is contractive. Also, ${\mathsf{C}}^*$ is the fixed point of the operator $\varLambda ^*$, that is, ${\mathsf{C}}^* = \varLambda ^*({\mathsf{C}}^*)$. Let us analyze the convergence of $\mathsf{C}^{(s)}$ to $\mathsf{C}^*$. Toward that end, let us consider the following inequality:

$$\begin{aligned}&\Vert {\mathsf{C}}^{(s+1)} - {\mathsf{C}}^* \Vert _2 \nonumber \\& \le \Vert {\mathsf{C}}^{(s+1)} - \bar{\varLambda }({\mathsf{C}}^{(s)}) \Vert _2 + \Vert \bar{\varLambda }({\mathsf{C}}^{(s)}) - \varLambda ^*({\mathsf{C}}^{(s)}) \Vert _2 + \Vert \varLambda ^*({\mathsf{C}}^{(s)}) - {\mathsf{C}}^* \Vert _2 \end{aligned}$$

(83)

We now bound the three terms in (83) separately. Using (77) and (80), the first term can be bounded as follows:

$$\begin{aligned} \Vert {\mathsf{C}}^{(s+1)} - \bar{\varLambda }({\mathsf{C}}^{(s)}) \Vert _2&\le \Vert \mathsf{B}\Vert _2 \Vert {{\,\mathrm{diag}\,}}(K^{(s+1)}_{1,2} - \bar{K}^{(s+1)}_{1,2}, \ldots , K^{(s+1)}_{L,2} - \bar{K}^{(s+1)}_{L,2}) \Vert _2, \nonumber \\&\le \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}} \epsilon \end{aligned}$$

(84)

with probability at least $1 - \delta _2 R_2$, where the last inequality is obtained using Lemma 2 and the fact that $\epsilon _2 \le \epsilon ^2$. Since ${\mathsf{C}}^* = \varLambda ^*({\mathsf{C}}^*)$ and $\varLambda ^*$ is Lipschitz with constant $T_2$, the last term in (83) is bounded by

$$\begin{aligned} \Vert \varLambda ^*({\mathsf{C}}^{(s)}) - {\mathsf{C}}^* \Vert _2 \le T_2 \Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2 \end{aligned}$$

(85)

To bound the second term in (83) we must first bound the following quantity

$$\begin{aligned}&\big \Vert (I - (\mathsf{F}^{(S_1)})^k) (I - \mathsf{F}^{(S_1)})^{-1} - (I - (\mathsf{F}^*)^k)(I - \mathsf{F}^*)^{-1} \big \Vert _2 = \Big \Vert \sum _{i=0}^{k-1} \big [ (\mathsf{F}^{(S_1)})^i - (\mathsf{F}^*)^i \big ] \Big \Vert _2 \nonumber \\&\quad = \Big \Vert \sum _{i=0}^{k-1} \sum _{j = 0}^{i-1} (\mathsf{F}^{(S_1)})^{i-1-j} (\mathsf{F}^{(S_1)} - \mathsf{F}^*) (\mathsf{F}^*)^j \Big \Vert _2 \le \sum _{i=1}^{k-1} i \bar{F}^{i-1} \Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _2 \nonumber \\&\quad \le \frac{\Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _2}{(1 - \bar{F})^2} \end{aligned}$$

(86)
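The telescoping identity $M^i - N^i = \sum _{j=0}^{i-1} M^{i-1-j} (M - N) N^j$ behind (86), and the resulting bound, can be verified numerically. The sketch below is illustrative only: the two $2 \times 2$ matrices are hypothetical stand-ins for $\mathsf{F}^{(S_1)}$ and $\mathsf{F}^*$, and the Frobenius norm replaces the 2-norm (with `f_bar` playing the role of $\bar{F}$) because it is submultiplicative and easy to compute without external libraries.

```python
import math


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matadd(a, b, sign=1.0):
    """Entrywise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matpow(a, power):
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = matmul(result, a)
    return result


def frobenius(a):
    return math.sqrt(sum(x * x for row in a for x in row))


def telescoped_difference(a, b, i):
    """sum_{j=0}^{i-1} a^(i-1-j) (a - b) b^j, which should equal a^i - b^i."""
    n = len(a)
    total = [[0.0] * n for _ in range(n)]
    diff = matadd(a, b, -1.0)
    for j in range(i):
        term = matmul(matmul(matpow(a, i - 1 - j), diff), matpow(b, j))
        total = matadd(total, term)
    return total


def partial_sum_difference(a, b, k):
    """sum_{i=0}^{k-1} (a^i - b^i), the matrix whose norm is bounded in (86)."""
    n = len(a)
    total = [[0.0] * n for _ in range(n)]
    for i in range(k):
        total = matadd(total, matadd(matpow(a, i), matpow(b, i), -1.0))
    return total


# Hypothetical stable matrices standing in for F^(S_1) and F^*.
F_LEARNED = [[0.30, 0.10], [0.00, 0.20]]
F_STAR = [[0.25, 0.10], [0.05, 0.20]]
f_bar = max(frobenius(F_LEARNED), frobenius(F_STAR))
lhs = frobenius(partial_sum_difference(F_LEARNED, F_STAR, 10))
rhs = frobenius(matadd(F_LEARNED, F_STAR, -1.0)) / (1.0 - f_bar) ** 2
print(lhs, rhs)
```

The bound degrades as `f_bar` approaches one, which is why the stability of $\mathsf{F}^{(s)}$ established in Part I matters for the offset analysis.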

Now let us look at the second term in (83),

$$\begin{aligned}&\Vert \bar{\varLambda }({\mathsf{C}}^{(s)}) - \varLambda ^*({\mathsf{C}}^{(s)}) \Vert _2 \nonumber \\&\quad = \Big \Vert \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i \mathsf{C}_Z[(I - (\mathsf{F}^{(S_1)})^i) (I - \mathsf{F}^{(S_1)})^{-1} -(I - (\mathsf{F}^*)^i)(I - \mathsf{F}^*)^{-1}] \mathsf{C}^{(s)} \Big \Vert _2 \nonumber \\&\quad \le \Big \Vert \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i \mathsf{C}_Z\Big \Vert _2 \frac{\Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _2}{(1 - \bar{F})^2} \Vert \mathsf{C}^{(s)} \Vert _2 \nonumber \\&\quad \le \frac{\Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _2}{(1 - \bar{F})^2} \Vert \mathsf{C}^{(s)} \Vert _2 \le D^3 \Vert \mathsf{C}^{(s)} \Vert _2 \epsilon \end{aligned}$$

(87)

where the first inequality is obtained using (86), the second inequality using the assumptions of Theorem 3 and

$$\begin{aligned} D^3 := (1 - \bar{F})^{-2}. \end{aligned}$$

(88)

Using (84), (85), (87) and a union bound type argument, we obtain

$$\begin{aligned} \Vert {\mathsf{C}}^{(s+1)} - {\mathsf{C}}^* \Vert _2 \le D^3 \Vert {\mathsf{C}}^{(s)} \Vert _2 \epsilon + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]} \sqrt{\nu ^l}} \epsilon + T_2 \Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2 \end{aligned}$$

(89)

with probability at least $1-\delta - \delta _2 R_2$. Due to the ${\mathsf{C}}^{(s)}$ term in the right hand side of (89) we first find an upper bound for ${\mathsf{C}}^{(s)}$. Toward that end we use (89) to obtain:

$$\begin{aligned}&\Vert {\mathsf{C}}^{(s+1)} - {\mathsf{C}}^* \Vert _2 \\&\quad \le D^3 \Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2 \epsilon + T_2 \Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2 + \Big (D^3 \Vert \mathsf{B}\Vert _2 \Vert {\mathsf{C}}^* \Vert _2 + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}} \Big ) \epsilon \\&\quad \le \frac{1+T_2}{2} \Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2 + \Big (\frac{1-T_2}{2} \Vert {\mathsf{C}}^* \Vert _2 + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}} \Big ). \end{aligned}$$

The last inequality is due to the fact that $\epsilon \le \min (1,\frac{1 - T_2}{2D^3 \Vert \mathsf{B}\Vert _2})$. Now $\Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2$ can be bounded as follows,

$$\begin{aligned}&\Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2 \\&\quad \le \bigg (\frac{1+T_2}{2}\bigg )^{s-1} \Vert {\mathsf{C}}^{(1)} - {\mathsf{C}}^* \Vert _2 + \sum ^{s-2}_{i=0} \bigg (\frac{1+T_2}{2}\bigg )^i \Big (\frac{1-T_2}{2} \Vert {\mathsf{C}}^* \Vert _2 + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}} \Big ) \\&\quad \le \Vert {\mathsf{C}}^{(1)} - {\mathsf{C}}^* \Vert _2 + \Vert {\mathsf{C}}^* \Vert _2 + \frac{2 \Vert \mathsf{B}\Vert _2}{(1 - T_2)\min _{l \in [L]}\sqrt{\nu ^l}} , \end{aligned}$$

and hence,

$$\begin{aligned} \Vert {\mathsf{C}}^{(s)} \Vert _2 \le 2 \Vert {\mathsf{C}}^* \Vert _2 + \Vert {\mathsf{C}}^{(1)} - {\mathsf{C}}^* \Vert _2 + \frac{2 \Vert \mathsf{B}\Vert _2}{(1 - T_2)\min _{l \in [L]}\sqrt{\nu ^l}} =: \bar{C}. \end{aligned}$$

(90)
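The passage from the recursion to (90) rests on a geometric-series bound and the triangle inequality. The following lines are a sketch of that step, assuming $0 \le T_2 < 1$ so that the ratio $(1+T_2)/2$ lies in $[1/2, 1)$:

$$\begin{aligned} \sum ^{s-2}_{i=0} \bigg (\frac{1+T_2}{2}\bigg )^i&\le \frac{1}{1 - \frac{1+T_2}{2}} = \frac{2}{1-T_2}, \qquad \bigg (\frac{1+T_2}{2}\bigg )^{s-1} \le 1, \\ \Vert {\mathsf{C}}^{(s)} \Vert _2&\le \Vert {\mathsf{C}}^* \Vert _2 + \Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2 . \end{aligned}$$

Multiplying the constant term $\frac{1-T_2}{2} \Vert {\mathsf{C}}^* \Vert _2 + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}}$ by $\frac{2}{1-T_2}$ gives the last two terms of the bound on $\Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2$.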

Now we can write (89) as

$$\begin{aligned} \Vert {\mathsf{C}}^{(s+1)} - {\mathsf{C}}^* \Vert _2 \le T_2 \Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2 + \bigg (D^3 \Vert \mathsf{B}\Vert _2 \bar{C} + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}} \bigg ) \epsilon \end{aligned}$$

(91)

with probability at least $1-\delta - \delta _2 R_2$, which using a union bound type argument leads to

$$\begin{aligned} \Vert {\mathsf{C}}^{(s)} - {\mathsf{C}}^* \Vert _2&\le (T_2)^{s-1} \Vert {\mathsf{C}}^{(1)} - {\mathsf{C}}^* \Vert _2 + \sum _{i=0}^{s-2} (T_2)^i \bigg (D^3 \Vert \mathsf{B}\Vert _2 \bar{C} + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}} \bigg ) \epsilon \\&\le (T_2)^{s-1} \Vert {\mathsf{C}}^{(1)} - {\mathsf{C}}^* \Vert _2 + \frac{1}{1-T_2} \bigg (D^3 \Vert \mathsf{B}\Vert _2 \bar{C} + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}} \bigg ) \epsilon \end{aligned}$$

with probability at least $1- \delta -s \delta _2 R_2$. Plugging in the values $S_2 = \frac{1}{1-T_2}\log (\frac{2 \Vert {\mathsf{C}}^{(1)} - {\mathsf{C}}^* \Vert _2}{\epsilon })$ and $\delta _2 = \frac{\delta }{S_2 R_2}$, we obtain

$$\begin{aligned} \Vert {\mathsf{C}}^{(S_2)} - {\mathsf{C}}^* \Vert _2 \le D^4 \epsilon \end{aligned}$$

with probability at least $1-2 \delta $, where

$$\begin{aligned} D^4 = \bigg (\frac{1}{2} + \frac{1}{1-T_2} \bigg (D^3 \Vert \mathsf{B}\Vert _2 \bar{C} + \frac{\Vert \mathsf{B}\Vert _2}{\min _{l \in [L]}\sqrt{\nu ^l}} \bigg )\bigg ) \end{aligned}$$

(92)
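The way the one-step contraction (91) unrolls into the final bound can be checked numerically on a scalar toy problem. The sketch below is only an illustration: the values of `T`, `e1`, `eps` and `c` are hypothetical stand-ins for $T_2$, $\Vert {\mathsf{C}}^{(1)} - {\mathsf{C}}^* \Vert _2$, $\epsilon $ and the bracketed constant in (91), the recursion is run with equality (the worst case), and $S_2$ is rounded up to an integer.

```python
import math

def unrolled_bound(T, e1, c_eps, s):
    """Right-hand side T^(s-1) * e1 + c_eps / (1 - T) of the unrolled recursion."""
    return T ** (s - 1) * e1 + c_eps / (1.0 - T)

def iterate_worst_case(T, e1, c_eps, steps):
    """Errors e_1, ..., e_steps when e_{s+1} = T * e_s + c_eps holds with equality."""
    errors = [e1]
    for _ in range(steps - 1):
        errors.append(T * errors[-1] + c_eps)
    return errors

# Hypothetical values, chosen only for illustration.
T, e1, eps, c = 0.8, 5.0, 1e-2, 3.0

# Number of iterations suggested by S_2 = log(2 * e1 / eps) / (1 - T), rounded up.
S2 = math.ceil(math.log(2 * e1 / eps) / (1 - T))
errors = iterate_worst_case(T, e1, c * eps, S2)

# Every iterate respects the unrolled bound.
bound_holds = all(
    e <= unrolled_bound(T, e1, c * eps, s) + 1e-12
    for s, e in enumerate(errors, start=1)
)

# Final error versus the constant (1/2 + c / (1 - T)) * eps, the scalar analogue of D^4 * eps.
final_error = errors[-1]
final_bound = (0.5 + c / (1 - T)) * eps
print(S2, bound_holds, final_error <= final_bound)
```

For these particular values the final error falls below the analogue of $D^4 \epsilon $; other choices of the hypothetical constants would need to be checked separately.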

Now we bound the quantity $\Vert K^{(S_2+1)}_{l,2}-K^*_{l,2} \Vert _2$.

$$\begin{aligned}&\Vert K^{(S_2+1)}_{l,2}-K^*_{l,2} \Vert _2 \\&\quad \le \Vert K^{(S_2+1)}_{l,2}-{\tilde{K}}^{(S_2+1)}_{l,2} \Vert _2 + \Vert {\tilde{K}}^{(S_2+1)}_{l,2} - \lambda ^*_l {\mathsf{C}}^{(S_2)}\Vert _2 + \Vert \lambda ^*_l {\mathsf{C}}^{(S_2)} - K^*_{l,2}\Vert _2 \\&\quad = \Vert K^{(S_2+1)}_{l,2}-{\tilde{K}}^{(S_2+1)}_{l,2} \Vert _2 + \Vert {\tilde{\lambda }}_l {\mathsf{C}}^{(S_2)} - \lambda ^*_l {\mathsf{C}}^{(S_2)}\Vert _2 + \Vert \lambda ^*_l {\mathsf{C}}^{(S_2)} - \lambda ^*_l {\mathsf{C}}^*\Vert _2 \\&\quad \le \sqrt{\frac{1}{\nu ^l}} \epsilon + \bar{C} D^3_l \epsilon + \Vert \lambda ^*_l \Vert _2 D^4 \epsilon \\&\quad \le D^5_l \epsilon \end{aligned}$$

with probability at least $1-2\delta $, where

$$\begin{aligned} D^5_l = \sqrt{\frac{1}{\nu ^l}} + \bar{C} D^3_l + \Vert \lambda ^*_l \Vert _2 D^4 \end{aligned}$$

(93)

Now we prove that global constants $\rho ^l_1, \varphi ^l_1, \lambda ^l_1, \nu ^l, \varphi ^l_2, \rho ^l_2$ and $\lambda ^l_2$ for each $l \in [L]$ do exist and characterize them.

Lemma 10

If $\epsilon \le \frac{1}{\sqrt{m(L+1)}} \min (1, \min _{l \in [L]} c^l_{16}, \min _{l \in [L]} \frac{1}{D^2_l})$ where $c^l_{16}$ and $D^2_l$ are defined in (121) and (76), respectively, then the global constants $\mu ^l,\rho ^l_1, \varphi ^l_1, \lambda ^l_1, \nu ^l, \varphi ^l_2, \rho ^l_2$ and $\lambda ^l_2$ for each $l \in [L]$ exist and are given by (125) and (138).

Proof

The proof of this Lemma is provided in Sect. 8. $\square $

Hence, we have completed the proof of Theorem 3. $\square $

Proof of Lemma 3

Proof

We know from Proposition B2 in Fu et al. [13] that the cost $J_l^2$ is quadratic in $K_{l,2}$,

$$\begin{aligned}&J_l^2((K_{l,1},K_{l,2}),\bar{\mathsf{Z}})\nonumber \\&\quad =\begin{pmatrix} \mu _{K} \\ K_{l,2} \end{pmatrix}^\top \begin{pmatrix} \bar{\mathsf{Q}}_l + K_{l,1}^\top {C_U^l} K_{l,1} &{} -K_{l,1}^\top {C_U^l} \\ - {C_U^l} K_{l,1} &{} {C_U^l} \end{pmatrix} \begin{pmatrix} \mu _{K} \\ K_{l,2} \end{pmatrix} - 2 (\bar{\upbeta }^l)^\top \bar{\mathsf{Q}}_l \mu _{K} \end{aligned}$$

(94)

where

$$\begin{aligned} \mu _{K} = \big (I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} \big )^{-1} \big ( \bar{B}^l K_{l,2} + \bar{\mathsf{C}} \big ) \end{aligned}$$

(95)

As $J_l^2$ is quadratic in $K_{l,2}$, it is continuously differentiable with respect to $K_{l,2}$. Moreover, as the Hessian of $J_l^2$ is positive definite (Proposition 3.3 [13]), the non-empty level sets of $J_l^2$ are ellipsoids and hence the non-empty sublevel sets are compact.

Now we aim to derive the values of the Lipschitz constant $\varphi ^l_2$ and radius $\rho ^l_2$. First we notice that the cost $J_l^2$ can be written in the form,

$$\begin{aligned} J_l^2 (K_{l,2}) = K_{l,2}^\top {\mathbf {\mathsf{{A}}}}_l K_{l,2} + {\mathbf {\mathsf{{c}}}}^\top _l K_{l,2} + {\mathbf {\mathsf{{d}}}}_l \end{aligned}$$

(96)

where

$$\begin{aligned} {\mathbf {\mathsf{{A}}}}_l&= \bigg \Vert \begin{array}{c} (I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} )^{-1} \bar{B}^l \\ I \end{array} \bigg \Vert ^2_{\tiny \begin{pmatrix} \bar{\mathsf{Q}}_l + (K_{l,1})^\top {C_U^l} K_{l,1} &{} -(K_{l,1})^\top {C_U^l} \\ - {C_U^l} K_{l,1} &{} {C_U^l} \end{pmatrix}}, \nonumber \\ {\mathbf {\mathsf{{c}}}}_l&= 2 \big ((I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} )^{-1} \bar{\mathsf{C}} \big )^\top \big ( \bar{\mathsf{Q}}_l + (K_{l,1})^\top {C_U^l} K_{l,1} \big ) \big ((I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} )^{-1} \bar{B}^l \big ) \nonumber \\&\quad - 2 \big ((I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} )^{-1} \bar{\mathsf{C}} \big )^\top (K_{l,1})^\top {C_U^l} - 2 (\bar{\upbeta }^l)^\top \bar{\mathsf{Q}}_l (I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} )^{-1} \bar{B}^l, \nonumber \\ {\mathbf {\mathsf{{d}}}}_l&= \big ((I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} )^{-1} \bar{\mathsf{C}} \big )^\top \big ( \bar{\mathsf{Q}}_l + (K_{l,1})^\top {C_U^l} K_{l,1} \big ) \big ((I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} )^{-1} \bar{\mathsf{C}} \big ) \nonumber \\&\quad - 2 (\bar{\upbeta }^l)^\top \bar{\mathsf{Q}}_l (I - \bar{\mathsf{A}}^l + \bar{B}^l K_{l,1} )^{-1} \bar{\mathsf{C}}. \end{aligned}$$

(97)

The matrix ${\mathbf {\mathsf{{A}}}}_l$ is symmetric positive definite. Proposition 3.3 of Fu et al. [13] proves smoothness and strong convexity of $J_l^2$ with coefficients $\varphi ^l_2$ and $\nu ^l$ such that

$$\begin{aligned} \nu ^l = \sigma _{\min }({\mathbf {\mathsf{{A}}}}_l), \varphi ^l_2 = \Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2 . \end{aligned}$$

(98)

Recall that the $K_{l,2}$ which minimizes $J_l^2$ is denoted by ${\bar{K}}^*_{l,2}$ and is given by

$$\begin{aligned} {\bar{K}}^*_{l,2} = - \frac{1}{2}{\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l \end{aligned}$$

which exists since ${\mathbf {\mathsf{{A}}}}_l > 0$. By completing the square we can write the cost as

$$\begin{aligned} J_l^2 (K_{l,2}) = (K_{l,2} - {\bar{K}}^*_{l,2})^\top {\mathbf {\mathsf{{A}}}}_l (K_{l,2} - {\bar{K}}^*_{l,2}) - \frac{1}{4} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + {\mathbf {\mathsf{{d}}}}_l \end{aligned}$$
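The completed-square form can be verified exactly on a small instance. The sketch below is an illustration only: the $2 \times 2$ matrix `A`, the vector `c` and the scalar `d` are hypothetical stand-ins for ${\mathbf {\mathsf{{A}}}}_l$, ${\mathbf {\mathsf{{c}}}}_l$ and ${\mathbf {\mathsf{{d}}}}_l$, the argument is treated as a plain 2-vector, and rational arithmetic is used so that the two expressions can be compared with exact equality.

```python
from fractions import Fraction as Fr

def matvec(M, v):
    """Product of a 2x2 matrix with a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    """Inner product of two 2-vectors."""
    return u[0] * v[0] + u[1] * v[1]

def inv2(M):
    """Inverse of an invertible 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def cost(A, c, d, x):
    """Quadratic cost x^T A x + c^T x + d, the form of (96)."""
    return dot(x, matvec(A, x)) + dot(c, x) + d

# Hypothetical data: A is symmetric positive definite.
A = [[Fr(2), Fr(1)], [Fr(1), Fr(3)]]
c = [Fr(-4), Fr(2)]
d = Fr(7)

A_inv_c = matvec(inv2(A), c)
x_star = [-v / 2 for v in A_inv_c]   # minimizer -(1/2) A^{-1} c
offset = d - dot(c, A_inv_c) / 4     # -(1/4) c^T A^{-1} c + d

def completed_square(x):
    """(x - x*)^T A (x - x*) - (1/4) c^T A^{-1} c + d."""
    dx = [x[0] - x_star[0], x[1] - x_star[1]]
    return dot(dx, matvec(A, dx)) + offset

test_points = [[Fr(0), Fr(0)], [Fr(1), Fr(-2)], [Fr(5, 3), Fr(1, 7)]]
checks = [cost(A, c, d, x) == completed_square(x) for x in test_points]
print(checks, cost(A, c, d, x_star) == offset)
```

The identity relies on `A` being symmetric; the value of the cost at the minimizer equals the constant offset, which is the quantity that remains after the quadratic term vanishes.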

As in the statement of the Lemma assume $K_{l,2} \in \mathcal {G}^0_l$ and $\Vert K'_{l,2} - K_{l,2} \Vert _2 \le \rho ^l_2$ where $\rho ^l_2$ satisfies

$$\begin{aligned} \rho ^l_2 = \sqrt{\frac{J_l^2 (K_{i},\bar{\mathsf{Z}})}{\Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2}} \end{aligned}$$

(99)

Then the cost of controller $K'_{l,2}$ is

$$\begin{aligned}&J_l^2 (K'_{l,2}) = (K'_{l,2} - {\bar{K}}^*_{l,2})^\top {\mathbf {\mathsf{{A}}}}_l (K'_{l,2} - {\bar{K}}^*_{l,2}) - \frac{1}{4} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + {\mathbf {\mathsf{{d}}}}_l \\&= (K'_{l,2} - K_{l,2} + K_{l,2} - {\bar{K}}^*_{l,2})^\top {\mathbf {\mathsf{{A}}}}_l (K'_{l,2} - K_{l,2} + K_{l,2} - {\bar{K}}^*_{l,2}) - \frac{1}{4} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + {\mathbf {\mathsf{{d}}}}_l \\&\le 2 (K'_{l,2} - K_{l,2})^\top {\mathbf {\mathsf{{A}}}}_l (K'_{l,2} - K_{l,2}) + 2 (K_{l,2} - {\bar{K}}^*_{l,2})^\top {\mathbf {\mathsf{{A}}}}_l (K_{l,2} - {\bar{K}}^*_{l,2}) \\& - \frac{1}{2} {\mathbf {\mathsf{{c}}}}^\top _l {\mathbf {\mathsf{{A}}}}^{-1}_l {\mathbf {\mathsf{{c}}}}_l + 2 {\mathbf {\mathsf{{d}}}}_l \\&= 2 \Vert K'_{l,2} - K_{l,2} \Vert ^2_{{\mathbf {\mathsf{{A}}}}_l} + 2 J_l^2 (K_{l,2}) \\&\le 2 \Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2 (\rho ^l_2)^2 + 8 J_l^2 (K_{i},\bar{\mathsf{Z}}) \le 10 J_l^2 (K_{i},\bar{\mathsf{Z}}) \end{aligned}$$

Hence, $K'_{l,2} \in \mathcal {G}^1_l$. Since $J_l^2$ is smooth with coefficient $\varphi ^l_2$, for any $K'_{l,2} \in \mathcal {G}^1_l$ we have

$$\begin{aligned} \Vert \nabla J_l^2 (K'_{l,2}) \Vert ^2_2 \le 2 \varphi ^l_2 (J_l^2(K'_{l,2}) - J_l^2 ({\bar{K}}^*_{l,2})) \le 20 \varphi ^l_2 J_l^2 (K_{i},\bar{\mathsf{Z}}) \end{aligned}$$

Hence, for $K'_{l,2} \in \mathcal {G}^1_l$, $J_l^2 (K'_{l,2})$ is Lipschitz with coefficient,

$$\begin{aligned} \lambda ^l_2 = \sqrt{20 \varphi ^l_2 J_l^2 (K_{i},\bar{\mathsf{Z}})} \end{aligned}$$

(100)

This concludes the proof. $\square $

Proof of Lemma 7

Proof

Due to certainty equivalence we instead consider the deterministic LQR problem with dynamics

$$\begin{aligned} \mathsf{X}^l_{t+1} = \bar{\mathsf{A}}^{l,(s)} \mathsf{X}^l_t + \bar{B}^l U^l_t, \text { where } \bar{\mathsf{A}}^{l,(s)} = \begin{bmatrix} {A^l} &{} 0 \\ 0 &{} \mathsf{F}^{(s)} \end{bmatrix}, \bar{B}^l = \begin{bmatrix} {B^l} \\ 0 \end{bmatrix} \end{aligned}$$

(101)

and cost

$$\begin{aligned} J_l(\phi ^l, \bar{\mathsf{Z}}^{(s)}) := \sum _{t=0}^{\infty }\big [ \big \Vert \mathsf{X}^l_t \big \Vert ^2_{\bar{\mathsf{Q}}_l} + \big \Vert U^l_t \big \Vert ^2_{C_U^l} \big ] \end{aligned}$$

(102)

Since for $s \in [S_1]$ the control offset $K^{(0)}_{l,2}$ and the mean-field drift are 0 and the class of controllers is restricted to linear controllers $\phi ^l (\mathsf{X}^l_t) = K_{l,1} \mathsf{X}^l_t$, the optimal controller $K_{l,1}$ for the stochastic drifted-LQR problem (15)-(16) is also optimal for the deterministic LQR problem shown above. This deterministic problem can be rewritten as a Linear Quadratic Tracking (LQT) problem with dynamics

$$\begin{aligned} Z^l_{t+1} = {A^l} Z^l_t + {B^l} U^l_t \end{aligned}$$

and cost

$$\begin{aligned} J_l(\phi ^l, \bar{\mathsf{Z}}^{(s)}) :=&\sum _{t=0}^{\infty } [ \Vert Z^l_t \Vert ^2_{Q^l} + \Vert U^l_t \Vert ^2_{C_U^l} + \sum _{k \in [L]} \Vert Z^l_t- \bar{Z}^{(k,s)}_t \Vert ^2_{{C_Z^{lk}}} ], \end{aligned}$$

where $\bar{\mathsf{Z}}^{(l,s)}$ is the mean-field trajectory of population l in the joint mean-field trajectory $\bar{\mathsf{Z}}^{(s)}$. This problem can be solved using the maximum principle approach, as shown in the proof of Proposition 1. From Eq. (31) we obtain

$$\begin{aligned} s^l_t = - \sum _{k \in [L]} {C_Z^{lk}} \bar{Z}^{(k,s)}_t + {H^l} s^l_{t+1} \end{aligned}$$

(103)

where ${H^l}$ is defined in (53), ${P^l}$ is the solution to the Riccati Eq. (51), and $\bar{\mathsf{Z}}^{(j,s)}$ represents the jth population’s mean-field trajectory in the joint mean-field trajectory $\bar{\mathsf{Z}}^{(s)}$. Using (101) we write down the closed-loop dynamics of generic agent l.

$$\begin{aligned}&Z^l_{t+1} = {H^l}^\top Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top s^l_{t+1}, \nonumber \\&= {H^l}^\top Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \sum _{k \in [L]} {C_Z^{lk}} \bar{\mathsf{Z}}^{(k,s)}_{t+i+1}, \\&= ({A^l} - {B^l} ({C_U^l} + {B^l}^\top {P^l} {B^l})^{-1} {B^l}^\top {P^l} {A^l}) Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z \bar{\mathsf{Z}}^{(s)}_{t+i+1} \nonumber \end{aligned}$$

(104)

where $\mathsf{C}^l_Z = (C^{l1}_Z,C^{l2}_Z,\ldots )$. Since $\bar{\mathsf{Z}}^{(s)}$ is assumed to follow linear dynamics $\bar{\mathsf{Z}}^{(s)}_{t+1} = \mathsf{F}^{(s)} \bar{\mathsf{Z}}^{(s)}_t$, this can be further simplified into,

$$\begin{aligned} Z^l_{t+1}&= ({A^l} - {B^l} ({C_U^l} + {B^l}^\top {P^l} {B^l})^{-1} {B^l}^\top {P^l} {A^l}) Z^l_t \nonumber \\&\quad - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\mathsf{F}^{(s)})^{i+1}\bar{\mathsf{Z}}^{(s)}_{t} \end{aligned}$$

This can be rewritten in terms of the controller $\bar{K}^{(s+1)}_{l,1}$,

$$\begin{aligned} Z^l_{t+1} = {A^l} Z^l_t - {B^l} \bar{K}^{(s+1)}_{l,1} \begin{bmatrix} Z^l_t \\ \bar{\mathsf{Z}}^{(s)}_t \end{bmatrix} \end{aligned}$$

(105)

where

$$\begin{aligned} \bar{K}^{(s+1)}_{l,1} = \begin{bmatrix} G^l {A^l}& (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\mathsf{F}^{(s)})^{i+1} \end{bmatrix} \end{aligned}$$

(106)

where $G^l = ({C_U^l} + {B^l}^\top {P^l} {B^l})^{-1} {B^l}^\top {P^l}$. We know that $\bar{K}^{(s+1)}_{l,1}$ exists since ${H^l}$ is Hurwitz. Now we simulate the behavior of infinitely many agents in population l under controller $\bar{K}^{(s+1)}_{l,1}$ using (104),

$$\begin{aligned} \bar{Z}^l_{t+1} = {H^l}^\top \bar{Z}^l_{t} - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\mathsf{F}^{(s)})^{i+1} \bar{\mathsf{Z}}_t \end{aligned}$$

(107)

Writing down the closed-loop dynamics for the joint mean-field trajectory we get,

$$\begin{aligned} \bar{\mathsf{Z}}_{t+1} = \bigg (\mathsf{H}^\top - \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i \mathsf{C}_Z(\mathsf{F}^{(s)})^{i+1} \bigg ) \bar{\mathsf{Z}}_t \end{aligned}$$

(108)

where,

$$\begin{aligned} \mathsf{H}= {{\,\mathrm{diag}\,}}(H^{1}, H^{2},\ldots ), \mathsf{E}= {{\,\mathrm{diag}\,}}(E^1,E^2,\ldots ), \mathsf{C}_Z= ( {\mathsf{C}^1_Z}^\top , {\mathsf{C}_Z^2}^\top ,\ldots )^\top \end{aligned}$$

(109)

Now we define a mean-field dynamics update operator $\mathbb {T}$ as follows

$$\begin{aligned} \mathbb {T}(M) = \mathsf{H}^\top + \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i \mathsf{C}_Z M^{i+1} \end{aligned}$$

(110)

and $\bar{\mathsf{F}}^{(s+1)} = \mathbb {T}(\mathsf{F}^{(s)})$. $\square $
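To make the structure of the update operator in (110) concrete, the following sketch evaluates a scalar analogue, following the sign convention of (110) as printed. Everything in it is an assumption made for illustration: `h` stands in for $\mathsf{H}$, `gain` for the product $\mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top $, `c_z` for $\mathsf{C}_Z$, and the numerical values are hypothetical. In the scalar case the infinite sum is a geometric series with closed form $c_z m / (1 - h m)$ whenever $|h m| < 1$, which the code compares against a truncated sum before iterating the map.

```python
def T_truncated(m, h, gain, c_z, terms=200):
    """Scalar analogue of (110) with the infinite sum truncated after `terms` terms."""
    series = sum(h ** i * c_z * m ** (i + 1) for i in range(terms))
    return h + gain * series

def T_closed_form(m, h, gain, c_z):
    """Geometric-series closed form of the same map, valid when |h * m| < 1."""
    return h + gain * c_z * m / (1.0 - h * m)

# Hypothetical scalar parameters, chosen only for illustration.
h, gain, c_z = 0.5, 0.2, 0.3

# Fixed-point iteration m <- T(m), the scalar counterpart of F^(s+1) = T(F^(s)).
m = 0.0
for _ in range(100):
    m = T_closed_form(m, h, gain, c_z)

truncation_gap = abs(T_truncated(m, h, gain, c_z) - T_closed_form(m, h, gain, c_z))
fixed_point_gap = abs(T_closed_form(m, h, gain, c_z) - m)
print(m, truncation_gap, fixed_point_gap)
```

For these values the iteration settles at a fixed point; whether the map contracts in general depends on the problem data and is not established by this example.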

Proof of Lemma 9

Proof

Using logic similar to the proof of Lemma 7, we arrive at the deterministic tracking control problem for agent l, where the dynamics of agent l are

$$\begin{aligned} Z^l_{t+1} = {A^l} Z^l_t + {B^l} U^l_t \end{aligned}$$

and the cost has constant tracking terms,

$$\begin{aligned} J_l(\phi ^l, \bar{\mathsf{Z}}^{(s)}) :=&\sum _{t=0}^{\infty } [ \Vert Z^l_t \Vert ^2_{Q^l} + \Vert U^l_t \Vert ^2_{C_U^l} + \sum _{k \in [L]} \Vert Z^l_t- \bar{Z}^{(k,s)}_t - \beta ^{lk} \Vert ^2_{{C_Z^{lk}}} ], \end{aligned}$$

where $\bar{\mathsf{Z}}^{(l,s)}$ is the mean-field trajectory of population l in the joint mean-field trajectory $\bar{\mathsf{Z}}^{(s)}$. This problem can be solved by using the maximum principle approach as shown in Proof of Proposition 1. From Eq. (31) we surmise,

$$\begin{aligned} s^l_t = - \sum _{k \in [L]} {C_Z^{lk}} (\bar{Z}^{(k,s)}_t + \beta ^{lk}) + {H^l} s^l_{t+1} \end{aligned}$$

(111)

where ${H^l}$ is defined in (53). As in the proof of Lemma 7, we write down the closed-loop dynamics of generic agent l.

$$\begin{aligned} Z^l_{t+1}&= {H^l}^\top Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top s^l_{t+1}, \\&= {H^l}^\top Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \sum _{k \in [L]} {C_Z^{lk}} (\bar{\mathsf{Z}}^{(k,s)}_{t+i+1} + \beta ^{lk}), \\&= {H^l}^\top Z^l_t - (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\bar{\mathsf{Z}}^{(s)}_{t+i+1} + \upbeta ^l) \end{aligned}$$

where $\mathsf{C}^l_Z = (C^{l1}_Z,C^{l2}_Z,\ldots )$ and $\upbeta ^l = (\beta ^{l1},\ldots , \beta ^{lL}) \in \mathbb {R}^{mL}$. Since $\bar{\mathsf{Z}}^{(s)}$ is assumed to follow affine dynamics $\bar{\mathsf{Z}}^{(s)}_{t+1} = \mathsf{F}\bar{\mathsf{Z}}^{(s)}_t + \mathsf{C}^{(s)}$, this can be further simplified into,

$$\begin{aligned} Z^l_{t+1}=&{H^l}^\top Z^l_t \nonumber \\&- (E^l)^{-1} {B^l} (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } (H^l)^i \mathsf{C}^l_Z (\mathsf{F}^{i+1}\bar{\mathsf{Z}}^{(s)}_{t} + (I - \mathsf{F}^i)(I-\mathsf{F})^{-1} \mathsf{C}^{(s)} + \upbeta ^l) \end{aligned}$$

This can be rewritten in terms of the controller $\bar{K}^{(s+1)}_{l,1}$ and $\bar{K}^{(s+1)}_{l,2}$,

$$\begin{aligned} Z^l_{t+1} = {A^l} Z^l_t - {B^l} \bar{K}^{(s+1)}_{l,1} \begin{bmatrix} Z^l_t \\ \bar{\mathsf{Z}}^{(s)}_t \end{bmatrix} - {B^l} \bar{K}^{(s+1)}_{l,2} \end{aligned}$$

(112)

where,

$$\begin{aligned} \bar{K}^{(s+1)}_{l,2} = (I - G^l {B^l}) (C_U^l)^{-1} {B^l}^\top \sum _{i=0}^{\infty } ({H^l})^{i} \mathsf{C}^l_Z ((I - \mathsf{F}^i)(I - \mathsf{F})^{-1} \mathsf{C}^{(s)} + \upbeta ^l) \end{aligned}$$

(113)

Simulating the behavior of infinitely many agents as in Lemma 7, we get the mean-field offset update operator $\varLambda $ defined as

$$\begin{aligned} \varLambda (\mathsf{C}^{(s)}) = \mathsf{E}^{-1} \mathsf{B}\mathsf{R}^{-1} \mathsf{B}^\top \sum _{i=0}^{\infty } \mathsf{H}^i [\mathsf{C}_Z(I - \mathsf{F}^i)(I - \mathsf{F})^{-1} \mathsf{C}^{(s)}] + {{\,\mathrm{diag}\,}}(\mathsf{C}^1_Z,\ldots ,\mathsf{C}^L_Z) \upbeta \end{aligned}$$

where $\upbeta = (\upbeta ^1, \ldots , \upbeta ^L) \in \mathbb {R}^{mLL}$. $\square $

Proof of Lemma 10

Proof

We define the global constants $\mu ^l,\rho ^l_1, \varphi ^l_1$ and $\lambda ^l_1$ for Lemma 1. We observe from Section A in Malik et al. [22] that these constants depend on norms of matrices $\Vert \bar{\mathsf{A}}^l \Vert _2$, which depend on the norm of mean-field trajectory dynamics matrix $\Vert \mathsf{F}^{(s)} \Vert _2$. Furthermore the constants also depend on the initial cost $J_l((K^{(1)}_{l,1},K^{(0)}_{l,2}),\bar{\mathsf{Z}}^{(s)})$. We start by obtaining a bound for $\Vert \mathsf{F}^{(s)} \Vert _2$. From (71) we observe

$$\begin{aligned} \Vert \mathsf{F}^{(s+1)} - \mathsf{F}^* \Vert _F \le \Vert \mathsf{B}\Vert _F \sum _{l \in [L]} \sigma _{\min }^{-1} (\bar{\varSigma }^{l}) \sigma _{\min }^{-1}({C_U^l}) \epsilon _1 + T_1 \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _F . \end{aligned}$$

This implies

$$\begin{aligned} \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _F \le \Vert \mathsf{F}^{(1)} - \mathsf{F}^* \Vert _F + \frac{1}{1-T_1} \Vert \mathsf{B}\Vert _F \sum _{l \in [L]} \sigma _{\min }^{-1} (\bar{\varSigma }^{l}) \sigma _{\min }^{-1}({C_U^l}) \epsilon _1 . \end{aligned}$$

Hence,

$$\begin{aligned} \Vert \mathsf{F}^{(s)} \Vert _2 \le&\Vert \mathsf{F}^* \Vert _2 + m(L+1) \Vert \mathsf{F}^{(1)} - \mathsf{F}^* \Vert _F \nonumber \\&+ \frac{m(L+1)}{1-T_1} \Vert \mathsf{B}\Vert _F \sum _{l \in [L]} \sigma _{\min }^{-1} (\bar{\varSigma }^{l}) \sigma _{\min }^{-1}({C_U^l}) \epsilon _1 =: \bar{F} \end{aligned}$$

(114)

Now that an upper bound on $\Vert \mathsf{F}^{(s)} \Vert _2$ has been defined in (114), we compute ${\mathbf {\mathsf{{J}}}}^1_l$ which is the upper bound on $J_l((K^{(1)}_{l,1},K^{(0)}_{l,2}),\bar{\mathsf{Z}}^{(s)})$ for $s \in [S_1]$. Under controller $(K^{(1)}_{l,1},K^{(0)}_{l,2})$ the dynamics of generic agent l and the mean-field trajectory dynamics are decoupled.

$$\begin{aligned} Z^l_{t+1}&= ({A^l} - {B^l} K^{(1,1)}_{l,1}) Z^l_t + W^l_t \\ \bar{\mathsf{Z}}^{(s)}_{t+1}&= \mathsf{F}^{(s)} \bar{\mathsf{Z}}^{(s)}_{t} + \omega _t \end{aligned}$$

The cost function for the generic agent l is

$$\begin{aligned}&J_l((K^{(1)}_{l,1},K^{(0)}_{l,2}),\bar{\mathsf{Z}}^{(s)}) = \lim _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}\bigg [ \bigg \Vert \begin{bmatrix} Z^l_t \\ \bar{\mathsf{Z}}_t \end{bmatrix} - \bar{\upbeta }^l \bigg \Vert ^2_{\bar{\mathsf{Q}}_l} + \Vert K^{(1,1)}_{l,1} Z^l_t \Vert ^2_{C_U^l} \bigg ] \\&= \lim _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}[\Vert Z^l_t \Vert ^2_{Q^l} + \Vert K^{(1,1)}_{l,1} Z^l_t \Vert ^2_{C_U^l} + \sum _{k \in [L]} \Vert Z^l_t - (\bar{Z}^k_t + \beta ^{lk}) \Vert ^2_{{C_Z^{lk}}}] \\&\le \lim _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}[ 2 \Vert Z^l_t \Vert ^2_{Q^l + \sum _{k \in [L]} {C_Z^{lk}}} + \Vert K^{(1,1)}_{l,1} Z^l_t \Vert ^2_{C_U^l} + 2\sum _{k \in [L]} \Vert \bar{Z}^k_t \Vert ^2_{{C_Z^{lk}}} ] + 2 \sum _{k \in [L]} \Vert \beta ^{lk} \Vert ^2_{{C_Z^{lk}}} \\&= \lim _{T \rightarrow \infty } \frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}[ 2 \Vert Z^l_t \Vert ^2_{Q^l + \sum _{k \in [L]} {C_Z^{lk}}} + \Vert K^{(1,1)}_{l,1} Z^l_t \Vert ^2_{C_U^l} + 2 \Vert \bar{\mathsf{Z}}_t \Vert ^2_{\bar{C}_l} ] + 2 \sum _{k \in [L]} \Vert \beta ^{lk} \Vert ^2_{{C_Z^{lk}}} \end{aligned}$$

where $\bar{C}_l = {{\,\mathrm{diag}\,}}(C^{l1}_Z,\ldots ,C^{lL}_Z)$. Using results in standard LQR analysis [21], this cost is given by

$$\begin{aligned} J_l((K^{(1)}_{l,1},K^{(0)}_{l,2}),\bar{\mathsf{Z}}^{(s)}) = {{\,\mathrm{Tr}\,}}(\bar{P}_l \varSigma ^{(i)}_w) + 2{{\,\mathrm{Tr}\,}}(\bar{P}^{(s)}_l \sigma ) + \sum _{k \in [L]} \Vert \beta ^{lk} \Vert ^2_{{C_Z^{lk}}} , \end{aligned}$$

where the matrices $\bar{P}_l$ and $\bar{P}^{(s)}_l$ are solutions to the Lyapunov equations,

$$\begin{aligned} \bar{P}_l&= 2(Q^l + \sum _{k \in [L]} {C_Z^{lk}}) + (K^{(1,1)}_{l,1})^\top {C_U^l} K^{(1,1)}_{l,1} \nonumber \\& + ({A^l} - {B^l} K^{(1,1)}_{l,1})^\top \bar{P}_l ({A^l} - {B^l} K^{(1,1)}_{l,1}), \nonumber \\ \bar{P}^{(s)}_l&= \bar{C}_l + (\mathsf{F}^{(s)})^\top \bar{P}^{(s)}_l \mathsf{F}^{(s)} . \end{aligned}$$

(115)

We upper bound ${{\,\mathrm{Tr}\,}}(\bar{P}^{(s)}_l \sigma )$ using Lemma 20 in Fazel et al. [12]. Toward that end, we first define matrix $\bar{P}^*_l$ as the solution to the Lyapunov equation

$$\begin{aligned} \bar{P}^*_l = \bar{C}_l + (\mathsf{F}^*)^\top \bar{P}^*_l \mathsf{F}^* \end{aligned}$$

(116)
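A discrete-time Lyapunov equation of the form (116) is solved by the series $\sum _{t \ge 0} ((\mathsf{F}^*)^\top )^t \bar{C}_l (\mathsf{F}^*)^t$ whenever the spectral radius of $\mathsf{F}^*$ is below one. The sketch below checks this on a small example; the $2 \times 2$ matrices `F` and `C` are hypothetical stand-ins for $\mathsf{F}^*$ and $\bar{C}_l$, and the series is truncated after a fixed number of terms.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*A)]

def add(A, B):
    """Entrywise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def lyapunov_series(F, C, terms=500):
    """Truncated series sum_t (F^T)^t C F^t, which approximates the solution of P = C + F^T P F."""
    Ft = transpose(F)
    P = [[0.0] * len(C) for _ in C]
    term = C
    for _ in range(terms):
        P = add(P, term)
        term = matmul(matmul(Ft, term), F)   # next term (F^T)^(t+1) C F^(t+1)
    return P

# Hypothetical data: F is upper triangular with eigenvalues 0.5 and 0.3, C is positive definite.
F = [[0.5, 0.2], [0.0, 0.3]]
C = [[1.0, 0.0], [0.0, 2.0]]

P = lyapunov_series(F, C)
rhs = add(C, matmul(matmul(transpose(F), P), F))
max_residual = max(abs(rhs[i][j] - P[i][j]) for i in range(2) for j in range(2))
print(P, max_residual)
```

The residual of the fixed-point equation is at the level of floating-point rounding, which is consistent with the series representation of the solution.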

We also introduce the following operators,

$$\begin{aligned}&\mathcal {T}^{(s)}(X) = \sum _{t=0}^{\infty } ((\mathsf{F}^{(s)})^\top )^t X (\mathsf{F}^{(s)})^t, \quad \mathcal {T}^*(X) = \sum _{t=0}^{\infty } ((\mathsf{F}^*)^\top )^t X (\mathsf{F}^*)^t, \nonumber \\&\mathcal {F}^{(s)}(X) = (\mathsf{F}^{(s)})^\top X \mathsf{F}^{(s)}, \quad \mathcal {F}^*(X) = (\mathsf{F}^*)^\top X \mathsf{F}^* \end{aligned}$$

(117)

where $\mathcal {T}^{(s)}(\bar{C}_l) = \bar{P}^{(s)}_l$ and $\mathcal {T}^*(\bar{C}_l) = \bar{P}^*_l$. Towards upper bounding ${{\,\mathrm{Tr}\,}}(\bar{P}^{(s)}_l \sigma )$ we first recognize

$$\begin{aligned} {{\,\mathrm{Tr}\,}}(\bar{P}^{(s)}_l \sigma ) = {{\,\mathrm{Tr}\,}}((\bar{P}^{(s)}_l - \bar{P}^*_l)\sigma ) + {{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma ) \le \Vert \bar{P}^{(s)}_l - \bar{P}^*_l \Vert _2 \Vert \sigma \Vert _2 + {{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma ) \end{aligned}$$

(118)

So we need to bound $\Vert \bar{P}^{(s)}_l - \bar{P}^*_l \Vert _2$ using Lemma 20 in Fazel et al. [12]. First we obtain a bound on $\Vert \mathcal {F}^{(s)} - \mathcal {F}^* \Vert _2$ which is similar to Lemma 19 of Fazel et al. [12], where $\Vert \cdot \Vert _2$ is the operator norm $\Vert \mathcal {F}\Vert _2 = \sup _{X} \frac{\Vert \mathcal {F}(X) \Vert _2}{\Vert X \Vert _2}$. Let us first define $\tilde{\mathsf{F}} = \mathsf{F}^{(s)} - \mathsf{F}^*$; then, for any matrix X,

$$\begin{aligned} \mathcal {F}^{(s)}(X) - \mathcal {F}^*(X) = (\mathsf{F}^*)^\top X \tilde{\mathsf{F}} + (\tilde{\mathsf{F}})^\top X \mathsf{F}^* + (\tilde{\mathsf{F}})^\top X \tilde{\mathsf{F}} \end{aligned}$$

Then, using the definition of operator norm $\Vert \cdot \Vert _2$ we get

$$\begin{aligned} \Vert \mathcal {F}^{(s)} - \mathcal {F}^* \Vert _2 \le 2 \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _2 \Vert \mathsf{F}^* \Vert _2 + \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert ^2_2 \end{aligned}$$

(119)

Now we obtain a bound on $\Vert \mathcal {T}^* \Vert $ using techniques similar to Lemma 17 of Fazel et al. [12]. Consider a unit norm vector v and unit spectral norm matrix X.

$$\begin{aligned}&v^\top \mathcal {T}^*(X) v = \sum _{t=0}^{\infty } v^\top ((\mathsf{F}^*)^\top )^t X (\mathsf{F}^*)^t v = \sum _{t=0}^{\infty } {{\,\mathrm{Tr}\,}}((\mathsf{F}^*)^t v v^\top ((\mathsf{F}^*)^\top )^t X) \nonumber \\&= \sum _{t=0}^{\infty } {{\,\mathrm{Tr}\,}}(\sigma ^{1/2}(\mathsf{F}^*)^t v v^\top ((\mathsf{F}^*)^\top )^t \sigma ^{1/2} \sigma ^{-1/2} X \sigma ^{-1/2}) \nonumber \\&\le \sum _{t=0}^{\infty } {{\,\mathrm{Tr}\,}}(\sigma ^{1/2}(\mathsf{F}^*)^t v v^\top ((\mathsf{F}^*)^\top )^t \sigma ^{1/2}) \Vert \sigma ^{-1/2} X \sigma ^{-1/2} \Vert _2 \nonumber \\&= \Vert \sigma ^{-1/2} X \sigma ^{-1/2} \Vert _2 (v^\top \mathcal {T}^*(\sigma ) v) \le \frac{\Vert \mathcal {T}^*(\sigma ) \Vert _2 }{\sigma _{\min }(\sigma )} \le \frac{{{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma )}{\sigma _{\min }(\sigma ) \sigma _{\min }(\bar{C}_l)} \end{aligned}$$

(120)

Hence, $\Vert \mathcal {T}^* \Vert _2 \le \frac{{{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma )}{\sigma _{\min }(\sigma ) \sigma _{\min }(\bar{C}_l)}$. Using (119)-(120), we get

$$\begin{aligned} \Vert \mathcal {T}^* \Vert _2 \Vert \mathcal {F}^{(s)} - \mathcal {F}^* \Vert _2 \le \frac{{{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma )}{\sigma _{\min }(\sigma ) \sigma _{\min }(\bar{C}_l)} (2 \Vert \mathsf{F}^* \Vert _2 + \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _2) \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _2 \end{aligned}$$

Since $\epsilon \le \frac{1}{\sqrt{m(L+1)}} \min _{l \in [L]}(1,c^l_{16})$, we have $\Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _2 \le \min _{l \in [L]}(1,c^l_{16})$, where

$$\begin{aligned} c^l_{16} = \frac{\sigma _{\min }(\sigma ) \sigma _{\min }(\bar{C}_l)}{2{{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma ) (2 \Vert \mathsf{F}^* \Vert _2 + 1)}, \end{aligned}$$

(121)

and therefore

$$\begin{aligned} \Vert \mathcal {T}^* \Vert _2 \Vert \mathcal {F}^{(s)} - \mathcal {F}^* \Vert _2 \le 1/2 \end{aligned}$$

This satisfies the conditions for Lemma 20 in Fazel et al. [12], so we obtain,

$$\begin{aligned} \Vert \bar{P}^{(s)}_l - \bar{P}^*_l \Vert _2 = \Vert \mathcal {T}^{(s)}(\bar{C}_l) - \mathcal {T}^*(\bar{C}_l) \Vert _2 \le c^l_{15} \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _2 \end{aligned}$$

(122)

where

$$\begin{aligned} c^l_{15} = \bigg (\frac{{{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma )}{\sigma _{\min }(\sigma ) \sigma _{\min }(\bar{C}_l)} \bigg )^2 (2 \Vert \mathsf{F}^* \Vert _2 + 1) \Vert \bar{C}_l \Vert _2 \end{aligned}$$

(123)

Hence, using (118) and (122),

$$\begin{aligned} {{\,\mathrm{Tr}\,}}(\bar{P}^{(s)}_l \sigma ) \le c^l_{15} \Vert \sigma \Vert _2 \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _2 + {{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma ) \end{aligned}$$

Now we can bound the cost $J_l((K^{(1)}_{l,1},K^{(0)}_{l,2}),\bar{\mathsf{Z}}^{(s)})$,

$$\begin{aligned}&J_l((K^{(1)}_{l,1},K^{(0)}_{l,2}),\bar{\mathsf{Z}}^{(s)}) \le {{\,\mathrm{Tr}\,}}(\bar{P}_l \varSigma ^{(i)}_w) + 2c^l_{15} \Vert \sigma \Vert _2 \Vert \mathsf{F}^{(s)} - \mathsf{F}^* \Vert _2 + 2 {{\,\mathrm{Tr}\,}}(\bar{P}^*_l \sigma ) \nonumber \\& + \sum _{k \in [L]} \Vert \beta ^{lk} \Vert ^2_{{C_Z^{lk}}} =: {\mathbf {\mathsf{{J}}}}^1_l , \end{aligned}$$

(124)

First we bound $\mu ^l$. Using Corollary 5 from [12],

$$\begin{aligned} \mu ^l \le \frac{ \Vert \varSigma _{\bar{K}_l^{(s)}} \Vert _2}{\sigma ^2_{\min }(\bar{\varSigma }^l) \sigma _{\min }(C^l_U)} \le \frac{J^1_l(\bar{K}^{(s)},\bar{\mathsf{Z}}^{(s)})}{\sigma ^2_{\min }(\bar{\varSigma }^l) \sigma _{\min }(C^l_U)\sigma _{\min }({\bar{\mathsf{Q}}}_l)} \le \frac{ {\mathbf {\mathsf{{J}}}}^1_l \sigma ^{-1}_{\min }({\bar{\mathsf{Q}}}_l) }{\sigma ^2_{\min }(\bar{\varSigma }^l) \sigma _{\min }(C^l_U)}, \end{aligned}$$

where we use Lemma 5.1 from [27] for the second inequality. Now using (114), (124) and Lemma 9 from Malik et al. [22] we define

$$\begin{aligned} c^l_0&= \frac{ \sqrt{\Vert {C_U^l} \Vert _2 + 10 \Vert \mathsf{B}\Vert _2^2 {\mathbf {\mathsf{{J}}}}^1_l} + 10\Vert {B^l} \Vert _2 (\Vert {A^l} \Vert _2 + \bar{F}) {\mathbf {\mathsf{{J}}}}^1_l }{\sigma _{\min }({C_U^l})}, \nonumber \\ c^l_1&= \max \bigg (\frac{10{\mathbf {\mathsf{{J}}}}^1_l}{\sigma _{\min }(\bar{\mathsf{Q}}_l)} \sqrt{\Vert {C_U^l} \Vert _2 + \Vert {B^l} \Vert _2^2 (10{\mathbf {\mathsf{{J}}}}^1_l)^2}, c^l_0 \bigg ), \nonumber \\ c^l_2&= 4 \bigg ( \frac{10 {\mathbf {\mathsf{{J}}}}^1_l}{\sigma _{\min }(\bar{\mathsf{Q}}_l)} \bigg )^2 \Vert \bar{\mathsf{Q}}_l \Vert _2 \Vert {B^l} \Vert _2 (\Vert {A^l} \Vert _2 + \bar{F} + \Vert {B^l} \Vert _2 c^l_1 + 1), \nonumber \\ c^l_3&= 8 \bigg ( \frac{10{\mathbf {\mathsf{{J}}}}^1_l}{\sigma _{\min }(\bar{\mathsf{Q}}_l)} \bigg )^2 (c^l_1)^2 \Vert {C_U^l} \Vert _2 \Vert {B^l} \Vert _2 (\Vert {A^l} \Vert _2 + \bar{F} + \Vert {B^l} \Vert _2 c^l_1 + 1), \nonumber \\ c^l_4&= 2 \bigg ( \frac{10{\mathbf {\mathsf{{J}}}}^1_l}{\sigma _{\min }(\bar{\mathsf{Q}}_l)} \bigg )^2 (c^l_1 + 1) \Vert {C_U^l} \Vert _2, c^l_5 = \sqrt{\Vert {C_U^l} \Vert _2 + \Vert {B^l} \Vert _2^2 (10{\mathbf {\mathsf{{J}}}}^1_l)^2}, \nonumber \\ c^l_6&= \Vert {C_U^l} \Vert _F + \Vert {B^l} \Vert ^2_F (c^l_1 + 1)(c^l_2 + c^l_3 + c^l_4) + 10\Vert {B^l} \Vert ^2_F {\mathbf {\mathsf{{J}}}}^1_l \nonumber \\& + \Vert {B^l} \Vert _F (\Vert {A^l} \Vert _2 + \bar{F})(c^l_2 + c^l_3 + c^l_4), \nonumber \\ c^l_7&= 50 c^l_6 \frac{{\mathbf {\mathsf{{J}}}}^1_l}{\sigma _{\min }(\bar{\mathsf{Q}}_l)} + 4 c^l_5 \bigg (\frac{10 {\mathbf {\mathsf{{J}}}}^1_l}{\sigma _{\min }(\bar{\mathsf{Q}}_l)}\bigg )^2 \Vert {B^l} \Vert _2 (\Vert {A^l} \Vert _2 + \bar{F} + \Vert {B^l} \Vert _2 c^l_1) + c^l_1, \nonumber \\ c^l_8&= \Vert \bar{\varSigma }^l \Vert _2 (c^l_2 + c^l_3 + c^l_4), \nonumber \\ c^l_9&= \min \bigg ( \frac{\sigma _{\min }(\bar{\mathsf{Q}}_l)}{40 {\mathbf {\mathsf{{J}}}}^1_l \Vert {B^l} \Vert _2 (\Vert {A^l} \Vert _2 + \bar{F} + \Vert {B^l} \Vert _2 c^l_1 + 1)},1 \bigg ) \nonumber \end{aligned}$$

The global constants for Lemma 1 can now be defined as

$$\begin{aligned} \mu ^l \le \frac{ {\mathbf {\mathsf{{J}}}}^1_l }{\sigma ^2_{\min }(\bar{\varSigma }^l) \sigma _{\min }(C^l_U) \sigma _{\min }({\bar{\mathsf{Q}}}_l)}, \rho ^l_1 = c^l_9, \varphi ^l_1 = c^l_7, \lambda ^l_1 = c^l_8 . \end{aligned}$$

(125)

Now we move on to defining the constants $\nu ^l, \varphi ^l_2, \rho ^l_2$ and $\lambda ^l_2$ for Lemma 2. First we find the upper bound for $\Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2$ in the definition of cost $J^2_l$ (96). From the definition (97),

$$\begin{aligned} \Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2&\le \big (\Vert (I - \bar{\mathsf{A}}^l + \bar{B}^l K^{(S_1)}_{l,1})^{-1} \Vert _2^2 \Vert {B^l} \Vert _2^2 + 1 \big ) \nonumber \\& \big (\Vert \bar{\mathsf{Q}}_l \Vert _2 + \Vert {C_U^l} \Vert _2 \Vert K^{(S_1)}_{l,1} \Vert _2^2 + \Vert {C_U^l} \Vert _2 \Vert K^{(S_1)}_{l,1} \Vert _2 + \Vert {C_U^l} \Vert _2 \big ) \end{aligned}$$

(126)

First let us observe that

$$\begin{aligned} I - \bar{\mathsf{A}}^l + \bar{B}^l K^{(S_1)}_{l,1}&= \begin{pmatrix} I &{} 0 \\ 0 &{} I \end{pmatrix} - \begin{pmatrix} {A^l} &{} 0 \\ 0 &{} \mathsf{F}^{(S_1)} \end{pmatrix} + \begin{pmatrix} {B^l} \\ 0 \end{pmatrix} \begin{pmatrix} K^{(1,S_1)}_{l,1}&K^{(2,S_1)}_{l,1} \end{pmatrix}, \\&= \begin{pmatrix} I - {A^l} + {B^l} K^{(1,S_1)}_{l,1} &{} {B^l} K^{(2,S_1)}_{l,1} \\ 0 &{} I - \mathsf{F}^{(S_1)} \end{pmatrix} \end{aligned}$$

and thus

$$\begin{aligned}&(I - \bar{\mathsf{A}}^l + \bar{B}^l K^{(S_1)}_{l,1})^{-1} = \\&\begin{pmatrix} (I - {A^l} + {B^l} K^{(1,S_1)}_{l,1})^{-1} &{} - (I - {A^l} + {B^l} K^{(1,S_1)}_{l,1})^{-1} {B^l} K^{(2,S_1)}_{l,1} (I - \mathsf{F}^{(S_1)})^{-1} \\ 0 &{} (I - \mathsf{F}^{(S_1)})^{-1} \end{pmatrix} \end{aligned}$$
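The inverse formula can be checked by direct multiplication. The following is a short verification sketch; the shorthand $X := I - {A^l} + {B^l} K^{(1,S_1)}_{l,1}$, $Y := {B^l} K^{(2,S_1)}_{l,1}$ and $W := I - \mathsf{F}^{(S_1)}$ is introduced here only for this check and is not part of the original notation, and $X$ and $W$ are assumed invertible.

$$\begin{aligned} \begin{pmatrix} X &{} Y \\ 0 &{} W \end{pmatrix} \begin{pmatrix} X^{-1} &{} - X^{-1} Y W^{-1} \\ 0 &{} W^{-1} \end{pmatrix} = \begin{pmatrix} X X^{-1} &{} - X X^{-1} Y W^{-1} + Y W^{-1} \\ 0 &{} W W^{-1} \end{pmatrix} = \begin{pmatrix} I &{} 0 \\ 0 &{} I \end{pmatrix} \end{aligned}$$

Since all blocks are square, a right inverse is also a left inverse, so the displayed matrix is indeed the inverse.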

As a result,

$$\begin{aligned}&\Vert (I - \bar{\mathsf{A}}^l + \bar{B}^l K^{(S_1)}_{l,1})^{-1} \Vert _2 \le \Vert (I - {A^l} + {B^l} K^{(1,S_1)}_{l,1})^{-1} \Vert _2 + \Vert (I - \mathsf{F}^{(S_1)})^{-1} \Vert _2 \nonumber \\& + \Vert (I - {A^l} + {B^l} K^{(1,S_1)}_{l,1})^{-1} \Vert _2 \Vert {B^l} K^{(2,S_1)}_{l,1} \Vert _2 \Vert (I - \mathsf{F}^{(S_1)})^{-1} \Vert _2 \end{aligned}$$

(127)
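One way to obtain (127) is sketched below. It uses the shorthand $X := I - {A^l} + {B^l} K^{(1,S_1)}_{l,1}$, $Y := {B^l} K^{(2,S_1)}_{l,1}$ and $W := I - \mathsf{F}^{(S_1)}$, which is introduced only for this remark. The block inverse is split into three matrices that each have a single nonzero block. The spectral norm of such a matrix equals the norm of its nonzero block, so the triangle inequality and submultiplicativity give

$$\begin{aligned} \bigg \Vert \begin{pmatrix} X^{-1} &{} - X^{-1} Y W^{-1} \\ 0 &{} W^{-1} \end{pmatrix} \bigg \Vert _2&\le \bigg \Vert \begin{pmatrix} X^{-1} &{} 0 \\ 0 &{} 0 \end{pmatrix} \bigg \Vert _2 + \bigg \Vert \begin{pmatrix} 0 &{} 0 \\ 0 &{} W^{-1} \end{pmatrix} \bigg \Vert _2 + \bigg \Vert \begin{pmatrix} 0 &{} X^{-1} Y W^{-1} \\ 0 &{} 0 \end{pmatrix} \bigg \Vert _2 \\&= \Vert X^{-1} \Vert _2 + \Vert W^{-1} \Vert _2 + \Vert X^{-1} Y W^{-1} \Vert _2 \\&\le \Vert X^{-1} \Vert _2 + \Vert W^{-1} \Vert _2 + \Vert X^{-1} \Vert _2 \Vert Y \Vert _2 \Vert W^{-1} \Vert _2 \end{aligned}$$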

It therefore remains to bound the quantities $ \Vert (I - {A^l} + {B^l} K^{(1,S_1)}_{l,1})^{-1} \Vert _2$ and $\Vert (I - \mathsf{F}^{(S_1)})^{-1} \Vert _2$. For the first,

$$\begin{aligned}&\Vert (I - {A^l} + {B^l} K^{(1,S_1)}_{l,1})^{-1} \Vert _2 \nonumber \\&= \Vert (I - {A^l} + {B^l} K^{*,1}_{l,1} + {B^l} (K^{(1,S_1)}_{l,1} - K^{*,1}_{l,1}))^{-1} \Vert _2 \nonumber \\&= \Vert (I + (I - {A^l} + {B^l} K^{*,1}_{l,1})^{-1} {B^l} (K^{(1,S_1)}_{l,1} - K^{*,1}_{l,1}))^{-1} (I - {A^l} + {B^l} K^{*,1}_{l,1})^{-1} \Vert _2 \nonumber \\&\le (1 -\rho ({A^l} - {B^l} K^{*,1}_{l,1}))^{-1}\Vert (I + (I - {A^l} + {B^l} K^{*,1}_{l,1})^{-1} {B^l} (K^{(1,S_1)}_{l,1} - K^{*,1}_{l,1}))^{-1} \Vert _2 \nonumber \\&\le (1 -\rho ({A^l} - {B^l} K^{*,1}_{l,1}))^{-1} (1 - \Vert (I - {A^l} + {B^l} K^{*,1}_{l,1})^{-1} {B^l} \Vert _2)^{-1} =: c^l_{10} \end{aligned}$$

(128)

where the last inequality is due to (75) and the fact that $\epsilon \le \frac{1}{\sqrt{m(L+1)}} \min _{l \in [L]}\Big (1,\frac{1}{D^2_l}\Big )$ which implies $\Vert \mathsf{F}^{(S_1)} - \mathsf{F}^* \Vert _2 \le 1$ and $\Vert K^{(S_1)}_{l,1} - K^*_{l,1} \Vert _2 \le 1$. Similarly,

$$\begin{aligned} \Vert (I - \mathsf{F}^{(S_1)})^{-1} \Vert _2&= \Vert (I - \mathsf{F}^* +( \mathsf{F}^* - \mathsf{F}^{(S_1)}))^{-1} \Vert _2 \nonumber \\&= \Vert ( I - (I - \mathsf{F}^*)^{-1}( \mathsf{F}^{(S_1)} - \mathsf{F}^*))^{-1} (I - \mathsf{F}^*)^{-1} \Vert _2 \nonumber \\&\le (1 - \rho (\mathsf{F}^*))^{-1} \Vert ( I - (I - \mathsf{F}^*)^{-1}( \mathsf{F}^{(S_1)} - \mathsf{F}^*))^{-1} \Vert _2 \nonumber \\&\le (1 - \rho (\mathsf{F}^*))^{-1} ( 1 - \Vert (I - \mathsf{F}^*)^{-1} \Vert _2)^{-1} =: c^l_{11} \end{aligned}$$

(129)
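The last step in each of (128) and (129) appears to rest on the standard Neumann-series bound. It is restated here for a generic square matrix $M$ with $\Vert M \Vert _2 < 1$; the symbol $M$ is introduced only for this remark.

$$\begin{aligned} (I - M)^{-1} = \sum _{k=0}^{\infty } M^k, \qquad \Vert (I - M)^{-1} \Vert _2 \le \sum _{k=0}^{\infty } \Vert M \Vert _2^k = \frac{1}{1 - \Vert M \Vert _2}. \end{aligned}$$

In (128) one would take $M = -(I - {A^l} + {B^l} K^{*,1}_{l,1})^{-1} {B^l} (K^{(1,S_1)}_{l,1} - K^{*,1}_{l,1})$, and in (129) $M = (I - \mathsf{F}^*)^{-1}(\mathsf{F}^{(S_1)} - \mathsf{F}^*)$. In both cases the perturbation has norm at most 1, so $\Vert M \Vert _2$ is bounded by the norm of the remaining factor. The resulting constants are finite only if that norm is strictly less than 1, which is taken here as an implicit assumption.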

Similarly, the following terms can be upper bounded:

$$\begin{aligned} \Vert {B^l} K^{(2,S_1)}_{l,1} \Vert _2&\le \Vert {B^l} K^{*,2}_{l,1} \Vert _2 + \Vert {B^l} \Vert _2 =: c^l_{12}, \end{aligned}$$

(130)

$$\begin{aligned} \Vert K^{(S_1)}_{l,1} \Vert _2&\le \Vert K^*_{l,1} \Vert _2 + 1 =: c^l_{13}. \end{aligned}$$

(131)
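Both (130) and (131) follow from the triangle inequality together with $\Vert K^{(S_1)}_{l,1} - K^*_{l,1} \Vert _2 \le 1$. A brief sketch of the steps is given below. It assumes, as the block notation suggests, that $K^{(2,S_1)}_{l,1} - K^{*,2}_{l,1}$ is a sub-block of $K^{(S_1)}_{l,1} - K^*_{l,1}$, so its spectral norm is no larger.

$$\begin{aligned} \Vert K^{(S_1)}_{l,1} \Vert _2&\le \Vert K^*_{l,1} \Vert _2 + \Vert K^{(S_1)}_{l,1} - K^*_{l,1} \Vert _2 \le \Vert K^*_{l,1} \Vert _2 + 1, \\ \Vert {B^l} K^{(2,S_1)}_{l,1} \Vert _2&\le \Vert {B^l} K^{*,2}_{l,1} \Vert _2 + \Vert {B^l} \Vert _2 \Vert K^{(2,S_1)}_{l,1} - K^{*,2}_{l,1} \Vert _2 \le \Vert {B^l} K^{*,2}_{l,1} \Vert _2 + \Vert {B^l} \Vert _2. \end{aligned}$$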

Using (126)-(131), we can bound

$$\begin{aligned} \Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2 \le \bar{{\mathbf {\mathsf{{A}}}}}_l \end{aligned}$$

(132)

where

$$\begin{aligned} \bar{{\mathbf {\mathsf{{A}}}}}_l := ((c^l_{14})^2 \Vert {B^l} \Vert _2^2 + 1)(\Vert \bar{\mathsf{Q}}_l \Vert _2 + \Vert {C_U^l} \Vert _2(1 + c^l_{13} + (c^l_{13})^2)), \end{aligned}$$

(133)

with

$$\begin{aligned} c^l_{14} := c^l_{10} + c^l_{11} + c^l_{10} c^l_{11} c^l_{12} \ge \Vert (I - \bar{\mathsf{A}}^l + \bar{B}^l K^{(S_1)}_{l,1})^{-1} \Vert _2. \end{aligned}$$

(134)

Now we move on to the bound on $J_l((K^{(S_1)}_{l,1},K^{(1)}_{l,2}),\bar{\mathsf{Z}}^{(s+S_1)})$. From the definition of the cost $J_l$,

$$\begin{aligned}&J_l((K^{(S_1)}_{l,1},K^{(1)}_{l,2}),\bar{\mathsf{Z}}^{(s+S_1)}) = J_l^1(K^{(S_1)}_{l,1},\bar{\mathsf{Z}}^{(s+S_1)}) \\& + J_l^2((K^{(S_1)}_{l,1},K^{(1)}_{l,2}),\bar{\mathsf{Z}}^{(s+S_1)}) + (\bar{\upbeta }^l)^\top \bar{\mathsf{Q}}_l \bar{\upbeta }^l \\&\le {\mathbf {\mathsf{{J}}}}^1_l + J_l^2((K^{(S_1)}_{l,1},K^{(1)}_{l,2}),\bar{\mathsf{Z}}^{(s+S_1)}) + (\bar{\upbeta }^l)^\top \bar{\mathsf{Q}}_l \bar{\upbeta }^l \end{aligned}$$

Hence, we need to bound $J_l^2((K^{(S_1)}_{l,1},K^{(1)}_{l,2}),\bar{\mathsf{Z}}^{(s+S_1)})$. Recall the definition (96)

$$\begin{aligned} J_l^2 (K_{l,2}) = K_{l,2}^\top {\mathbf {\mathsf{{A}}}}_l K_{l,2} + {\mathbf {\mathsf{{c}}}}^\top _l K_{l,2} + {\mathbf {\mathsf{{d}}}}_l \end{aligned}$$

where ${\mathbf {\mathsf{{A}}}}_l$, ${\mathbf {\mathsf{{c}}}}_l$ and ${\mathbf {\mathsf{{d}}}}_l$ are defined in (97). As shown in (132), $\Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2 \le \bar{{\mathbf {\mathsf{{A}}}}}_l$. Using the definition of ${\mathbf {\mathsf{{c}}}}_l$ in (97),

$$\begin{aligned} \Vert {\mathbf {\mathsf{{c}}}}_l \Vert _2 \le \,&2\Vert (I - \bar{\mathsf{A}}^l + \bar{B}^l K^{(S_1)}_{l,1})^{-1} \Vert ^2_2 (\Vert \bar{\mathsf{Q}}_l \Vert _2 + \Vert {C_U^l} \Vert _2 \Vert K^{(S_1)}_{l,1} \Vert ^2_2) \Vert \bar{B}^l \Vert _2 \Vert \bar{\mathsf{C}}^{(s)} \Vert _2 \nonumber \\&+ 2\Vert (I - \bar{\mathsf{A}}^l + \bar{B}^l K^{(S_1)}_{l,1})^{-1} \Vert _2 (\Vert {C_U^l} \Vert _2 \Vert K^{(S_1)}_{l,1} \Vert _2 + 2 \Vert (\bar{\upbeta }^l)^\top \bar{\mathsf{Q}}_l \Vert _2 \Vert \bar{B}^l \Vert _2) \nonumber \\ \le&\, 2 (c^l_{14})^2 (\Vert \bar{\mathsf{Q}}_l \Vert _2 + \Vert {C_U^l} \Vert _2 (2\Vert K^*_{l,1} \Vert ^2_2 + 2)) \Vert \bar{B}^l \Vert _2 \bar{C} \nonumber \\& + 2 c^l_{14} (\Vert {C_U^l} \Vert _2 (\Vert K^*_{l,1} \Vert _2 + 1)+ 2 \Vert (\bar{\upbeta }^l)^\top \bar{\mathsf{Q}}_l \Vert _2 \Vert \bar{B}^l \Vert _2) =: \bar{{\mathbf {\mathsf{{c}}}}}_l \end{aligned}$$

(135)

The last inequality is obtained using (134), (90) and the fact that $\Vert K^*_{l,1} - K^{(S_1)}_{l,1} \Vert _2 \le 1$. Similarly, using the definition of ${\mathbf {\mathsf{{d}}}}_l$, we obtain

$$\begin{aligned} \Vert {\mathbf {\mathsf{{d}}}}_l \Vert _2 \le (c^l_{14})^2 (\Vert \bar{\mathsf{Q}}_l \Vert _2 + \Vert {C_U^l} \Vert _2 (2\Vert K^*_{l,1} \Vert ^2_2 + 2) ) \bar{C}^2 + 2 c^l_{14} \Vert (\bar{\upbeta }^l)^\top \bar{\mathsf{Q}}_l \Vert _2 \bar{C}^2 =: \bar{{\mathbf {\mathsf{{d}}}}}_l \end{aligned}$$

(136)
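The factor $2\Vert K^*_{l,1} \Vert ^2_2 + 2$ appearing in (135) and (136) can be traced to an elementary inequality. As a short sketch, for any real $a \ge 0$ (a placeholder symbol used only here), $(a-1)^2 \ge 0$ gives $2a \le a^2 + 1$, and hence

$$\begin{aligned} (a + 1)^2 = a^2 + 2a + 1 \le 2a^2 + 2. \end{aligned}$$

Taking $a = \Vert K^*_{l,1} \Vert _2$ and combining with (131) yields

$$\begin{aligned} \Vert K^{(S_1)}_{l,1} \Vert ^2_2 \le (\Vert K^*_{l,1} \Vert _2 + 1)^2 \le 2\Vert K^*_{l,1} \Vert ^2_2 + 2. \end{aligned}$$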

Now we can bound $J_l^2((K^{(S_1)}_{l,1},K^{(1)}_{l,2}),\bar{\mathsf{Z}}^{(s+S_1)})$ as follows:

$$\begin{aligned} J_l^2((K^{(S_1)}_{l,1},K^{(1)}_{l,2}),\bar{\mathsf{Z}}^{(s+S_1)}) \le \bar{{\mathbf {\mathsf{{A}}}}}_l \Vert K^{(1)}_{l,2} \Vert ^2_2 + \bar{{\mathbf {\mathsf{{c}}}}}_l \Vert K^{(1)}_{l,2} \Vert _2 + \bar{{\mathbf {\mathsf{{d}}}}}_l =: {\mathbf {\mathsf{{J}}}}^2_l \end{aligned}$$

(137)
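The bound (137) can be read off term by term from the quadratic form (96). The sketch below treats $K_{l,2}$ as a vector, as in (96), and uses only the Cauchy-Schwarz inequality and the definition of the spectral norm before applying (132), (135) and (136):

$$\begin{aligned} J_l^2 (K_{l,2})&= K_{l,2}^\top {\mathbf {\mathsf{{A}}}}_l K_{l,2} + {\mathbf {\mathsf{{c}}}}^\top _l K_{l,2} + {\mathbf {\mathsf{{d}}}}_l \\&\le \Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2 \Vert K_{l,2} \Vert ^2_2 + \Vert {\mathbf {\mathsf{{c}}}}_l \Vert _2 \Vert K_{l,2} \Vert _2 + \Vert {\mathbf {\mathsf{{d}}}}_l \Vert _2 \\&\le \bar{{\mathbf {\mathsf{{A}}}}}_l \Vert K_{l,2} \Vert ^2_2 + \bar{{\mathbf {\mathsf{{c}}}}}_l \Vert K_{l,2} \Vert _2 + \bar{{\mathbf {\mathsf{{d}}}}}_l. \end{aligned}$$

Evaluating at $K_{l,2} = K^{(1)}_{l,2}$ gives the constant ${\mathbf {\mathsf{{J}}}}^2_l$.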

Hence,

$$\begin{aligned} \nu ^l = \bar{{\mathbf {\mathsf{{A}}}}}_l, \varphi ^l_2 = \bar{{\mathbf {\mathsf{{A}}}}}_l, \rho ^l_2 = \sqrt{\frac{4{\mathbf {\mathsf{{J}}}}^2_l}{\Vert {\mathbf {\mathsf{{A}}}}_l \Vert _2}}, \lambda ^l_2 = \sqrt{80 \varphi ^l_2 {\mathbf {\mathsf{{J}}}}^2_l} \end{aligned}$$

(138)

concluding the proof. $\square $

About this article

Cite this article

uz Zaman, M.A., Miehling, E. & Başar, T. Reinforcement Learning for Non-stationary Discrete-Time Linear–Quadratic Mean-Field Games in Multiple Populations. Dyn Games Appl 13, 118–164 (2023). https://doi.org/10.1007/s13235-022-00448-w

Accepted: 06 April 2022
Published: 10 May 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s13235-022-00448-w
