
On the Parallelization Upper Bound for Asynchronous Stochastic Gradients Descent in Non-convex Optimization

Journal of Optimization Theory and Applications

Abstract

In deep learning, asynchronous parallel stochastic gradient descent (APSGD) is a broadly used algorithm to speed up training. In asynchronous systems, the time delay of stale gradients is generally proportional to the total number of workers. When the number of workers is too large, the delayed gradients deviate further from the accurate gradient and may slow the convergence of the algorithm. One may ask: how many workers can be used at most while still achieving both convergence and speedup? In this paper, we consider the asynchronous training problem in the non-convex case. We theoretically study the problem of finding an approximate second-order stationary point using asynchronous algorithms in non-convex optimization and investigate the behaviors of APSGD near saddle points. This work gives the first theoretical guarantee for finding an approximate second-order stationary point with asynchronous algorithms, together with a provable upper bound on the time delay. The techniques we provide can be generalized to analyze other types of asynchronous algorithms and to understand their behaviors in distributed asynchronous parallel training.


Notes

  1. http://www.mpich.org/.

References

  1. Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Taylor, J., Zemel, R., Bartlett, P., Pereira, F.C.N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, Granada, vol. 24, pp. 873–881 (2011)

  2. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., Ng, A.Y.: Large scale distributed deep networks. In: Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, Lake Tahoe, vol. 25, pp. 1232–1240 (2012)

  3. Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(6), 165–202 (2012)


  4. Du, S., Lee, J.: On the power of over-parametrization in neural networks with quadratic activation. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, Stockholm, pp. 1329–1338 (2018)

  5. Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex SGD escaping from saddle points. arXiv preprint arXiv:1902.00247 (2019)

  6. Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points—online stochastic gradient for tensor decomposition. In: Grünwald, P., Hazan, E., Kale, S. (eds.) Proceedings of The 28th Conference on Learning Theory, Paris, pp. 797–842 (2015)

  7. Ge, R., Lee, J. D., Ma, T.: Matrix completion has no spurious local minimum. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, Barcelona, vol. 29, pp. 2973–2981 (2016)

  8. Golden, S.: Lower bounds for the Helmholtz function. Phys. Rev. 137(4B), B1127–B1128 (1965)


  9. Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., Jordan, M. I.: How to escape saddle points efficiently. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, Sydney, pp. 1724–1732 (2017)

  10. Jin, C., Netrapalli, P., Ge, R., Kakade, S.M., Jordan, M.I.: A short note on concentration inequalities for random vectors with subGaussian norm. arXiv:1902.03736 (2019)

  11. Jin, C., Netrapalli, P., Ge, R., Kakade, S.M., Jordan, M.I.: Stochastic gradient descent escapes saddle points efficiently. arXiv:1902.04811v1 (2019)

  12. Lieb, E.H.: Convex trace functions and the Wigner–Yanase–Dyson conjecture. Adv. Math. 11(3), 267–288 (1973)


  13. Lian, X., Huang, Y., Li, Y., Liu, J.: Asynchronous parallel stochastic gradient for nonconvex optimization. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, Montreal, vol. 28, pp. 2737–2745 (2015)

  14. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(10), 285–322 (2015)


  15. Mao, X.: Razumikhin-type theorems on exponential stability of neutral stochastic differential equations. Chin. Sci. Bull. 44(24), 2225–2228 (1999)


  16. Mao, X.: Razumikhin-type theorems on exponential stability of stochastic functional differential equations. Stoch. Process. Appl. 65(2), 233–250 (1996)


  17. Recht, B., Re, C., Wright, S.J., Niu, F.: Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Taylor, J., Zemel, R., Bartlett, P., Pereira, F.C.N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, Granada, vol. 24, pp. 693–701 (2011)

  18. Sutter, D., Berta, M., Tomamichel, M.: Multivariate trace inequalities. Commun. Math. Phys. 352(1), 1–22 (2017)


  19. Yun, H., Yu, H., Hsieh, C., Vishwanathan, S.V.N., Dhillon, I.S.: NOMAD: non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. Very Large Data Bases 7(11), 975–986 (2014)



Author information

Correspondence to Bo Shen.

Additional information

Communicated by Gabriel Peyré.


Appendices

Appendix: Detailed Proof

Some Useful Lemmas

In this section, we list the lemmas and definitions used in this paper.

Definition A.1

We define a zero-mean nSG(\(\sigma _i\)) sequence as a sequence of random vectors \(X_1,X_2,\ldots ,X_n\in {\mathbb {R}}^d\) with filtrations \(F_i=\sigma (X_1,X_2,\ldots ,X_i)\) such that

$$\begin{aligned} {\mathbb {E}}[X_i|F_{i-1}]=0,\ {\mathbb {E}} [e^{s\Vert X_i\Vert }|F_{i-1}]\le e^{4s^2\sigma _i^2},\quad \sigma _i\in F_{i-1}. \end{aligned}$$

For a zero-mean nSG(\(\sigma _i\)) sequence, we have some important lemmas from Jin et al. [10]. As in [10], for a zero-mean nSG sequence \(X_i\), let \({\varvec{Y}}_i=\begin{bmatrix} 0 &{} X^T_i \\ X_i &{} 0 \end{bmatrix}\) and set \(c=4\); then we have:

Lemma A.1

(Lemma 6 in [10]) Supposing \({\mathbb {E}}tr\{e^{\sum _i\theta \varvec{Y}_i}\}\le e^{\sum \theta ^2\sigma ^2_i}(d+1)\) for some \(\theta \in {\mathbb {R}}^+\), with probability at least \(1-2(d+1)e^{-\iota }\):

$$\begin{aligned} \left\| \sum _i X_i\right\| \le c\theta \sum _i^n\sigma ^2_i+\frac{1}{\theta }\iota . \end{aligned}$$

Lemma A.2

(Sub-Gaussian Hoeffding inequality, Lemma 6 in [10]) With probability at least \(1-2(d+1)e^{-\iota }\):

$$\begin{aligned} \left\| \sum _i^nX_i\right\| \le c\sqrt{\sum _i^n\sigma _i^2\cdot \iota }. \end{aligned}$$

The proof is based on Chernoff-bound arguments and the fact that the eigenvalues of \({\varvec{Y}}_i\) are \(0\), \(\Vert X_i\Vert \), and \(-\Vert X_i\Vert \).
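As a quick sanity check (our illustration, not part of the proof), this spectral fact about the dilation matrix can be verified numerically; the sketch below assumes only NumPy.

    import numpy as np

    # Verify that the dilation Y = [[0, X^T], [X, 0]] of a vector X in R^d
    # has eigenvalues +||X||, -||X||, and 0 with multiplicity d-1.
    rng = np.random.default_rng(0)
    d = 5
    X = rng.normal(size=d)

    Y = np.zeros((d + 1, d + 1))
    Y[0, 1:] = X  # top-right block X^T
    Y[1:, 0] = X  # bottom-left block X

    eigs = np.sort(np.linalg.eigvalsh(Y))
    assert np.isclose(eigs[0], -np.linalg.norm(X))
    assert np.isclose(eigs[-1], np.linalg.norm(X))
    assert np.allclose(eigs[1:-1], 0.0)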

In the same way, it is easy to prove the following square-sum lemma:

Lemma A.3

(Lemma 29 in [11]) For a zero-mean nSG(\(\sigma _i\)) sequence \(X_i\) with \(\sigma _i=\sigma \), with probability at least \(1-e^{-\iota }\):

$$\begin{aligned} \sum _i\Vert X_i\Vert ^2\le c\sigma ^2(n+\iota ). \end{aligned}$$

Lemma A.4

Let \({\varvec{X}}\) be a \(\sigma ^2\)-sub-Gaussian random vector. Then we have,

$$\begin{aligned} {\mathbb {E}} e^{\theta ^2\Vert {\varvec{X}}\Vert ^2}\le e^{65c\sigma ^2\theta ^2}, \end{aligned}$$
(18)

if \(\theta ^2 \le \frac{1}{16c\sigma ^2}\).

Note that \(\Vert {\varvec{X}}\Vert ^2\) is a \(16c \sigma ^2\) sub-exponential random variable. This lemma can be proved by a direct calculation:

$$\begin{aligned} {\mathbb {E}} e^{\theta ^2\Vert {\varvec{X}}\Vert ^2}\le e^{c\theta ^2\sigma ^2}(1+128c^2\theta ^4\sigma ^4)\le e^{65c\theta ^2\sigma ^2}. \end{aligned}$$
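To spell out the last inequality (our reading of the calculation): since \(\theta ^2\le \frac{1}{16c\sigma ^2}\) and \(1+x\le e^x\),

$$\begin{aligned} e^{c\theta ^2\sigma ^2}(1+128c^2\theta ^4\sigma ^4)\le e^{c\theta ^2\sigma ^2}e^{128c^2\theta ^4\sigma ^4}\le e^{c\theta ^2\sigma ^2}e^{8c\theta ^2\sigma ^2}\le e^{65c\theta ^2\sigma ^2}, \end{aligned}$$

where the middle step uses \(128c^2\theta ^4\sigma ^4\le 128c^2\theta ^2\sigma ^2\cdot \frac{1}{16c}=8c\theta ^2\sigma ^2\).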

The following lemma is very useful in proofs using the Chernoff bound:

Lemma A.5

Let \({\varvec{Y}}_i\) be random matrices such that \({\mathbb {E}}\{{\varvec{Y}}_i\}=0\) and \({\mathbb {E}}tr \{ e^{\theta {\varvec{Y}}_i}\} \le e^{c\theta ^2 \sigma _i^2}(d+1)\). Then we have \({\mathbb {E}}tr\{ e^{\theta \sum _i {\varvec{Y}}_i}\} < e^{c\theta ^2 (\sum _i\sigma _i)^2} (d+1)\).

The proof is given in Sect. I.

In the rest of the appendix, the parameters we use are listed below:

$$\begin{aligned}{} & {} \eta =\frac{\epsilon ^2}{w\sigma ^2L}\ ,\quad \ r=s \ ,\quad \ f=(T+1)M\eta \sqrt{\rho \epsilon }/2 , \nonumber \\{} & {} T_{\max }=T+\frac{u e^{f}}{M\eta \sqrt{\rho \epsilon }/2} \ ,\quad \ B\le {\tilde{{\mathcal {O}}}}(1),\nonumber \\{} & {} F=60c\sigma ^2 \eta LT \le T\epsilon ^2 , \quad F_2=T_{\max }\eta L\sigma ^2\ ,\nonumber \\{} & {} S=B\sqrt{L\eta MT_{\max }}\eta \sqrt{M}\sqrt{T_{\max }}\sigma \ , \quad w\le {\tilde{{\mathcal {O}}}}(1),\nonumber \\{} & {} b=\log (2(d+1)) +\log 2\ , \quad \ \frac{2\sqrt{48c}+2b }{C}=\frac{1}{2}\ , \quad \nonumber \\{} & {} p=\frac{1}{1+C} \ , \quad c_2=\log 96 +\log (d+1). \end{aligned}$$
(19)

And we have the following claim which will be proven later:

Lemma A.6

\(\eta \sim \epsilon ^2, T_{\max }\sim \frac{\epsilon ^{-2.5}}{M}, F\sim T\epsilon ^2, F_2\sim T_{\max }\epsilon ^{2}, S\sim M\epsilon ^{0.5},w \le {\widetilde{O}}(1), u \le {\widetilde{{\mathcal {O}}}}(1) \) such that the following conditions are satisfied.

  1. (a)

    \(\eta \le \frac{1}{3ML(T+1)}\) and \(2\eta ^2M^2L^2T^3\le 1/5\);

  2. (b)

    \(\sqrt{3\times 65}(M T_{\max } \eta \rho S+ \sqrt{T_{\max }M} \eta \ell ) \le p\);

  3. (c)

\(\frac{S^2-3\eta ^2M\sigma ^2 T_{\max }c^2c_2}{3\eta ^2M^2T_{\max }}-2L^2\eta ^2T^3M^2 F- 2cT_{\max }\cdot 2L^2M\eta ^2T\sigma ^2 \ge 2T F_2\);

  4. (d)

    Let \(q=M\eta \sqrt{\rho \epsilon }/2e^{-(T+1)M\eta \sqrt{\rho \epsilon }/2}\). We have \(\frac{2^{u }\sqrt{M}\eta r}{6\sqrt{3}\sqrt{2q d}}\ge 2S\);

  5. (e)

    \(e^{-T_{\max }+\log T+\log T_{\max }}\le 1/48\).

Detailed Proof of Theorem 1

Theorem 1 is a direct corollary of the following lemma.

Lemma B.7

For a large enough \(\iota \), let

$$\begin{aligned} K=\max \left\{ 100\iota T\frac{f(x_0)-f(x_*)}{M\eta F},100\iota T_{\max }\frac{f(x_0)-f(x_*)}{M\eta F_2}\right\} . \end{aligned}$$

With probability at least \(1-3e^{-\iota }\), we have:

  1. (1)

There are at most \(\lceil K/8T \rceil \) intervals of the first kind.

  2. (2)

    There are at most \(\lceil K/8T \rceil \) intervals of the second kind.

So at least \(\lfloor K/4T \rfloor \) intervals are of the third kind.

(1) is from the following theorem, which is just a variant of Theorem 1 in [13].

Theorem B.1

Supposing \(\eta ^2\left( \frac{3 L}{4}-L^2M T^2\eta \right) -\frac{\eta }{2M}<0\), with probability at least \(1-3e^{-\iota }\), we have:

$$\begin{aligned} f(x_{t_0+\tau +1})-f(x_{t_0})\le & {} \sum _{k=t_0}^{t_0+\tau } -\frac{3M\eta }{8}\Vert \nabla f(x_k)\Vert ^2 + c\eta \sigma ^2\iota +2\eta ^2LM^2 c\sigma ^2(\tau +1+\iota )\nonumber \\{} & {} +\,L^2T^2M\eta ^3 \sum _{k=t_0-T}^{t_0-1} \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2. \end{aligned}$$
(20)

Then

$$\begin{aligned} \begin{aligned} f(x_{\tau +1})-f(x_{0})&\le \sum _{k=0}^{\tau } -\frac{3M\eta }{8}\Vert \nabla f(x_k)\Vert ^2 + c\eta \sigma ^2\iota +2\eta ^2LM^2 c\sigma ^2(\tau +1+\iota ). \end{aligned} \end{aligned}$$
(21)

For (2), consider a point \(x\) with \(\Vert \nabla f(x)\Vert \approx 0\) and \(\lambda _{\min }(\nabla ^2 f(x))\le -\gamma \) for some \(\gamma >0\). We have:

Theorem B.2

Supposing \(\eta L MT\le 1/3\), given a point \(x_k\), let \(\varvec{H}=\nabla ^2 f(x_k)\), and \(e_1\) be the minimum eigendirection of \(\varvec{H}\), \(\gamma = -\lambda _{\min }(\varvec{H})\ge \sqrt{\rho \epsilon }/2\) and \(\sum _{t=k-2T}^{k-1}\Vert \nabla f(x_t)\Vert ^2\le F\). We have, with probability at least 1/24,

$$\begin{aligned} \sum _{t=k}^{k+T_{\max }-1}\Vert \nabla f(x(t))\Vert ^2\ge F_2=T_{\max }\eta L\sigma ^2. \end{aligned}$$

Now, let \(z_i\) be the stopping times such that

$$\begin{aligned}{} & {} z_1=\inf \{j| S_j \text { is of the second kind}\},\nonumber \\{} & {} z_i=\inf \{j| T_{\max }/2T\le j-z_{i-1} \text { and } S_j \text { is of the second kind}\}. \end{aligned}$$
(22)

Let \(N = \max \{i\mid 2T \cdot z_i+T_{\max } \le K\}\). Note that for \(X_i=\sum _{k=Z_{z_i}}^{Z_{z_i}+T_{\max }-1} \Vert \nabla f(x_k)\Vert ^2\), \({\mathbb {E}} X_i\ge \frac{1}{24}F_2\) by Theorem B.2, and \(\sum _i^N X_i\) is a submartingale. Using Azuma's inequality, (2) follows.

Remark B.1

When we set \(t_0=0\) and \(\tau =K\), this theorem shows that when \(\eta ^2\left( \frac{3 L}{4}-L^2M T^2\eta \right) -\frac{\eta }{2M}<0\), we have \(f(x_K)-f(x_0)\le \sum _{k=0}^{K} -\frac{3M\eta }{8}\Vert \nabla f(x_k)\Vert ^2+ \text {some constants}\). Thus, as in the synchronous case, we only need to show that \(\sum _k\Vert \nabla f(x_k)\Vert ^2\) is large. However, when \(t_0>0\), the “memory effect” term \(L^2T^2M\eta ^3 \sum _{k=t_0-T}^{t_0-1} \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\) is important, and there is no guarantee that \(f(x_k)\) will keep decreasing as \(k\) increases. This observation is crucial in the analysis of the behaviors near saddle points.

Proof of Theorem B.2

Theorem B.2 is the key theorem for the \(\epsilon \)-second-order convergence. In this section, we study the behaviors of APSGD near a saddle point and prove Theorem B.2. The main idea is to exploit the exponential instability of APSGD near a saddle point and to use an inequality to lower-bound \(\sum _{t=k}^{k+T_{\max }}\Vert \nabla f(x_t)\Vert ^2\).

1.1 Descent Inequality

The behavior of asynchronous gradient descent is quite different from the synchronous case in [6, 9]; it can be described by the following inequality:

Lemma C.8

Supposing \(\eta ^2\left( \frac{3 L}{4}-L^2M T^2\eta \right) -\frac{\eta }{2M}<0\) ,

$$\begin{aligned}{} & {} \sum _{k=t_0}^{t-1+t_0}(1+2L^2\eta ^2M^2T^3) \Vert \nabla f(x_{k})\Vert ^2 \ge \frac{\Vert x_{t_0+t}-x_{t_0}\Vert ^2-3\eta ^2\left\| \sum _m \sum _{i=t_0}^{t_0+t-1}\zeta _{i,m} \right\| ^2}{3\eta ^2M^2t}\nonumber \\{} & {} \quad -\,\sum _{k=t_0-2T}^{t_0-1}2L^2M^2\eta ^2T^3\Vert \nabla f(x_{k})\Vert ^2-\,\sum _{k=t_0}^{t-1+t_0} 2L^2\eta ^2\left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M\zeta _{j,m}\right\| ^2, \end{aligned}$$
(23)

where \(0\le \tau ^{\max }_{k}\le T\) is a random variable.

Proof of Lemma C.8

$$\begin{aligned}&\Vert x_{t_0+t}-x_{t_0} \Vert ^2-3\eta ^2\left\| \sum _m \sum _{i=t_0}^{t-1}\zeta _{i,m}\right\| ^2\nonumber \\&\quad =\eta ^2\left\| \sum _{k=t_0}^{t-1+t_0}\sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})+\sum _m \sum _{i=t_0}^{t-1}\zeta _{i,m}\right\| ^2 -3\eta ^2\left\| \sum _m \sum _{i=t_0}^{t-1}\zeta _{i,m}\right\| ^2\nonumber \\&\quad \le 3\eta ^2\sum _{k=t_0}^{t-1+t_0} t \left[ \left\| M\nabla f(x_k)\right\| ^2+\underbrace{\left\| M\nabla f(x_k)- \sum _{m=1}^M \nabla f(x_{k-\tau _{k,m}})\right\| ^2}_{T_1}\right] \nonumber \\&\quad \overset{(a)}{\le }\ 3\eta ^2 \sum _{k=t_0}^{t-1+t_0} M^2t\Vert \nabla f(x_{k})\Vert ^2 +3\eta ^2t\sum _{k=t_0}^{t-1+t_0}M^22L^2\eta ^2 \left[ \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M\zeta _{j,m}\right\| ^2\right. \nonumber \\&\left. \qquad +\,T\left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2\right] \nonumber \\&\quad \le 3\eta ^2 \sum _{k=t_0}^{t-1+t_0} M^2t\Vert \nabla f(x_{k})\Vert ^2 +3\eta ^2t\sum _{k=t_0}^{t-1+t_0}\sum _{j=k-T}^{k-1} M^22L^2\eta ^2T \left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2\nonumber \\&\qquad +3\eta ^2t\sum _{k=t_0}^{t-1+t_0}\sum _{j=k-T}^{k-1} M^22L^2\eta ^2\left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M \zeta _{j,m}\right\| ^2\nonumber \\&\quad \le 3\eta ^2 \sum _{k=t_0}^{t-1+t_0} M^2t\Vert \nabla f(x_{k})\Vert ^2 +3\eta ^2t\sum _{k=t_0-T}^{t-1+t_0} M^22L^2\eta ^2T^2 \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\nonumber \\&\qquad +3\eta ^2t\sum _{k=t_0}^{t-1+t_0} M^22L^2\eta ^2\left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M\zeta _{j,m}\right\| ^2\nonumber \\&\quad \overset{(b)}{\le }\ 3\eta ^2 \sum _{k=t_0}^{t-1+t_0} M^2t\Vert \nabla f(x_{k})\Vert ^2 +3\eta ^2t\sum _{k=t_0-2T}^{t-1+t_0} M^22L^2\eta ^2T^3 M^2\Vert \nabla f(x_{k})\Vert ^2\nonumber \\&\qquad +3\eta ^2t\sum _{k=t_0}^{t-1+t_0} M^22L^2\eta ^2\left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M\zeta _{j,m}\right\| ^2. \end{aligned}$$
(24)

In (a), we use the estimation for \(T_1\) in [13] (see also Sect. H), which says: \(T_1\le 2L^2\left( \Vert \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\eta \sum _{m=1}^M \zeta _{j,m}\Vert ^2 + T\sum _{j=k-\tau ^{\max }_{k}}^{k-1} \eta ^2\Vert \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\Vert ^2\right) ,\) and (b) is from \(\tau _{k,m}\le T\), such that

$$\begin{aligned} \sum _{k=t_0-T}^{t-1+t_0}\left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\le \sum _{k=t_0-2T}^{t-1+t_0}TM^2\Vert \nabla f(x_{k})\Vert ^2. \end{aligned}$$

\(\square \)

If \(\sum _{k=t_0-2T}^{t_0-1}\Vert \nabla f(x_{k})\Vert ^2\) is very large, then in the worst case, even if \(\max _{t\le T_{\max }}\Vert x_{k+t}-x_k\Vert \) is large enough, \(\sum _{k=t_0}^{t_0+2T}\Vert \nabla f(x_{k})\Vert ^2\) can still be very small. This is because there is no guarantee that asynchronous gradient descent decreases the function value, so the algorithm may eventually return to a point near the saddle point. However, if \(\Vert \nabla f(x_{k})\Vert ^2\) stays small for a long enough time (\(>2T\)), so that \(\sum _{k=t_0-2T}^{t_0-1}\Vert \nabla f(x_{k})\Vert ^2\le F\), then by Lemma C.8, \(\max _{t\le T_{\max }}\Vert x_{k+t}-x_k\Vert \) being large implies \(\sum _{k=t_0}^{t_0+T_{\max }}\Vert \nabla f(x_{k})\Vert ^2\ge F_2\). Thus there is a direct corollary:

Lemma C.9

There is a parameter \(S\sim \sqrt{L\eta MT_{\max }}\cdot \eta \sqrt{M}\sqrt{T_{\max }}\sigma \) such that, supposing \(\eta L MT\le 1/3\), if \(\sum _{t=k-2T}^{k-1}\Vert \nabla f(x_t)\Vert ^2\le F \), we have,

$$\begin{aligned} \begin{aligned} {\mathbb {P}}\left( \sum _{t=k}^{k+T_{\max }-1}\Vert \nabla f(x_t)\Vert ^2 \ge F_2,\ { or \ }\forall t\le T_{\max }, \Vert x_{k+t}-x_k\Vert ^2\le S^2\right) \ge 1- 1/24. \end{aligned}\nonumber \\ \end{aligned}$$
(25)

Proof of Lemma C.9

Suppose there is \(\tau \le T_{\max }\), such that \(\Vert x(k+\tau )-x(k)\Vert ^2\ge S^2\). With probability at least \(1-TT_{\max }e^{-T_{\max }}-\frac{1}{48}\), we have,

$$\begin{aligned}&\sum _{k=t_0}^{T_{\max }-1+t_0}(1+2L^2\eta ^2M^2T^3) \Vert \nabla f(x_{k})\Vert ^2\nonumber \\&\quad \ge \sum _{k=t_0}^{\tau -1+t_0}(1+2L^2\eta ^2M^2T^3) \Vert \nabla f(x_{k})\Vert ^2 \nonumber \\&\quad \ge \frac{\Vert x_{t_0+\tau -1}-x_{t_0}\Vert ^2-3\eta ^2\left\| \sum _m \sum _{i=t_0}^{t_0+\tau -1}\zeta _{i,m}\right\| ^2}{3\eta ^2M^2T_{\max }}\nonumber \\&\qquad -\sum _{k=t_0-2T}^{t_0-1}2L^2\eta ^2T^3\left\| \sum _{m=1}^M\nabla f(x_{k})\right\| ^2-\sum _{k=t_0}^{T_{\max }-1+t_0} 2L^2\eta ^2\left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M\zeta _{j,m}\right\| ^2\nonumber \\&\quad \overset{(a)}{\ge }\ \frac{S^2-3\eta ^2M\sigma ^2 T_{\max }c^2c_2}{3\eta ^2M^2T_{\max }}-2L^2\eta ^2T^3M^2 F- 2cT_{\max }\cdot 2L^2M\eta ^2T\sigma ^2\nonumber \\&\quad \ge 2T F_2. \end{aligned}$$
(26)

Since \(2L^2\eta ^2M^2T^3= 2L^2\eta ^2M^2T^2\cdot T \le T\), so that \(1+2L^2\eta ^2M^2T^3\le 2T\), we have,

$$\begin{aligned} \sum _{k=t_0}^{T_{\max }-1+t_0} \Vert \nabla f(x_{k})\Vert ^2 \ge F_2. \end{aligned}$$
(27)

In (a) we use Lemma A.3, and with probability at least \(1-TT_{\max }e^{-T_{\max }}\),

$$\begin{aligned} \sum _{k=t_0}^{T_{\max }-1+t_0} 2L^2\eta ^2\left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M\zeta _{j,m}\right\| ^2\le c\sigma ^2\cdot 2T_{\max }\cdot 2L^2M\eta ^2T\le c\sigma ^2. \end{aligned}$$

With probability at least \(1-1/48\), we have,

$$\begin{aligned} \left\| \sum _m \sum _{i=t_0}^{t_0+\tau -1}\zeta _{i,m}\right\| ^2\le c^2M\sigma ^2T_{\max }c_2. \end{aligned}$$

This is due to Lemma A.2 and \(2(d+1)e^{-c_2}=\frac{1}{48}\), \(c_2=\log 96 +\log (d+1)\) and \(e^{-T_{\max }+\log T+\log T_{\max }}\le 1/48\). Our claim follows. \(\square \)

Using this lemma, to prove Theorem B.2 it suffices to show that \(\Vert x(t)-x(0)\Vert >S\).

1.1.1 Exponential Instability of Asynchronous Gradient Dynamics

We will show \(\Vert x(t)-x(0)\Vert >S\) by analyzing the exponential instability of the asynchronous gradient dynamics near strict saddle points. In fact, we have:

Theorem C.3

Supposing \(x_k\) satisfies \(\lambda _{\min }(\nabla ^2f(x_k))\le -\sqrt{\rho \epsilon }/2\), we have,

$$\begin{aligned} {\mathbb {P}}\left( \max _{t\le T_{\max }}\Vert x_{k+t}-x_k\Vert \ge S\right) \ge 1/12. \end{aligned}$$
(28)

The main idea of the proof of this theorem is to consider an equation with the following form

$$\begin{aligned} x(k)= & {} x(k-1)+\eta \left[ \sum _{m=1}^M ({\varvec{H}}+\varDelta (k-\tau _{k,m}))x(k-\tau _{k,m})\right. \nonumber \\{} & {} \left. +\,\sqrt{M}{\hat{\zeta }}_k +\sum _m {\hat{\xi }}_{k,m}\right] , \end{aligned}$$
(29)

where \(\varvec{H}\) is a symmetric matrix with \(\lambda _{\max }(\varvec{H})\ge \sqrt{\rho \epsilon }/2\), \(\Vert \varDelta (k-\tau _{k,m})\Vert \le \rho S\), \({\hat{\zeta }}_k=2N(0,r^2/d)e_1\), and \({\hat{\xi }}_{k,m}\) is \(\ell ^2\Vert x(k)\Vert ^2\)-norm-sub-Gaussian by Assumptions 2 and 3. This is a time-delayed equation. Inspired by the Mao–Razumikhin–Lyapunov stability theorem for stochastic differential equations with finite delay (see [15, 16]), we can show (see Theorem E.5):

$$\begin{aligned} \Vert x(T_{\max })\Vert \ge \frac{(1+q)^{T_{\max }-T}\sqrt{M}\eta r }{6\sqrt{3}\sqrt{2q d}}, \end{aligned}$$
(30)

where \(q= (M\eta \sqrt{\rho \epsilon }/2)\, e^{-(T+1)M \eta \sqrt{\rho \epsilon }/2}\), which proves the theorem.
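A toy simulation (our illustration; the constants are arbitrary and not taken from the paper) of the delayed dynamics (29) shows the exponential instability that this bound quantifies: the iterate escapes along the top eigendirection of \(\varvec{H}\) at an empirical rate of at least \(\log (1+q)\).

    import numpy as np

    # Simulate x(k) = x(k-1) + eta * sum_m H x(k - tau_{k,m}) with random
    # staleness: each of the M workers reads an iterate that is between
    # 1 and T+1 steps old. The noise terms of (29) are kept negligible.
    rng = np.random.default_rng(1)
    d, M, T, eta, steps = 4, 8, 5, 1e-3, 4000
    gamma = 1.0                              # top eigenvalue of H
    H = np.diag([gamma, -0.5, -0.2, 0.1])    # symmetric, lambda_max = gamma

    hist = [np.zeros(d)] * T + [1e-8 * rng.normal(size=d)]
    for _ in range(steps):
        drift = sum(H @ hist[-1 - rng.integers(0, T + 1)] for _ in range(M))
        hist.append(hist[-1] + eta * drift)

    q = M * eta * gamma * np.exp(-(T + 1) * M * eta * gamma)
    rate = np.log(np.linalg.norm(hist[-1]) / np.linalg.norm(hist[T])) / steps
    print(rate, np.log(1 + q))  # empirical growth rate vs. log(1+q)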

Proof of Theorem C.3

In order to analyze \(x(t)\) under the updating rules, as in [9], the standard proof strategy is to consider two sequences \(\{x_1(t)\}\) and \(\{x_2(t)\}\), two separate runs of Algorithm 1 starting from \(x(k)\) (for all \(t\le k\), \(x_1(t)=x_2(t)\)). They are coupled such that, for the Gaussian noises \(\zeta _1(t)\) and \(\zeta _2(t)\) in Algorithm 1, \(e_1^T\zeta _1=-e_1^T\zeta _2\), where \(e_1\) is the eigenvector corresponding to the minimum eigenvalue of \(\nabla ^2f(x)\), and the components of \(\zeta _1\) and \(\zeta _2\) in any direction perpendicular to \(e_1\) are equal. Given the coupled sequences \(\{x_1(k+t)\}\) and \(\{x_2(k+t)\}\), let \(x(t)=x_1(k+t)-x_2(k+t)\). We have \(x(k)=x(k-1)+\eta (\sum _{m=1}^M\nabla f(x_1(k-\tau _{k,m}))+\zeta _{1,k,m}-\sum _{m=1}^M\nabla f(x_2(k-\tau _{k,m})) -\zeta _{2,k,m})\). It is easy to see,

$$\begin{aligned} \nabla f(&x_1)-\nabla f(x_2)\\&=\int _0^1 \nabla ^2f(tx_1+(1-t)x_2)(x_1-x_2)\,\textrm{d}t \\&=[\nabla ^2f(x_0)+\int _0^1 \nabla ^2f(tx_1+(1-t)x_2)\,\textrm{d}t -\nabla ^2f(x_0)](x_1-x_2). \end{aligned}$$

Let \({\varvec{H}}=\nabla ^2f(x_0)\), \(\varDelta _{x_1,x_2} = \int _0^1 \nabla ^2f(tx_1+(1-t)x_2)\,\textrm{d}t-\nabla ^2f(x_0)\). We have,

$$\begin{aligned} x(k)= & {} x(k-1)\nonumber \\{} & {} \quad +\,\eta \left[ \sum _{m=1}^M ({\varvec{H}}+\varDelta _{x_1(k-\tau _{k,m}),x_2(k-\tau _{k,m})})x(k-\tau _{k,m})+\zeta _{1,k,m}- \zeta _{2,k,m}\right] .\nonumber \\ \end{aligned}$$
(31)

We need to estimate the probability of the event

$$\begin{aligned} \left\{ \max _{t\le T_{\max }}(\Vert x_1(k+t)-x_1(k)\Vert ^2,\Vert x_2(k+t)-x_2(k)\Vert ^2)\ge S^2\ \text { or } \ \Vert x(t)\Vert \ge 2S\right\} . \end{aligned}$$

It is enough to consider a random process \(x'\) that coincides with \(x\) on the event \(E=\{ \max _{t\le T_{\max }}(\Vert x_1(k+t)-x_1(k)\Vert ^2,\Vert x_2(k+t)-x_2(k)\Vert ^2)\le S^2\}\). This is from

$$\begin{aligned}{} & {} {\mathbb {P}}(\max (\Vert x_1(k+t)-x_1(k)\Vert ^2,\Vert x_2(k+t)-x_2(k)\Vert ^2)\nonumber \\{} & {} \quad \le S^2\ \forall t \le T_{\max } \ \text { or }\ \Vert x(t)\Vert< 2S )\nonumber \\{} & {} \quad ={\mathbb {P}}( \max (\Vert x_1(k+t)-x_1(k)\Vert ^2,\Vert x_2(k+t)-x_2(k)\Vert ^2)\nonumber \\{} & {} \quad \le S^2\ \forall t \le T_{\max }\ \text { or }\ \Vert x'(t)\Vert < 2S ). \end{aligned}$$
(32)

Then we can turn to consider \(x'\), such that

$$\begin{aligned} x'(k)= & {} x'(k-1)\nonumber \\{} & {} \quad +\,\eta \left[ \sum _{m=1}^M ({\varvec{H}}+\varDelta '_{x_1(k-\tau _{k,m}),x_2(k-\tau _{k,m})})x(k-\tau _{k,m})+\zeta _{1,k,m}- \zeta _{2,k,m}\right] .\nonumber \\ \end{aligned}$$
(33)

where \(\varDelta '(t)=\varDelta \) if \(\max (\Vert x_1(t)-x(k)\Vert ^2,\Vert x_2(t)-x(k)\Vert ^2)\le S^2\), and \(\varDelta '(t)=\rho S\) otherwise.

Then \(\Vert \varDelta '_{x_1(k-\tau _{k,m}),x_2(k-\tau _{k,m})}\Vert \le \rho S\). To simplify symbols, we denote \(x=x'\). To show that \(\Vert x(T_{\max })\Vert \ge 2S\), we consider (31). Let \(\{\zeta _{1,i}, \zeta _{2,i}\}\), \(\{\xi _{1,i,m}, \xi _{2,i,m}\}\) be the Gaussian noise and stochastic gradient noise in the two runs. We set \(\zeta _i=\zeta _{1,i}-\zeta _{2,i}\), \(\xi _{i,m}=\xi _{1,i,m}-\xi _{2,i,m}\). It is easy to see that \(\zeta _i=2{\varvec{P}}\zeta _{1,i}\), where \({\varvec{P}}\) is the projection matrix onto \(e_1\). This is from the definition of the coupled sequences. And from Assumption 3, \(\xi _{i,m}\) is \(\ell ^2\Vert x(i-\tau _{i,m})\Vert ^2\)-norm-sub-Gaussian.
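A minimal sketch (ours) of this coupling: the two runs share every noise component orthogonal to \(e_1\), while the components along \(e_1\) are reflections of each other.

    import numpy as np

    # Couple the Gaussian noises of the two runs: reflect run 2's noise
    # along e1, so e1^T zeta_2 = -e1^T zeta_1 while all components
    # perpendicular to e1 agree; hence zeta_1 - zeta_2 = 2 P zeta_1.
    rng = np.random.default_rng(2)
    d = 6
    e1 = np.zeros(d); e1[0] = 1.0            # minimum eigendirection of H

    zeta1 = rng.normal(size=d)               # Gaussian noise of run 1
    zeta2 = zeta1 - 2.0 * (e1 @ zeta1) * e1  # coupled noise of run 2

    P = np.outer(e1, e1)                     # projection matrix onto e1
    assert np.isclose(e1 @ zeta2, -(e1 @ zeta1))
    assert np.allclose((np.eye(d) - P) @ zeta1, (np.eye(d) - P) @ zeta2)
    assert np.allclose(zeta1 - zeta2, 2.0 * P @ zeta1)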

Then there is a polynomial function \(f(t_0,t,y)\) such that \(x(k)=\psi (k)+\phi (k)+\phi _{sg}(k)\), and

$$\begin{aligned}&\psi (k)=\sqrt{M}\eta \sum _{i=0}^{k-1} f(i,k,{\varvec{H}})\zeta _i, \nonumber \\&\phi (k)=\eta \sum _m \sum _{i=0}^{k-1} f(i,k,{\varvec{H}})\varDelta (i-\tau _{i,m})x(i-\tau _{i,m}),\nonumber \\&\phi _{sg}(k)=\eta \sum _m\sum _{i=0}^{k-1}f(i,k,{\varvec{H}})\xi _{i,m}. \end{aligned}$$
(34)

Here \(f(t_0,t,{\varvec{H}})\) is the solution (fundamental solution) of the following linear equation:

$$\begin{aligned} x(k)= & {} x(k-1)+ \eta \left[ \sum _{m=1}^M {\varvec{H}}x(k-\tau _{k,m})\right] ,\nonumber \\ x(t_0)= & {} {\varvec{I}},\nonumber \\ x(n)= & {} {\varvec{0}} \text { for all } n<t_0. \end{aligned}$$
(35)

This is a standard fact for linear time-varying systems. And if the maximal eigenvalue of \({\varvec{H}}\) is \(\gamma \), it is easy to see that for any vector \(V=\varvec{P}V\) with \(\Vert V\Vert =1\), \(\Vert f(t_0,t,{\varvec{H}})V\Vert _2= f(t_0,t,\gamma )\triangleq f(t_0,t)\).

Lemma D.10

Let \(f(t_0,t)=f(t_0,t,\gamma )\), \(\beta ^2(k)=\sum _{i=0}^k f^2(i,k)\). We have,

  1. (1)

    \(f(t_0,t_1)f(t_1,t_2)\le f(t_0,t_2)\);

  2. (2)

    \(f(t_1,t_2)\ge f(t_1,t_2-1)\);

  3. (3)

    \(f(k,t)\beta (k)=\sqrt{\sum _{j=0}^{k-1}f^2(k,t)f^2(j,k)}\le \sqrt{\sum _{j=0}^{k-1}f^2(j,t)}\le \beta (t)\);

  4. (4)

    \(f(k,k+t) \ge (1+M\eta \gamma e^{-(T+1)M \eta \gamma } )^{t-T}f(k,k)\) if \(t\ge T\);

  5. (5)

    \(q= M\eta \gamma e^{-(T+1)M \eta \gamma }\), \(\beta ^2(k)\ge \sum _{j=0}^{k-T} (1+q)^{2j} \ge \frac{(1+q)^{2(k-T)}}{3\cdot 2q}\) when \(k-T\ge \ln 2/q\).

Proof

The first three inequalities are trivial. (4) is from Corollary E.2. (5) is easily deduced from (4). \(\square \)
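The growth claimed in (4) and (5) can be observed numerically (our sketch; it fixes a worst-case constant staleness and arbitrary constants, so it is an illustration rather than a proof).

    import numpy as np

    # Scalar fundamental solution of (35): f(t0, t0) = 1, f(n) = 0 for
    # n < t0, and f(k) = f(k-1) + eta * M * gamma * f(k-1-T), where each
    # worker reads the stalest admissible iterate.
    M, T, eta, gamma, horizon = 8, 5, 1e-3, 1.0, 3000
    q = M * eta * gamma * np.exp(-(T + 1) * M * eta * gamma)

    f = np.zeros(horizon + T + 1)
    f[T] = 1.0                               # index T plays the role of t0
    for k in range(T + 1, horizon + T + 1):
        f[k] = f[k - 1] + eta * M * gamma * f[k - 1 - T]

    for t in [T, 100, 1000, horizon]:
        print(t, f[T + t], (1 + q) ** (t - T))  # f(k, k+t) vs. lower bound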

Now we can estimate the \(\phi \) term.

$$\begin{aligned} \phi (t+1) = \eta \sum _m \sum _{n=0}^{t} f(n,t+1,{\varvec{H}}) \varDelta (n-\tau _{n,m}) x(n-\tau _{n,m}). \end{aligned}$$
(36)

To give an estimation, we need the Chernoff bound. Let

$$\begin{aligned} {\varvec{Y}}=\begin{bmatrix} 0 &{} X^T \\ X &{} 0 \end{bmatrix} , \quad {\varvec{Y}}_N=\begin{bmatrix} 0 &{} \psi ^T \\ \psi &{} 0 \end{bmatrix}, \\ {\varvec{Y}}_\phi =\begin{bmatrix} 0 &{} \phi ^T \\ \phi &{} 0 \end{bmatrix}, \quad {\varvec{Y}}_{sg}=\begin{bmatrix} 0 &{} \phi _{sg}^T \\ \phi _{sg} &{} 0 \end{bmatrix}. \end{aligned}$$

We then have the following theorem:

Theorem D.4

For all \(0\le t\le T_{\max }\) and \(\theta ^2\le \frac{1}{48\cdot c(\sum _{j=1}^{t} p^j)^2 \beta ^2(t)M\eta ^2 4r^2/d}\), with \(C_2=3\times 65\), we have:

$$\begin{aligned}&\mathrm{(a)}\quad {\mathbb {E}} tr\{ e^{\theta {\varvec{Y}}_\phi (t)+\theta {\varvec{Y}}_{sg}(t) }\}\le e^{c\theta ^2 (\sum _{j=1}^{t} p^j)^2 \beta ^2(t)M\eta ^2 4r^2/d}(d+1),\\&\mathrm{(b)}\quad {\mathbb {E}} tr\{ e^{\theta {\varvec{Y}}(t)}\} \le e^{c\theta ^2 \left( 1+\sum _{j=1}^{t} p^j\right) ^2\beta ^2(t) 4M\eta ^2 r^2/d}(d+1),\\&\mathrm{(c)}\quad {\mathbb {E}} e^{\theta ^2 \Vert {\varvec{Y}}(t)\Vert ^2} \le e^{C_2c\theta ^2 \left( 1+\sum _{j=1}^{t} p^j\right) ^2\beta ^2(t) 4M\eta ^2 r^2/d}(d+1). \end{aligned}$$

Proof

We use mathematical induction. In the proof, Lemma A.5 will be used many times. For \(t=0\), the first inequality is obviously true. For the second one,

$$\begin{aligned} x(0)=\psi (0)+\phi (0)+\phi _{sg}(0)=\psi (0), \end{aligned}$$

so since \(\psi \) is sub-Gaussian, we have,

$$\begin{aligned} {\mathbb {E}} tr\{e^{\theta {\varvec{Y}}(0)}\}\le & {} e^{c\theta ^2 \beta ^2(0)M\eta ^24r^2/d}(d+1),\nonumber \\ {\mathbb {E}} e^{\theta ^2 \Vert {\varvec{Y}}(0)\Vert ^2}\le & {} e^{65c\theta ^2 \beta ^2(0)M\eta ^24r^2/d}\nonumber \\\le & {} e^{C_2c\theta ^2 \left( 1+\sum _{j=1}^{0} p^j\right) ^2 \beta ^2(0)M\eta ^2 4r^2/d}. \end{aligned}$$
(37)

Then supposing the lemma is true for all \(\tau \le t\), we consider \(t+1\).

$$\begin{aligned} {\mathbb {E}} tr\{e^{\theta {\varvec{Y}}_\phi (t+1)}\}={\mathbb {E}} tr\{e^{\theta ( \eta \sum _m \sum _i f(i,t,H) \varDelta (i-\tau _{i,m}) {\varvec{Y}}(i-\tau _{i,m}))}\}. \end{aligned}$$

Thus,

$$\begin{aligned} {\mathbb {E}} tr\{ e^{\theta {\varvec{Y}}_\phi (t+1) }\}= & {} {\mathbb {E}} tr \{e^{ \theta ( \eta \sum _m \sum _i f(i,t+1,{\varvec{H}}) \varDelta (i-\tau _{i,m}) {\varvec{Y}}(i-\tau _{i,m}))}\} \nonumber \\\overset{(1)}{\le } & {} e^{c\theta ^2 (Mt f(i,t+1) \eta \rho S)^2 \left( 1+\sum _{j=1}^t p^j\right) ^2\beta ^2(i) M\eta ^2 4r^2/d}(d+1)\nonumber \\\le & {} e^{c\theta ^2 (Mt \eta \rho S)^2\left( 1+\sum _{j=1}^t p^j\right) ^2 \beta ^2(t+1)M\eta ^2 4r^2/d}(d+1). \end{aligned}$$
(38)

Here (1) is from Lemma A.5, together with \(e^X=1+X+\frac{X^2}{2}+\cdots \), \(\Vert f(i,t+1,{\varvec{H}})\Vert \le f(i,t+1)\), \(\beta (i-\tau )\le \beta (i)\), and \(\Vert \varDelta (i-\tau _{i,m})\Vert \le \rho S\). Meanwhile,

$$\begin{aligned} {\mathbb {E}} tr\{ e^{\theta {\varvec{Y}}_{sg}(t+1) }\}= & {} tr\{{\mathbb {E}} e^{ \sum _i\sum _m c\theta ^2 ( f(i,t+1)\eta \ell \Vert {\varvec{Y}}(i-\tau _{i,m})\Vert )^2}{\varvec{I}}\}\nonumber \\\le & {} e^{C_2c\theta ^2 (\sqrt{Mt} f(i,t+1) \eta \ell )^2 \left( 1+\sum _{j=1}^t p^j\right) ^2\beta ^2(i) M\eta ^2 4r^2/d}(d+1)\nonumber \\\le & {} e^{C_2c\theta ^2 (\sqrt{Mt} \eta \ell )^2\left( 1+\sum _{j=1}^t p^j\right) ^2 \beta ^2(t+1)M\eta ^2 4r^2/d}(d+1). \end{aligned}$$
(39)

Combining Lemma A.5 with (38) and (39), we have,

$$\begin{aligned}{} & {} {\mathbb {E}} tr\{ e^{\theta {\varvec{Y}}_\phi (t+1)+\theta {\varvec{Y}}_{sg}(t+1) }\}\nonumber \\{} & {} \quad \le e^{c\theta ^2 (Mt \eta \rho S+\sqrt{C_2}\sqrt{Mt} \eta \ell )^2\left( 1+\sum _{j=1}^t p^j\right) ^2 \beta ^2(t+1)M\eta ^2 4r^2/d}(d+1)\nonumber \\{} & {} \quad \le e^{c\theta ^2 p^2(1+\sum _{j=1}^t p^j)^2 \beta ^2(t+1)M\eta ^2 4r^2/d}(d+1)\nonumber \\{} & {} \quad = e^{c\theta ^2 (\sum _{j=1}^{t+1} p^j)^2 \beta ^2(t+1)M\eta ^2 4r^2/d}(d+1). \end{aligned}$$
(40)

This proves (a).

As for (b), we can use (40), Lemma A.5, and the Chernoff bound for the Gaussian distribution to deduce that:

$$\begin{aligned} {\mathbb {E}} tr\{ e^{\theta {\varvec{Y}}(t+1)}\}= & {} {\mathbb {E}} tr \{e^{\theta ({\varvec{Y}}_N(t+1)+{\varvec{Y}}_{sg}(t+1) +{\varvec{Y}}_\phi (t+1))}\}\nonumber \\\le & {} e^{c\theta ^2 \left( 1+\sum _{j=1}^{t+1} p^j\right) ^2\beta ^2(t+1) 4M\eta ^2 r^2/d}(d+1). \end{aligned}$$
(41)

For (c), first, using Lemma A.4, we have,

$$\begin{aligned} \begin{aligned} {\mathbb {E}} e^{\theta ^2 \Vert {\varvec{Y}}_N(t)\Vert ^2} \le e^{65c\theta ^2 \beta ^2(t)M^2\eta ^24r^2/d}. \end{aligned} \end{aligned}$$
(42)

And

$$\begin{aligned} {\mathbb {E}} e^{\theta ^2 \Vert {\varvec{Y}}_{sg}(t+1)\Vert ^2}&\le {\mathbb {E}} e^{\sum _i \sum _m 65c\theta ^2 (f(i,t+1) \eta \ell \Vert {\varvec{Y}}(i-\tau _{i,m})\Vert )^2}\\&\le e^{65C_2c\theta ^2 (\sqrt{Mt} f(i,t+1) \eta \ell )^2 \left( 1+\sum _{j=1}^t p^j\right) ^2\beta ^2(i) M\eta ^2 4r^2/d}\\&\le e^{65C_2c\theta ^2 (\sqrt{Mt} \eta \ell )^2\left( 1+\sum _{j=1}^t p^j\right) ^2 \beta ^2(t+1)M\eta ^2 4r^2/d}. \end{aligned}$$

On the other hand,

$$\begin{aligned} {\mathbb {E}} e^{\theta ^2 \Vert {\varvec{Y}}_\phi (t)\Vert ^2 }= & {} {\mathbb {E}} e^{ \theta ^2( \eta \sum _m \sum _i f(i,t,{\varvec{H}}) \varDelta (i-\tau _{i,m})\Vert {\varvec{Y}}(i-\tau _{i,m})\Vert )^2} \nonumber \\\le & {} e^{65C_2c\theta ^2 (Mt f(i,t+1) \eta \rho S)^2 \left( 1+\sum _{j=1}^t p^j\right) ^2\beta ^2(i) M\eta ^2 4r^2/d}\nonumber \\\le & {} e^{65 C_2c\theta ^2 (Mt \eta \rho S)^2\left( 1+\sum _{j=1}^t p^j\right) ^2 \beta ^2(t+1)M\eta ^2 4r^2/d}. \end{aligned}$$
(43)

Thus,

$$\begin{aligned} {\mathbb {E}} e^{\theta ^2 \Vert {\varvec{Y}}(t+1)\Vert ^2}\le & {} {\mathbb {E}} e^{3\theta ^2 \Vert {\varvec{Y}}_N(t+1)\Vert ^2+3\theta ^2 \Vert {\varvec{Y}}_{sg}(t+1)\Vert ^2+3\theta ^2 \Vert {\varvec{Y}}_\phi (t+1)\Vert ^2}\nonumber \\\le & {} e^{65c\theta ^2 3\left[ 1+(\sqrt{C_2}Mt \eta \rho S+\sqrt{C_2} \sqrt{Mt} \eta \ell )\left( 1+\sum _{j=1}^t p^j\right) \right] ^2 \beta ^2(t+1)M\eta ^2 4r^2/d}\nonumber \\\le & {} e^{C_2c\theta ^2 \left( 1+\sum _{j=1}^{t+1} p^j\right) ^2 \beta ^2(t+1)M\eta ^2 4r^2/d}. \end{aligned}$$
(44)

where we use (b) of Lemma A.6. Then (c) follows. \(\square \)

Using Lemma A.1, we have,

Corollary D.1

For any \(\iota >0\), we have,

$$\begin{aligned} {\mathbb {P}}(\Vert \phi _{sg}(k)+\phi (k)\Vert \le \frac{\beta (k)\sqrt{M}\eta 2r}{C\sqrt{d}}(\sqrt{48c}+\iota ) )\ge 1-2(d+1)e^{-\iota }, \end{aligned}$$

where \(\frac{1}{C}=\sum _{i=1}^{\infty }p^i=\frac{p}{1-p}\).

We select \(\iota =b=\log (2(d+1)) +\log 2\); then \(\frac{2\sqrt{48c}+2b }{C}=\frac{1}{2}\), so \({\mathbb {P}}(\Vert \phi _{sg}(k)+\phi (k)\Vert \le \frac{\beta (k)\sqrt{M}\eta r}{2\sqrt{d}})\ge \frac{1}{2}\).
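Spelling out the arithmetic behind this choice (our addition): with \(b=\log (2(d+1))+\log 2\),

$$\begin{aligned} 2(d+1)e^{-b}=2(d+1)\cdot \frac{1}{2(d+1)}\cdot \frac{1}{2}=\frac{1}{2}, \end{aligned}$$

so the bound of Corollary D.1 holds with probability at least \(1/2\), and \(\frac{2r}{C}(\sqrt{48c}+b)=r\cdot \frac{2\sqrt{48c}+2b}{C}=\frac{r}{2}\) turns its radius into \(\frac{\beta (k)\sqrt{M}\eta r}{2\sqrt{d}}\).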

Lemma D.11

For all k:

$$\begin{aligned} {\mathbb {P}}(\Vert \psi (k)\Vert \ge \frac{\beta (k)\sqrt{M}2\eta r}{3\sqrt{d}})\ge \frac{2}{3}. \end{aligned}$$

Proof

Since \(\psi \) is Gaussian, \({\mathbb {P}}(|X|\le \lambda \sigma )\le 2\lambda /\sqrt{2\pi }\le \lambda \) for any normal random variable \(X\) with standard deviation \(\sigma \). Let \(\lambda =\frac{1}{3}\); then \(\Vert \psi (k)\Vert \ge \frac{\beta (k)\sqrt{M}2\eta r}{3\sqrt{d}}\) with probability at least 2/3. \(\square \)
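For completeness (our addition), the anti-concentration bound used here follows from bounding the Gaussian density by its peak value: for \(X\sim N(0,\sigma ^2)\),

$$\begin{aligned} {\mathbb {P}}(|X|\le \lambda \sigma )=\int _{-\lambda \sigma }^{\lambda \sigma }\frac{1}{\sqrt{2\pi }\sigma }e^{-x^2/2\sigma ^2}\,\textrm{d}x\le \frac{2\lambda \sigma }{\sqrt{2\pi }\sigma }=\frac{2\lambda }{\sqrt{2\pi }}\le \lambda , \end{aligned}$$

and it is applied to \(X=e_1^T\psi (k)\), a normal variable with standard deviation \(\beta (k)\sqrt{M}\cdot 2\eta r/\sqrt{d}\).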

Using the above results, with probability at least 1/6, \(\Vert x(k)\Vert \ge \Vert \psi (k)\Vert -\Vert \phi _{sg}(k)+\phi (k)\Vert \ge \frac{\beta (k)\sqrt{M}\eta r}{6\sqrt{d}}\). Then, with probability at least 1/6, either \(\max _{0\le t \le T_{\max }} (\Vert x_1(k+t)-x_1(k)\Vert ,\Vert x_2(k+t)-x_2(k)\Vert )\ge S\), or \(\Vert x_1(k+t)-x_2(k+t)\Vert \ge \frac{\beta (k)\sqrt{M}\eta r}{6\sqrt{d}}\); and for all \(k-T\ge \ln 2/q\), with \(q=(M\eta \sqrt{\rho \epsilon }/2)e^{-f}\) and \(f=(T+1)M\eta \sqrt{\rho \epsilon }/2\), we have \(\beta ^2(k)\ge \frac{(1+q)^{2(k-T)}}{6q}.\)

For the two coupled runs \(\{x_1(t)\}\), \(\{x_2(t)\}\) of Algorithm 1, we have:

$$\begin{aligned}{} & {} \max (\Vert x_1(k+t)-x_1(k)\Vert ,\Vert x_2(k+t)-x_2(k)\Vert )\nonumber \\{} & {} \quad \ge \frac{1}{2}\Vert x_1(k+t)-x_2(k+t)\Vert . \end{aligned}$$
(45)

Since \(u\le {\widetilde{{\mathcal {O}}}}(1)\) is large enough to satisfy (e) in Lemma A.6 and \(\eta \) is small enough that \((1+q)^{1/q}>2\), we have,

$$\begin{aligned} \frac{\beta (T_{\max })\sqrt{M}\eta r}{6\sqrt{d}}\ge & {} \frac{(1+q)^{T_{\max }-T}\sqrt{M}\eta r }{6\sqrt{3}\sqrt{2q d}}\nonumber \\\ge & {} \frac{(1+q)^{u/q }\sqrt{M}\eta r}{6\sqrt{3}\sqrt{2q d}}\ge \frac{2^{u }\sqrt{M}\eta r}{6\sqrt{3}\sqrt{2q d}}\ge 2S. \end{aligned}$$
(46)

Thus with probability at least 1/6,

$$\begin{aligned} \max _{t\le T_{\max }}(\Vert x_1(k+t)-x_1(k)\Vert ,\Vert x_2(k+t)-x_2(k)\Vert )\ge S. \end{aligned}$$

Therefore,

$$\begin{aligned}&{\mathbb {P}}\left( \max _{t\le T_{\max }}\Vert x_1(k+t)-x_1(k)\Vert \ge S\right) \\&\quad ={\mathbb {P}}(\max _{t\le T_{\max }}\Vert x_2(k+t)-x_2(k)\Vert \ge S)\\&\quad \ge \frac{1}{2}{\mathbb {P}}(\max _{t\le T_{\max }}(\Vert x_1(k+t)-x_1(k)\Vert ,\Vert x_2(k+t)-x_2(k)\Vert )\ge S)\\&\quad \ge 1/12. \end{aligned}$$

This proves Theorem C.3. \(\square \)

The Growth Rate of Polynomial \(f(t_1,t_2)\)

In this section, we prove the last property of the polynomial \(f(t_1,t_2)\) in Lemma D.10. First, consider the synchronous case, where the delay is \(T=0\); there we have Lyapunov's first theorem.

Lemma E.12

Let \({\varvec{A}}\) be a symmetric matrix with maximum eigenvalue \(\gamma >0\). Suppose the updating rule of \(x\) is

$$\begin{aligned} \begin{aligned} x(n+1)=x(n)+{\varvec{A}}x(n). \end{aligned} \end{aligned}$$
(47)

Then x(n) is exponentially unstable in the neighborhood of zero.

This can be proved by choosing a Lyapunov function. We consider \(V(n)=x(n)^T{\varvec{P}}x(n)\), where \({\varvec{P}}\) is the projection matrix onto the eigenspace of the maximum eigenvalue. We can show that \(V(n+1)=(1+\gamma )^2V(n)\).
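A one-line verification (our addition), using that \({\varvec{P}}\) is the orthogonal projection onto the top eigenspace, so \({\varvec{P}}{\varvec{A}}={\varvec{A}}{\varvec{P}}=\gamma {\varvec{P}}\) and \(V(n)=x(n)^T{\varvec{P}}x(n)=\Vert {\varvec{P}}x(n)\Vert ^2\):

$$\begin{aligned} {\varvec{P}}x(n+1)={\varvec{P}}({\varvec{I}}+{\varvec{A}})x(n)=(1+\gamma ){\varvec{P}}x(n),\quad \text {so}\quad V(n+1)=(1+\gamma )^2V(n). \end{aligned}$$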

This method can be generalized to the asynchronous (time-delayed) systems using a Lyapunov–Razumikhin argument.

Theorem E.5

For a discrete system, let \(V(n,x)\) be a positive-valued Lyapunov function. Let \(\varOmega \) be the space of discrete functions \(x(\cdot )\) from \(\{-T,\ldots ,0,1,2,\ldots \}\) to \({\mathbb {R}}^d\) such that \(x(\cdot )\) is a solution of the given discrete system equation. Suppose there exist \(q, q_m\) satisfying the following two conditions:

$$\begin{aligned}{} & {} \mathrm{(a)}\quad V(t+1,x(t+1))\ge q_m V(t,x(t)), q_m>0;\nonumber \\{} & {} \mathrm{(b)}\quad \text {If }\ V(t-\tau ,x(t-\tau ))\ge (1+q)^{-T} \frac{q_m}{1+q}V(t,x(t)),\ \forall 0\le \tau \le T, \nonumber \\{} & {} \qquad \quad \text {we have }\ V(t+1,x(t+1))\ge (1+q) V(t,x(t)). \end{aligned}$$
(48)

Then for any \(x(\cdot )\in \varOmega \) satisfying that for all \(-T\le t\le 0 \), \(V(t,x(t))\ge p V(0,x(0))\) with \(0<p\le 1\), we have \(V(t,x(t))\ge (1+q)^t pV(0,x(0))\) for all \(t>0\).

Proof

Let \(B(n)=(1+q)^{-n} V(n)\). To prove our theorem, we only need to show B(n) has a lower bound.

\(B(0)=V(0)\ge p V(0)\triangleq p'\). Assume there is a \(t>0\) such that \(B(t)=(1+q)^{-t} V(t)< p'\), and select the minimal such \(t\), so that \(B(k)\ge p'\) for all \(k<t\) and \(B(t)< p'\). Note that \(V(t)\ge q_m V(t-1)\), so that \(B(t)\ge p'\frac{q_m}{1+q}\). Then for all \(k\) satisfying \(t-T\le k\le t\),

$$\begin{aligned} V(k)= & {} (1+q)^{k}B(k)\ge (1+q)^k p'\frac{q_m}{1+q}\nonumber \\= & {} (1+q)^{k-t} (1+q)^{t} p'\frac{q_m}{1+q}\ge (1+q)^{k-t} (1+q)^{t} \frac{q_m}{1+q} B(t)\nonumber \\\ge & {} (1+q)^{-T}\frac{q_m}{1+q} V(t). \end{aligned}$$
(49)

So we have \(V(t+1)\ge (1+q) V(t) \) and \( B(t+1)\ge B(t)\ge p' \frac{q_m}{1+q} \). If \(B(t+1)\ge p'\), then \(V(t+2)\ge q_m V(t+1)\), so that \(B(t+2)\ge B(t+1) \frac{q_m}{1+q}\ge p'\frac{q_m}{1+q}\). If \(B(t+1)<p'\), then \(V(t+1-\tau )\ge (1+q)^{-T}\frac{q_m}{1+q}V(t+1)\), and from the condition in (48), \( B(t+2)\ge B(t+1)\ge p' \frac{q_m}{1+q} \). This process can be continued, so that \(B(t) \ge p'\frac{q_m}{1+q} \) for any \(t\). Our claim follows. \(\square \)

Applying Theorem E.5 to (35), we set \( V(n,x)=\Vert {\varvec{P}}x(n)\Vert \). Supposing \(x(0)={\varvec{I}}\) and \(x(-t)={\varvec{0}}\) for all \(t>0\), we have \(\Vert {\varvec{P}}x(t)\Vert =e_1^Tx(t)\) and \(q_m=1\). Thus we easily obtain:

Corollary E.2

Let \(f(k,t)\) be the polynomial in Lemma D.10. Then \(f(k,k+t)\ge (1+q)^{t-T} f(k,k)\) if \(t\ge T\), where \(q=M\eta \gamma e^{-(T+1)M\eta \gamma }\).

Proof of Theorem 2

In this section, we prove Theorem 2. Lemma B.7 shows that, with high probability, at least \(K/2\) iterations are of the third kind, so that there is an integer \(k\) satisfying \(\sum _{i=k-2T}^{k-1}\Vert \nabla f(x_i)\Vert ^2 < F\) and \(\lambda _{\min }(\nabla ^2 f(x_{k}))> -\sqrt{\rho \epsilon }/2\). Then we have,

$$\begin{aligned} \Vert \nabla f(x_{i^*})\Vert ^2= & {} \min _{i\in \{k-T...k-1\}}\Vert \nabla f(x_i)\Vert ^2\nonumber \\\le & {} \frac{1}{T}\sum _{i=k-T}^{k-1}\Vert \nabla f(x_i)\Vert ^2 < \frac{1}{T}F\le \epsilon ^2. \end{aligned}$$
(50)

Using \(2\eta ^2M^2L^2T^3\le 1/5\), for every interval of the third kind, with probability at least 1/10,

$$\begin{aligned} \Vert x_{i^*}-x_k\Vert ^2= & {} \eta ^2\left\| \sum _{i=i^*}^{k-1}\sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})+\sum _m \sum _{i=i^*}^{k-1}\zeta _{i,m}\right\| ^2\nonumber \\\le & {} 2\eta ^2MT\sum _{i=i^*}^{k-1}\sum _m\Vert \nabla f(x_{i-\tau _{i,m}})\Vert ^2 +2\eta ^2\left\| \sum _{i=i^*}^{k-1}\sum _m \zeta _{i,m}\right\| ^2\nonumber \\\le & {} 2\eta ^2M^2 T^2F+20c\eta ^2MT\sigma ^2\nonumber \\\le & {} \epsilon ^2/4L^2. \end{aligned}$$
(51)

And

$$\begin{aligned} \lambda _{\min }\nabla ^2f(x_{i^*})> & {} -\sqrt{\rho \epsilon }/2-\rho \Vert x_{i^*}-x_{k}\Vert \nonumber \\> & {} -\sqrt{\rho \epsilon }/2-\sqrt{\rho \epsilon }/2\sqrt{\rho \epsilon }/L\ge -\sqrt{\rho \epsilon }. \end{aligned}$$
(52)

There are at least \(\lfloor K/4T \rfloor \) intervals of the third kind. Using Hoeffding's lemma, with high probability, we can find such an \(x_{i^*}\).

Note that \(K=\max \{100\iota \frac{f(x_0)-f(x_*)}{M\eta \frac{F}{T}},100\iota \frac{f(x_0)-f(x_*)}{M\eta \frac{F_2}{T_{\max }}}\}\sim \frac{1}{M\epsilon ^4}\). From \(2\eta ^2M^2L^2T^3 \le 1/5\), we have \(T\le {\widetilde{{\mathcal {O}}}}(K^{1/3}M^{-1/3})\). Our theorem follows. \(\square \)
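Spelling out the last step (our arithmetic, with \(\eta =\frac{\epsilon ^2}{w\sigma ^2L}\) from (19)): the constraint \(2\eta ^2M^2L^2T^3\le 1/5\) gives

$$\begin{aligned} T\le {\widetilde{{\mathcal {O}}}}\left( \eta ^{-2/3}M^{-2/3}\right) ={\widetilde{{\mathcal {O}}}}\left( \epsilon ^{-4/3}M^{-2/3}\right) , \end{aligned}$$

and since \(K\sim \frac{1}{M\epsilon ^4}\) means \(\epsilon ^{-4/3}\sim (MK)^{1/3}\), it follows that \(T\le {\widetilde{{\mathcal {O}}}}\left( (MK)^{1/3}M^{-2/3}\right) ={\widetilde{{\mathcal {O}}}}\left( K^{1/3}M^{-1/3}\right) \).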

Proof of Lemma B.7

Using Theorem B.1, if there are more than \(\lceil K/8T \rceil \) intervals of the first kind, then with probability \(1-3e^{-\iota }\),

$$\begin{aligned}&f(x_{K+1})-f(x_{0})\nonumber \\&\quad \le \sum _{k=0}^{K} -\frac{3M\eta }{8}\Vert \nabla f(x_k)\Vert ^2 + c\eta \sigma ^2\iota +2\eta ^2LM c\sigma ^2(K+1+\iota )\nonumber \\&\quad \le \sum _i\sum _{k\in S_i} -\frac{3M\eta }{8}\Vert \nabla f(x_k)\Vert ^2 + c\eta \sigma ^2\iota +2\eta ^2LM c\sigma ^2(K+1+\iota )\nonumber \\&\quad \le -\frac{K}{8T} \frac{3M\eta }{8} F + c\eta \sigma ^2\iota +2\eta ^2LM c\sigma ^2(K+1+\iota )\nonumber \\&\quad = -\frac{K}{8T} \frac{3M\eta }{8} (F-\frac{128\eta LT}{3}c\sigma ^2) + c\eta \sigma ^2\iota +2\eta ^2LM c\sigma ^2(K+1+\iota )\nonumber \\&\quad \le -\frac{K}{8T} \frac{3M\eta }{8} (F-50c\sigma ^2 \eta LT)+ c\eta \sigma ^2\iota +2\eta ^2LM c\sigma ^2(K+1+\iota )\nonumber \\&\quad \le -\frac{K}{8T} \frac{3M\eta }{8} \frac{1}{6}F + c\eta \sigma ^2\iota +2\eta ^2LM c\sigma ^2(K+1+\iota ). \end{aligned}$$
(53)

Since \(K\ge 100\iota T\frac{f(x_0)-f(x_*)}{M\eta F }\) and \(\iota \) is large enough, this is impossible.
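To see the contradiction explicitly (our arithmetic): the dominant term in (53) satisfies

$$\begin{aligned} \frac{K}{8T}\cdot \frac{3M\eta }{8}\cdot \frac{F}{6}=\frac{KM\eta F}{128T}\ge \frac{100\iota }{128}\left( f(x_0)-f(x_*)\right) , \end{aligned}$$

so, the remaining noise terms being of lower order, (53) would force \(f(x_{K+1})<f(x_*)\) for \(\iota \) large enough, contradicting the fact that \(f\ge f(x_*)\).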

As for (2), let \(z_i\) be the stopping times such that

$$\begin{aligned}{} & {} z_1=\inf \{j| S_j \text { is of the second kind}\}\nonumber \\{} & {} z_i=\inf \{j| T_{\max }/2T\le j-z_{i-1} \text { and } S_j \text { is of the second kind}\}. \end{aligned}$$
(54)

Let \(N = \max \{i\mid 2T \cdot z_i+T_{\max } \le K\}\). We have,

$$\begin{aligned}&f(x_{K+1})-f(x_0)\\&\quad \le \sum _{k=0}^{K} -\frac{3M\eta }{8}\Vert \nabla f(x_k)\Vert ^2 + c\eta \sigma ^2\iota \\&\quad \quad +\,\frac{3\eta ^2L}{2}Mc\sigma ^2(K+1+\iota )+ L^2 T M\eta ^3 Mc\sigma ^2(K+1+T+\iota )\\&\quad \le c\eta \sigma ^2\iota +\frac{3\eta ^2L}{2}Mc\sigma ^2(K+1+\iota )+ L^2 T M\eta ^3 Mc\sigma ^2(K+1+T+\iota )\\&\quad \quad +\, \sum _i^N \sum _{k=F_{z_i}}^{F_{z_i}+T_{\max }-1} -\frac{3M\eta }{8}\Vert \nabla f(x_k)\Vert ^2. \end{aligned}$$

Let \(X_i= \sum _{k=F_{z_i}}^{F_{z_i}+T_{\max }-1} \Vert \nabla f(x_k)\Vert ^2\). Then \(\sum _i X_i\) is a submartingale, and the last term of the above equation is \(-\frac{3M\eta }{8}\sum _i X_i\). Using Theorem B.2, \({\mathbb {P}}(X_i\ge F_2)\ge 1/24\). Let \(Y_i\) be the random variable such that \(Y_i=X_i\) if \(X_i\le F_2\), and \(Y_i=F_2\) otherwise. Then we have a bounded submartingale \(0\le Y_i\le X_i\). Using Azuma's inequality, we have,

$$\begin{aligned} \begin{aligned} {\mathbb {P}}\left( \sum _i^N X_i \ge {\mathbb {E}}\left\{ \sum _i Y_i\right\} -\lambda \right) \ge {\mathbb {P}}\left( \sum _i^N Y_i \ge {\mathbb {E}}\left\{ \sum _i Y_i\right\} -\lambda \right) \ge 1-2e^{-\frac{\lambda ^2}{2F_2^2N}}. \end{aligned}\nonumber \\ \end{aligned}$$
(55)

And it is easy to see that \({\mathbb {E}}\{\sum _i^N Y_i\}\ge \frac{1}{24}NF_2\). We have,

$$\begin{aligned} {\mathbb {P}}\left( \sum _i^N Y_i\ge \frac{1}{24}NF_2-\sqrt{2N}F_2\sqrt{\iota }\right) \ge 1-e^{-\iota }. \end{aligned}$$
(56)

If there are more than K/8T intervals of the second kind, we have \(N\ge K/4T_{\max }\) and

$$\begin{aligned} \frac{1}{24}N-\sqrt{2N}\sqrt{\iota }\ge \frac{1}{48}N. \end{aligned}$$

With probability at least \(1-3e^{-\iota }\),

$$\begin{aligned} \begin{aligned} f(x_{K+1})-f(x_0)&\le c\eta \sigma ^2\iota +2\eta ^2LMc\sigma ^2(K+1+\iota ) -\frac{3M\eta }{8}\frac{N}{48}F_2. \end{aligned} \end{aligned}$$
(57)

Since \(N\ge K/4T_{\max }\) and \(K\ge 100\iota T_{\max }\frac{f(x_0)-f(x_*)}{M\eta F_2}\), this is again impossible. \(\square \)

Proof of Theorem B.1

We need a lemma:

Lemma H.13

Under the condition of Theorem B.1, we have,

$$\begin{aligned}&f(x_{k+1})-f(x_k)\nonumber \\&\quad \le -\frac{M\eta }{2}\Vert \nabla f(x_k)\Vert ^2+ \left( \frac{3\eta ^2 L}{4}-\frac{\eta }{2M}\right) \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\nonumber \\&\quad \quad + \,L^2TM\eta \sum _{j=k-T}^{k-1} \eta ^2\left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2 - \eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle \nonumber \\&\quad \quad +\,\frac{3\eta ^2L}{2}\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2+ L^2 M\eta ^3 \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M \zeta _{j,m}\right\| ^2. \end{aligned}$$
(58)

Proof of Lemma H.13

This lemma is, in fact, a transformation of (30) in [13].

$$\begin{aligned} f(&x_{k+1})-f(x_k)\nonumber \\&\le \left\langle \nabla f(x_k),x_{k+1}-x_k\right\rangle +\frac{L}{2}\Vert x_{k+1}-x_k\Vert ^2\nonumber \\&=-\left\langle \nabla f(x_k),\eta \sum _{m=1}^M \nabla f(x_{k-\tau _{k,m}})+\eta \sum _{m=1}^M\zeta _{k,m}\right\rangle \nonumber \\&\quad +\,\frac{\eta ^2L}{2}\left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})+\sum _{m=1}^M\zeta _{k,m}\right\| ^2\nonumber \\&=-\left\langle \nabla f(x_k),\eta \sum _{m=1}^M \nabla f(x_{k-\tau _{k,m}})\right\rangle +\left\langle \nabla f(x_k),\eta \sum _{m=1}^M\zeta _{k,m}\right\rangle \nonumber \\&\quad +\,\frac{\eta ^2L}{2}\left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})+\sum _{m=1}^M\zeta _{k,m}\right\| ^2\nonumber \\&\overset{(1)}{=}\ -\frac{M\eta }{2}\left( \Vert \nabla f(x_k)\Vert ^2+ \left\| \frac{1}{M}\sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\nonumber \right. \\&\left. \quad - \,\left\| \nabla f(x_k)-\frac{1}{M}\sum _{m=1}^M \nabla f(x_{k-\tau _{k,m}})\right\| ^2\right) - \eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle \nonumber \\&\quad +\,\frac{\eta ^2L}{2}\left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})+\sum _{m=1}^M \zeta _{k,m}\right\| ^2\nonumber \\&= -\frac{M\eta }{2}\left( \Vert \nabla f(x_k)\Vert ^2+ \left\| \frac{1}{M}\sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\right. \nonumber \\&\left. \quad -\,\underbrace{\left\| \nabla f(x_k)-\frac{1}{M}\sum _{m=1}^M \nabla f(x_{k-\tau _{k,m}})\right\| ^2}_{T_1} \right) \nonumber \\&\quad - \,\eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle +\frac{\eta ^2L}{2}\left( \frac{3}{2}\left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2+3\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2\right) \nonumber \\&\overset{(2)}{\le }\ -\frac{M\eta }{2}\left[ \Vert \nabla f(x_k)\Vert ^2+ \left\| \frac{1}{M}\sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\nonumber \right. \\&\left. \quad -\,2L^2\left( \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\eta \sum _{m=1}^M \zeta _{j,m}\right\| ^2 + T\sum _{j=k-\tau ^{\max }_{k}}^{k-1} \eta ^2\left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2\right) \right] \nonumber \\&\qquad \left( \tau ^{\max }_{k}=\arg \max _{m\in \{1,\ldots ,M\}}\Vert x_k-x_{k-\tau _{k,m}}\Vert \right) \nonumber \\&\quad -\, \eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle +\frac{\eta ^2L}{2}\left( \frac{3}{2}\left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2+3\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2\right) \nonumber \\&\quad \le -\frac{M\eta }{2}\left[ \Vert \nabla f(x_k)\Vert ^2+ \left\| \frac{1}{M}\sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2-2L^2 \left( \eta ^2 \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M \zeta _{j,m}\right\| ^2 \right. \right. \nonumber \\&\left. \left. 
\quad + \,T\sum _{j=k-\tau ^{\max }_{k}}^{k-1} \eta ^2\left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2\right) \right] - \eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle \nonumber \\&\quad +\,\frac{\eta ^2L}{2}\left( \frac{3}{2}\left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2+3\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2\right) \nonumber \\&\le -\frac{M\eta }{2}\Vert \nabla f(x_k)\Vert ^2+ \left( \frac{3\eta ^2 L}{4}-\frac{\eta }{2M}\right) \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\nonumber \\&\quad +\, L^2TM\eta \sum _{j=k-T}^{k-1} \eta ^2\left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2\nonumber \\&\quad -\, \eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle +\frac{3\eta ^2L}{2}\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2+ L^2 M\eta ^3 \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M \zeta _{j,m}\right\| ^2. \end{aligned}$$
(59)

In (1) we use the fact \(\left\langle a,b\right\rangle =\frac{1}{2}(\Vert a\Vert ^2+\Vert b\Vert ^2-\Vert a-b\Vert ^2)\), and (2) is from the estimation of \(T_1\) in [13]. \(\square \)

From this lemma we can observe that, unlike general SGD, since stale gradients are used there is no guarantee that the function value decreases at every step. However, it can be proved that the overall trend of the function value is still decreasing.

Proof of Theorem B.1

$$\begin{aligned} f(&x_{t_0+\tau +1})-f(x_{t_0})\nonumber \\&=\sum _{k=t_0}^{t_0+\tau } f(x_{k+1})-f(x_{k})\nonumber \\&\le \sum _{k=t_0}^{t_0+\tau } -\frac{M\eta }{2}\Vert \nabla f(x_k)\Vert ^2+ \left( \frac{3\eta ^2 L}{4}-\frac{\eta }{2M}\right) \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\nonumber \\&\quad +\, L^2TM\eta \sum _{j=k-T}^{k-1} \eta ^2\left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2 \nonumber \\&\quad -\, \eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle +\frac{3\eta ^2L}{2}\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2+ L^2 M\eta ^3 \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M \zeta _{j,m}\right\| ^2 \nonumber \\&= \sum _{k=t_0}^{t_0+\tau } -\frac{M\eta }{2}\Vert \nabla f(x_k)\Vert ^2+ \sum _{k=t_0}^{t_0+\tau } \left( \frac{3\eta ^2 L}{4}-\frac{\eta }{2M}\right) \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\nonumber \\&\quad +\, L^2TM\eta \sum _{j=k-T}^{k-1} \eta ^2\left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2-\sum _{k=t_0}^{t_0+\tau } \eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle \nonumber \\&\quad +\,\frac{3\eta ^2L}{2}\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2+ L^2 M\eta ^3 \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M \zeta _{j,m}\right\| ^2\nonumber \\&\le \sum _{k=t_0}^{t_0+\tau } -\frac{M\eta }{2}\Vert \nabla f(x_k)\Vert ^2+ \sum _{k=t_0}^{t_0+\tau } \left( \eta ^2\left( \frac{3 L}{4}+L^2M T^2\eta \right) -\frac{\eta }{2M}\right) \nonumber \\&\quad \quad \times \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2\nonumber \\&\quad +\, L^2TM\eta ^3 \sum _{k=t_0-T}^{t_0-1} T \left\| \sum _{m=1}^M\nabla f(x_{j-\tau _{j,m}})\right\| ^2\nonumber \\&\quad \underbrace{-\sum _{k=t_0}^{t_0+\tau } \eta \left\langle \nabla f(x_k),\sum _{m=1}^M\zeta _{k,m}\right\rangle +\sum _{k=t_0}^{t_0+\tau }\frac{3\eta ^2L}{2}\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2}_{T_{2,a}}\nonumber \\&\quad +\,\underbrace{\sum _{k=t_0-T}^{t_0+\tau } L^2 M\eta ^3 \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M \zeta _{j,m}\right\| ^2}_{T_{2,b}}. \end{aligned}$$
(60)

In order to estimate \(T_2=T_{2,a}+T_{2,b}\), we can use lemmas in [11]. Let \(\zeta _k=\frac{1}{M}\sum _{m=1}^M\zeta _{k,m}\). With probability \(1-e^{-\iota }\), we have,

$$\begin{aligned} -\sum _{k=t_0}^{t_0+\tau } \eta \left\langle M\nabla f(x_k),\zeta _{k}\right\rangle \le \frac{\eta M}{8}\sum _{k=t_0}^{t_0+\tau } \Vert \nabla f(x_k)\Vert ^2+ c\eta \sigma ^2\iota . \end{aligned}$$
(61)

This is from Lemma 30 in [11]. Meanwhile, with probability \(1-e^{-\iota }\),

$$\begin{aligned} \sum _{k=t_0}^{t_0+\tau }\frac{3\eta ^2L}{2}\left\| \sum _{m=1}^M \zeta _{k,m}\right\| ^2 \le \frac{3\eta ^2L}{2}Mc\sigma ^2(\tau +1+\iota ). \end{aligned}$$

And with probability at least \(1-e^{-\iota /TML\eta +\log (\tau +T)+\log T}\ge 1-e^{-\iota } \) (when \(\tau \) is large enough),

$$\begin{aligned} \sum _{k=t_0-T}^{t_0+\tau } L^2 M\eta ^3 \left\| \sum _{j=k-\tau ^{\max }_{k}}^{k-1}\sum _{m=1}^M \zeta _{j,m}\right\| ^2\le & {} L^2 T M\eta ^3 Mc\sigma ^2(\tau /TM\eta +1+T+\iota )\nonumber \\\le & {} \frac{\eta ^2L}{2}Mc\sigma ^2(\tau +1+\iota ). \end{aligned}$$
(62)

Note that \(\eta ^2\left( \frac{3 L}{4}-L^2M T^2\eta \right) -\frac{\eta }{2M}<0\). Combining all the above facts, with probability at least \(1-3e^{-\iota }\),

$$\begin{aligned} f(x_{t_0+\tau +1})-f(x_{t_0})\le & {} \sum _{k=t_0}^{t_0+\tau } -\frac{3M\eta }{8}\Vert \nabla f(x_k)\Vert ^2 + c\eta \sigma ^2\iota \nonumber \\{} & {} +\,2\eta ^2LMc\sigma ^2(\tau +1+\iota )\nonumber \\{} & {} +\, L^2TM\eta ^3\sum _{k=t_0-T}^{t_0-1} T \left\| \sum _{m=1}^M\nabla f(x_{k-\tau _{k,m}})\right\| ^2. \end{aligned}$$
(63)

The theorem follows. \(\square \)

Proof of Lemmas in Sect. A

Proof of Lemma A.5

For positive semi-definite matrices \({\varvec{A}}_i\) and weights \(a_i\ge 0\) with \(\sum _i a_i=1\), we have,

$$\begin{aligned} tr\prod _{i=1}^n {\varvec{A}}_i^{a_i}\le \sum _i a_i\, tr\, {\varvec{A}}_i. \end{aligned}$$

However, it is impossible to use this inequality directly: when \({\varvec{Y}}_i\) and \({\varvec{Y}}_j\) are not commutative for \(i\ne j\), we have \(e^{\sum _i^n {\varvec{Y}}_i}\ne \prod _i^n e^{{\varvec{Y}}_i} \), and even \(tr\{e^{\sum _i^n {\varvec{Y}}_i}\}\ne tr\{\prod _i^n e^{{\varvec{Y}}_i} \}\). In the case \(n=2\), we have the Golden–Thompson inequality [8], which says \(tr\{e^{{\varvec{Y}}_1+{\varvec{Y}}_2}\}\le tr\{e^{{\varvec{Y}}_1}e^{{\varvec{Y}}_2} \}\). However, it is false for \(n=3\), as studied by Lieb [12]. Fortunately, for \(n\ge 3\), we have the Sutter–Berta–Tomamichel inequality [18]:

Let \(\Vert \cdot \Vert _*\) be the trace norm, and \({\varvec{H}}_k\) be Hermitian matrix. We have,

$$\begin{aligned} \log \left\| \exp \left( \sum _k^n {\varvec{H}}_k\right) \right\| _* \le \int \log \left\| \prod _k^n \exp [(1+it){\varvec{H}}_k]\right\| _*\,\textrm{d} \beta (t), \end{aligned}$$
(64)

where \(\beta \) is a probability measure. For the right-hand side, we have,

$$\begin{aligned} \left\| \prod _k^n \exp [(1+it){\varvec{H}}_k]\right\| _*&\le \sum _i \sigma _i\left( \prod _k^n \exp [(1+it){\varvec{H}}_k]\right) \\&\le \sum _i \sigma _i(\exp ({\varvec{H}}_1))\sigma _i(\exp ({\varvec{H}}_2))\cdots \sigma _i(\exp ({\varvec{H}}_n)), \end{aligned}$$

where \(\sigma _i\) is the ith singular value.

If all the \({\varvec{H}}_k\) are positive semi-definite, then \(\lambda _i=\sigma _i\), and using the elementary inequality:

$$\begin{aligned} \sum _i \lambda _i^{\alpha _1}(\exp ({\varvec{H}}_1))\lambda _i^{\alpha _2}(\exp ({\varvec{H}}_2))...\lambda _i^{\alpha _n}(\exp ({\varvec{H}}_n))\le \sum _i \left( \sum _k\alpha _k \lambda _i(\exp ({\varvec{H}}_k))\right) ,\nonumber \\ \end{aligned}$$
(65)

where \(\sum _k \alpha _k=1\), we have \(\left\| \prod _k^n \exp [(1+it){\varvec{H}}_k]\right\| _*\le tr \left\{ \sum _k \alpha _k \exp ({\varvec{H}}_k)\right\} .\)

Therefore \(tr \{\exp \left( \sum _k^n \alpha _k {\varvec{H}}_k\right) \} \le \sum _k \alpha _k tr \{ \exp ({\varvec{H}}_k)\}. \) \(\square \)
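A quick numerical check (ours; it assumes NumPy and SciPy are available) of the two-matrix Golden–Thompson inequality quoted above:

    import numpy as np
    from scipy.linalg import expm

    # Check tr e^{Y1+Y2} <= tr(e^{Y1} e^{Y2}) on random symmetric matrices.
    rng = np.random.default_rng(3)
    d = 6
    for _ in range(100):
        A = rng.normal(size=(d, d)); Y1 = (A + A.T) / 2
        B = rng.normal(size=(d, d)); Y2 = (B + B.T) / 2
        lhs = np.trace(expm(Y1 + Y2))
        rhs = np.trace(expm(Y1) @ expm(Y2))
        assert lhs <= rhs * (1 + 1e-9)
    print("Golden-Thompson held on all 100 random pairs")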

Proof of Lemma A.6

Let \(\eta =\frac{\epsilon ^2}{w\sigma ^2L}\). Thus, if \(\epsilon ^2\le {\mathcal {O}}\left( \frac{1}{MT^{2/3}}\right) \), then \(2\eta ^2M^2L^2T^3\le 1/5\) and \(\eta L M (T+1) \le {\mathcal {O}}(1)\), and (a) follows.

For (b), we have,

$$\begin{aligned} T^2_{\max } M^2\eta ^2\rho ^2 S^2\sim & {} T^2_{\max } M^2\eta ^2\rho ^2 L\eta MT_{\max }\eta ^2 M T^2_{\max }\sigma ^2\nonumber \\= & {} T^4_{\max } \eta ^5 M^4\rho ^2 L \sigma ^2\le {\tilde{O}}\left( \frac{\eta L\sigma ^2}{\epsilon ^2}\right) \le O \left( \frac{1}{w}\right) . \end{aligned}$$
(66)

Thus if \(w\le {\tilde{{\mathcal {O}}}}(1)\) is large enough, we have \(T_{\max } M\eta \rho S\le {\mathcal {O}}(1)\). And

$$\begin{aligned} T_{\max }M \eta ^2\ell ^2\le {\mathcal {O}}\left( \frac{\epsilon ^2 \ell ^2}{w\sigma ^2 L\sqrt{\rho \epsilon }}\right) ,\quad \sqrt{T_{\max }M} \eta \ell \le {\mathcal {O}}(\epsilon ^{3/4}/\sqrt{w}). \end{aligned}$$

If \(w\le {\tilde{{\mathcal {O}}}}(1)\), (b) follows.

As for (c), note that \(S^2=B^2L\eta MT_{\max }\eta ^2 M T^2_{\max }\sigma ^2\),

$$\begin{aligned}&\frac{S^2-3\eta ^2M\sigma ^2 T_{\max }c^2c_2}{3\eta ^2M^2T_{\max }}-2L^2\eta ^2T^3M^2 F- 2cT_{\max }2L^2M\eta ^2T\sigma ^2\\&\quad =\frac{B^2}{3}L\eta T_{\max }^2\sigma ^2 -\frac{1}{M}c^2c_2\sigma ^2\\&\qquad -\,(2L^2\eta ^2T^2M^2) T60c\sigma ^2 \eta LT- (\eta TM L) 4c T_{\max }L\eta \sigma ^2\\&\quad \ge \frac{B^2}{3}L\eta T_{\max }^2\sigma ^2 -\frac{1}{M}c^2c_2\sigma ^2- 40c\sigma ^2 \eta LT^2- 4/3 c T_{\max }L\eta \sigma ^2. \end{aligned}$$

Since \(T_{\max }>T\), \(L\eta T_{\max }^2\sigma ^2> \sigma ^2 \eta LT^2\), and \(L\eta T_{\max }^2\sigma ^2> T_{\max }L\eta \sigma ^2\), there exists \(B\le {\mathcal {O}}(1)\) such that

$$\begin{aligned}&\frac{B^2}{3}L\eta T_{\max }^2\sigma ^2 -\frac{1}{M}c^2c_2\sigma ^2- 40c\sigma ^2 \eta LT^2- 4/3c T_{\max }L\eta \sigma ^2\\&\quad \ge 2T T_{\max }\eta L\sigma ^2\\&\quad =2TF_2. \end{aligned}$$

Then (c) follows. Setting \(2^{u }=\frac{6\sqrt{3}\sqrt{2q d}\cdot 2S}{\sqrt{M}\eta r}\) gives \(u= {\widetilde{{\mathcal {O}}}}(1)\), and (d) follows. When \(u\le {\widetilde{{\mathcal {O}}}}(1)\) is large enough, (e) follows. \(\square \)


About this article


Cite this article

Wang, L., Shen, B. On the Parallelization Upper Bound for Asynchronous Stochastic Gradients Descent in Non-convex Optimization. J Optim Theory Appl 196, 900–935 (2023). https://doi.org/10.1007/s10957-022-02141-9

