
An elementary approach to tight worst case complexity analysis of gradient based methods

Full Length Paper · Mathematical Programming, Series A

Abstract

This work presents a novel analysis that yields tight complexity bounds for gradient-based methods in convex optimization. We start by identifying some of the pitfalls rooted in the classical complexity analysis of the gradient descent method and show how they can be remedied. Our methodology hinges on elementary and direct arguments in the spirit of the classical analysis. It allows us to establish new (and reproduce known) tight complexity results for several fundamental algorithms, including the gradient descent, proximal point, and proximal gradient methods, which previously could be proven only through computer-assisted convergence proof arguments.

Notes

  1. Though we will commonly write “solving a problem”, here and throughout the rest of the paper, it should be understood as finding an approximate solution to the problem.

  2. Inequality (\(\bigstar \)) will play a starring role in this work, hence the label.

  3. By applying (\(\bigstar \)) twice at \((x,y)\), first as is and secondly after reversing the roles of \(x\) and \(y\).

  4. In the case that \(\Vert \nabla f(x^{n+1})\Vert =\Vert \nabla f(x^{n})\Vert \).

  5. Under the convention that \(T_{-1}:=0\).

  6. Observe that we change the index from \(i\) to \(n\) when we use (4.7) in our development.

  7. Clearly, the example presented in [4] for the gradient projection method, which illustrates that the complexity bound is tight, also holds for the more general proximal gradient method.

References

  1. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  2. Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM, Philadelphia (2017)

  3. Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)

  4. Drori, Y.: Contributions to the complexity analysis of optimization algorithms. Ph.D. Thesis, Tel-Aviv University (2014)

  5. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. Ser. A 145(1–2), 451–482 (2014)

  6. Goldstein, A.A.: Convex programming in Hilbert space. Bull. Am. Math. Soc. 70(5), 709–710 (1964)

  7. Güler, O.: On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control. Optim. 29(2), 403–419 (1991)

  8. Graham, R., Knuth, D., Patashnik, O.: Concrete Mathematics: A Foundation for Computer Science, 2nd edn. Addison-Wesley, Boston (1994)

  9. Kim, D., Fessler, J.A.: Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. J. Optim. Theory Appl. 188, 192–219 (2021)

  10. Levitin, E.S., Polyak, B.T.: Constrained minimization methods. USSR Comp. Math. Math. Phys. 6(5), 1–50 (1966)

  11. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)

  12. Martinet, B.: Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informatique. Recherche Opérationnelle 4, 154–158 (1970)

  13. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93, 273–299 (1965)

  14. Passty, G.B.: Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. J. Math. Anal. Appl. 72(2), 383–390 (1979)

  15. Sabach, S., Teboulle, M.: Lagrangian methods for composite optimization. Handb. Numer. Anal. 20, 401–436 (2019)

  16. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. Ser. A 161(1–2), 307–345 (2017)

  17. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)

  18. Teboulle, M.: A simplified view of first order methods for optimization. Math. Program. Ser. B 170(1), 67–96 (2018)


Funding

This research was partially supported by the Israel Science Foundation under ISF Grants 1844-16 and 2619-20.

Author information

Corresponding author

Correspondence to Marc Teboulle.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

In the following lemma we establish a useful relation between \(\lambda _n\) and \(T_n\).

Lemma 13

For a given positive integer k, let \(\{t_n\}_{n=0}^{k-1}\) be the sequence defined by the recurrence relation (4.4). Then for any \(n=0,1,\dots ,k-1\) it holds that

$$\begin{aligned} \lambda _n = \frac{LT_n}{2+LT_n}. \end{aligned}$$
(A.1)

Proof

Recall that according to (4.4), \(Lt_0=\sqrt{2}\) and that \(Lt_n\) is the positive root of

$$\begin{aligned} 0 &= (Lt_n)^2+(LT_{n-1})Lt_n-2(LT_{n-1}+1)\\ &= (Lt_n-1)^2+(LT_{n-1}+2)(Lt_n-1)-(LT_{n-1}+1), \end{aligned}$$

for any \(n=1,2,\dots ,k-1\). Thus, \(\lambda _0=Lt_0-1=\sqrt{2}-1\) and \(\lambda _n=Lt_n-1\) is the positive root of

$$\begin{aligned} \lambda _n^2+(LT_{n-1}+2)\lambda _n-(LT_{n-1}+1)=0, \end{aligned}$$

for any \(n=1,2,\dots ,k-1\). The last quadratic equation can be written as

$$\begin{aligned} \lambda _n\left( 2+LT_{n-1}+(\lambda _n+1)\right) = LT_{n-1}+\lambda _n+1. \end{aligned}$$

Using the relation \(LT_n=LT_{n-1}+Lt_n=LT_{n-1}+\lambda _n+1\) we can write the above as

$$\begin{aligned} \lambda _n = \frac{LT_{n-1}+(\lambda _n+1)}{2+LT_{n-1}+(\lambda _n+1)} = \frac{LT_n}{2+LT_n}. \end{aligned}$$

Thus (A.1) holds for \(n=1,2,\dots ,k-1\); it also holds for \(n=0\), since

$$\begin{aligned} \lambda _0 = \sqrt{2}-1 =(\sqrt{2}-1)\frac{(\sqrt{2}+1)\sqrt{2}}{2+\sqrt{2}} =\frac{(2-1)\sqrt{2}}{2+\sqrt{2}} = \frac{LT_0}{2+LT_0}. \end{aligned}$$

\(\square \)
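
The identity (A.1) also lends itself to a quick numerical sanity check. The short Python sketch below is ours and is not part of the original argument; it iterates the recurrence exactly as described in the proof above (\(Lt_0=\sqrt{2}\), and for \(n\ge 1\), \(Lt_n\) is the positive root of \((Lt_n)^2+(LT_{n-1})Lt_n-2(LT_{n-1}+1)=0\)) and checks that \(\lambda _n=Lt_n-1\) agrees with \(LT_n/(2+LT_n)\). The function name verify_A1 is introduced only for illustration.

```python
import math

def verify_A1(k=10):
    """Numerically check (A.1): lambda_n = L*T_n / (2 + L*T_n)."""
    Lt = math.sqrt(2.0)   # L*t_0 = sqrt(2), as stated in the proof
    LT = Lt               # L*T_0 = L*t_0
    for n in range(k):
        lam = Lt - 1.0                      # lambda_n = L*t_n - 1
        assert abs(lam - LT / (2.0 + LT)) < 1e-12, n
        # L*t_{n+1} is the positive root of x^2 + (L*T_n) x - 2(L*T_n + 1) = 0
        Lt = (-LT + math.sqrt(LT ** 2 + 8.0 * (LT + 1.0))) / 2.0
        LT += Lt                            # L*T_{n+1} = L*T_n + L*t_{n+1}
    print(f"(A.1) verified for n = 0,...,{k - 1}")

verify_A1()
```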

We are now ready to prove Lemma 6.

Proof of Lemma 6

We begin by proving that \(\rho _n=0\) for all \(n=0,1,\dots ,k-1\). Since \(\lambda _0+1>0\), in order to show that \(\rho _0=(\lambda _0+1)(\tau (\lambda _0)-\lambda _0)=0\) it is enough to verify that \(\lambda _0=\tau (\lambda _0)\). The latter relation indeed holds for \(\lambda _0=\sqrt{2}-1\):

$$\begin{aligned} \sqrt{2}-1=\lambda _0=\tau (\lambda _0)&=1-\frac{2\lambda _0^2}{1-\lambda _0} = 1-\frac{2(\sqrt{2}-1)^2}{2-\sqrt{2}}\\ &=1-\frac{\sqrt{2}(\sqrt{2}-1)^2}{\sqrt{2}-1}=\sqrt{2}-1. \end{aligned}$$

To show that \(\rho _n=0\) for all \(n=1,2,\dots ,k-1\) we first observe that due to Lemma 13

$$\begin{aligned} \lambda _{n-1} = \frac{LT_{n-1}}{2+LT_{n-1}}=\frac{LT_{n}-(\lambda _n+1)}{2+LT_{n}-(\lambda _n+1)}=\frac{LT_{n}-\lambda _n-1}{LT_{n}-\lambda _n+1}. \end{aligned}$$

Hence,

$$\begin{aligned} 1+\lambda _{n-1}= \frac{LT_{n}-\lambda _n+1+LT_{n}-\lambda _n-1}{LT_{n}-\lambda _n+1} = \frac{2(LT_n-\lambda _n)}{LT_{n}-\lambda _n+1}, \end{aligned}$$

and

$$\begin{aligned} 1-\lambda _{n-1}= \frac{LT_{n}-\lambda _n+1-\left( LT_{n}-\lambda _n-1\right) }{LT_{n}-\lambda _n+1} = \frac{2}{LT_{n}-\lambda _n+1}, \end{aligned}$$

and thus

$$\begin{aligned} \tau ^+(\lambda _{n-1})=1+\frac{2\lambda _{n-1}}{1-\lambda _{n-1}}=\frac{1+\lambda _{n-1}}{1-\lambda _{n-1}} = LT_n-\lambda _n. \end{aligned}$$
(A.2)

Using the above we can write for all \(n=1,2,\dots ,k-1\)

$$\begin{aligned} \rho _n &= LT_n\tau (\lambda _n)+LT_{n-1}\tau ^{+}(\lambda _{n-1})-(\lambda _n+1)\lambda _n\\ &=LT_n\left( 1-\frac{2\lambda _n^2}{1-\lambda _n}\right) +\left[ LT_n-(\lambda _n+1)\right] (LT_n-\lambda _n)-(\lambda _n+1)\lambda _n\\ &=LT_n\left( 1-\frac{2\lambda _n^2}{1-\lambda _n}\right) +LT_n\left[ LT_n-(2\lambda _n+1)\right] =LT_n\left[ LT_n-\frac{2\lambda _n^2+2\lambda _n(1-\lambda _n)}{1-\lambda _n} \right] \\ &=LT_n\left[ LT_n-\frac{2\lambda _n}{1-\lambda _n}\right] \overset{(A.1)}{=}LT_n\left[ LT_n-\frac{2LT_n/(2+LT_n)}{2/(2+LT_n)}\right] =0. \end{aligned}$$

Finally, the claimed value of \(\rho _k\) is established, due to Lemma 13, as follows:

$$\begin{aligned} \rho _k=LT_{k-1}\tau ^+(\lambda _{k-1})\overset{(A.2)}{=}LT_{k-1}(LT_k-\lambda _k)=LT_{k-1}(LT_{k-1}+1), \end{aligned}$$

which completes the proof.\(\square \)
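
In the same spirit, the conclusion of Lemma 6 can be checked numerically. The Python sketch below is again ours and not part of the original argument (the helper names tau, tau_plus and verify_lemma6 are chosen only for illustration); it uses the expressions \(\tau (\lambda )=1-\frac{2\lambda ^2}{1-\lambda }\) and \(\tau ^+(\lambda )=1+\frac{2\lambda }{1-\lambda }\) that appear in the proof above, rebuilds \(\{Lt_n\}\) from the recurrence, and confirms that \(\rho _0=\dots =\rho _{k-1}=0\) while \(\rho _k=LT_{k-1}(LT_{k-1}+1)\).

```python
import math

def tau(lam):
    # tau(lambda) = 1 - 2*lambda^2 / (1 - lambda), as in the proof above
    return 1.0 - 2.0 * lam ** 2 / (1.0 - lam)

def tau_plus(lam):
    # tau^+(lambda) = 1 + 2*lambda / (1 - lambda)
    return 1.0 + 2.0 * lam / (1.0 - lam)

def verify_lemma6(k=8):
    # Build L*t_n and the partial sums L*T_n from the recurrence.
    Lt = [math.sqrt(2.0)]
    LT = [Lt[0]]
    for _ in range(1, k):
        prev = LT[-1]
        Lt.append((-prev + math.sqrt(prev ** 2 + 8.0 * (prev + 1.0))) / 2.0)
        LT.append(prev + Lt[-1])
    lam = [x - 1.0 for x in Lt]                 # lambda_n = L*t_n - 1

    # rho_0 and rho_n for n = 1,...,k-1, as written in the proof of Lemma 6
    rho = [(lam[0] + 1.0) * (tau(lam[0]) - lam[0])]
    for n in range(1, k):
        rho.append(LT[n] * tau(lam[n]) + LT[n - 1] * tau_plus(lam[n - 1])
                   - (lam[n] + 1.0) * lam[n])
    rho_k = LT[k - 1] * tau_plus(lam[k - 1])    # the boundary term rho_k

    assert all(abs(r) < 1e-9 for r in rho)
    assert abs(rho_k - LT[k - 1] * (LT[k - 1] + 1.0)) < 1e-9
    print(f"Lemma 6 verified numerically for k = {k}")

verify_lemma6()
```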

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Teboulle, M., Vaisbourd, Y. An elementary approach to tight worst case complexity analysis of gradient based methods. Math. Program. 201, 63–96 (2023). https://doi.org/10.1007/s10107-022-01899-0
