
Variance reduction for root-finding problems

  • Full Length Paper
  • Series A
Mathematical Programming

Abstract

Minimizing finite sums of smooth and strongly convex functions is an important task in machine learning. Recent work has developed stochastic gradient methods that optimize these sums with less computation than methods that do not exploit the finite sum structure. This speedup results from using efficiently constructed stochastic gradient estimators, which have variance that diminishes as the algorithm progresses. In this work, we ask whether the benefits of variance reduction extend to fixed point and root-finding problems involving sums of nonlinear operators. Our main result shows that variance reduction offers a similar speedup when applied to a broad class of root-finding problems. We illustrate the result on three tasks involving sums of n nonlinear operators: averaged fixed point, monotone inclusions, and nonsmooth common minimizer problems. In certain “poorly conditioned regimes,” the proposed method offers an n-fold speedup over standard methods.


Notes

  1. All results in the paper extend to separable Hilbert spaces with minimal modifications, but for simplicity we state the results only in Euclidean space.

  2. We will show in Lemma 1.4 that \({{\,\mathrm{zer}\,}}(S)\) is closed and convex, which implies that \(\mathrm {proj}_{{{\,\mathrm{zer}\,}}(S)}\) is well-defined.

  3. See [8, 26, 41] for information on convex error bounds and Sect. 3 for further examples.

  4. Solutions to (1.11) may be recovered from any root \(x^*\) of S, since \(J_{\gamma A}(x^*)\) is a zero of \(A + \frac{1}{n}\sum _{i=1}^n B_i\) [1, Proposition 2.3(ii)].

References

  1. Attouch, H., Peypouquet, J., Redont, P.: Backward-forward algorithms for structured monotone inclusions in Hilbert spaces. J. Math. Anal. Appl. 457(2), 1095–1117 (2018)

  2. Baillon, J.B., Combettes, P.L., Cominetti, R.: Asymptotic behavior of compositions of under-relaxed nonexpansive operators. arXiv preprint arXiv:1304.7078 (2013)

  3. Balamurugan, P., Bach, F.: Stochastic variance reduction methods for saddle-point problems. In: Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain (2016)

  4. Bauschke, H.H., Combettes, P.L.: A weak-to-strong convergence principle for Fejér-monotone methods in Hilbert spaces. Math. Oper. Res. 26(2), 248–264 (2001)

  5. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)

  6. Bauschke, H.H., Combettes, P.L., Kruk, S.G.: Extrapolation algorithm for affine-convex feasibility problems. Numer. Algorithms 41(3), 239–274 (2006)

  7. Bianchi, P.: A stochastic proximal point algorithm: convergence and application to convex optimization. In: IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP) (2015)

  8. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165, 471–507 (2016)

  9. Combettes, P.L.: Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Trans. Image Process. 6(4), 493–506 (1997)

  10. Combettes, P.L.: Quasi-Fejérian analysis of some optimization algorithms. In: Butnariu, D., Censor, Y., Reich, S. (eds.) Inherently Parallel Algorithms in Feasibility and Optimization and their Applications. Studies in Computational Mathematics, vol. 8, pp. 115–152. Elsevier (2001)

  11. Combettes, P.L.: Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization 53(5–6), 475–504 (2004)

  12. Combettes, P.L., Eckstein, J.: Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions. Math. Program. 168, 645–672 (2016)

  13. Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)

  14. Combettes, P.L., Pesquet, J.C.: Stochastic approximations and perturbations in forward-backward splitting for monotone operators. Pure Appl. Funct. Anal. 1(1), 13–37 (2016)

  15. Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping II: mean-square and linear convergence. arXiv preprint arXiv:1704.08083 (2017)

  16. Combettes, P.L., Yamada, I.: Compositions and convex combinations of averaged nonexpansive operators. J. Math. Anal. Appl. 425(1), 55–70 (2015)

  17. Davis, D.: SMART: the stochastic monotone aggregated root-finding algorithm. arXiv preprint arXiv:1601.00698 (2016)

  18. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

  19. Defazio, A., Domke, J., Caetano, T.: Finito: a faster, permutable incremental gradient method for big data problems. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1125–1133 (2014)

  20. Durrett, R.: Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge (2010). https://doi.org/10.1017/CBO9780511779398

  21. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  22. Ioffe, A.D.: Variational Analysis of Regular Mappings. Springer Monographs in Mathematics, Springer, Cham (2017)

  23. Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 1145–1153. Curran Associates Inc, Red Hook (2016)

  24. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  25. Krasnosel’skii, M.A.: Two remarks on the method of successive approximations. Uspekhi Mat. Nauk 10(1), 123–127 (1955)

  26. Lai, M.J., Yin, W.: Augmented \(\ell _1\) and nuclear-norm models with a globally linearly convergent algorithm. SIAM J. Imag. Sci. 6(2), 1059–1091 (2013)

  27. Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin (2013)

  28. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16, 285–322 (2015)

  29. Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4(3), 506–510 (1953)

  30. Nemirovskii, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983). Translated by E.R. Dawson

  31. Peng, Z., Xu, Y., Yan, M., Yin, W.: ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput. 38(5), A2851–A2879 (2016)

  32. Polyak, B.T.: Minimization of unsmooth functionals. USSR Comput. Math. Math. Phys. 9(3), 14–29 (1969)

  33. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)

  34. Robbins, H., Siegmund, D.: A convergence theorem for nonnegative almost supermartingales and some applications. In: Herbert Robbins Selected Papers, pp. 111–135. Springer (1985)

  35. Rosasco, L., Villa, S., Vũ, B.C.: A stochastic forward-backward splitting method for solving monotone inclusions in Hilbert spaces. arXiv preprint arXiv:1403.7999 (2014)

  36. Ryu, E.K., Boyd, S.: A primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)

  37. Schmidt, M., Roux, N.L., Bach, F.: Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388 (2013)

  38. Shalev-Shwartz, S., Zhang, T.: Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155, 105–145 (2016)

  39. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(2), 567–599 (2013)

  40. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)

  41. Zhang, H.: The restricted strong convexity revisited: analysis of equivalence to error bound and quadratic growth. Optim. Lett. 11(4), 817–833 (2017)


Acknowledgements

We thank B. Edmunds, P. Combettes, W. Yin, and the anonymous reviewers for their insightful feedback.

Author information

Corresponding author

Correspondence to Damek Davis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This material is supported by an Alfred P. Sloan research fellowship and NSF DMS award 2047637.

Appendix

Convergence of candidate algorithm (1.5)

Lemma A.1

Suppose that \(S :{\mathcal {H}}\rightarrow {\mathcal {H}}\) satisfies conditions (1.3) and (1.4). Let \(x^0 \in {\mathcal {H}}\), let \(\lambda = \min _i\{\beta _i \}\), and consider the following iteration:

$$\begin{aligned} x^{k+1} = x^k - \lambda S(x^k). \end{aligned}$$

Then for all \(k \in {\mathbb {N}}\), we have

$$\begin{aligned} \mathrm {dist}^2(x^{k+1}, {{\,\mathrm{zer}\,}}(S)) \le \left( 1 - \min _i\{\beta _i\}\mu \right) \mathrm {dist}^2(x^k, {{\,\mathrm{zer}\,}}(S)). \end{aligned}$$

Consequently, after at most

$$\begin{aligned} \left\lceil \frac{\log (\mathrm {dist}^2(x^0, {{\,\mathrm{zer}\,}}(S))/\varepsilon )}{\mu \min _i\{\beta _i\}}\right\rceil \end{aligned}$$

iterations, the point \(x^k\) satisfies \(\mathrm {dist}^2(x^k, {{\,\mathrm{zer}\,}}(S)) \le \varepsilon \); indeed, iterating the contraction and using \(1 - t \le e^{-t}\) gives \(\mathrm {dist}^2(x^k, {{\,\mathrm{zer}\,}}(S)) \le e^{-k\mu \min _i\{\beta _i\}}\,\mathrm {dist}^2(x^0, {{\,\mathrm{zer}\,}}(S))\).

Proof

For all \(k \in {\mathbb {N}}\) define \({\bar{x}}^k := \mathrm {proj}_{{{\,\mathrm{zer}\,}}(S)}(x^k)\) and observe that

$$\begin{aligned} \Vert x^{k+1} - {\bar{x}}^k\Vert ^2&= \Vert x^k - {\bar{x}}^k\Vert ^2 - 2\lambda \langle S(x^k), x^{k} - {\bar{x}}^k\rangle + \lambda ^2\Vert S(x^k)\Vert ^2. \end{aligned}$$
(A.1)

By Jensen’s inequality and the fact that \(S({\bar{x}}^k) = 0\), we can upper bound the term \(\Vert S(x^k)\Vert ^2\):

$$\begin{aligned} \Vert S(x^k)\Vert ^2&\le \frac{1}{n}\sum _{i=1}^n \Vert S_i(x^k) - S_i({\bar{x}}^k)\Vert ^2 \\&\le \frac{1}{n\min _i\{\beta _i\} } \sum _{i=1}^n \beta _i \Vert S_i(x^k) - S_i({\bar{x}}^k)\Vert ^2\\&\le \frac{1}{\min _i\{\beta _i\} } \langle S(x^k), x^k - {\bar{x}}^k\rangle , \end{aligned}$$

where the final inequality follows from (1.3). Using this estimate in (A.1), we find that

$$\begin{aligned} \Vert x^{k+1} - {\bar{x}}^k\Vert ^2&\le \Vert x^k - {\bar{x}}^k\Vert ^2 - \lambda \left( 2 - \frac{\lambda }{\min _i\{\beta _i\}}\right) \langle S(x^k), x^{k} - {\bar{x}}^k\rangle \\&\le (1-\min _i\{\beta _i\}\mu )\Vert x^k - {\bar{x}}^k\Vert ^2 = (1-\min _i\{\beta _i\}\mu )\mathrm {dist}^2(x^{k}, {{\,\mathrm{zer}\,}}(S)), \end{aligned}$$

where the final inequality follows from (1.4). To conclude the proof, note that

$$\begin{aligned} \mathrm {dist}^2(x^{k+1}, {{\,\mathrm{zer}\,}}(S)) \le \Vert x^{k+1} - {\bar{x}}^k\Vert ^2 \le (1-\min _i\{\beta _i\}\mu )\mathrm {dist}^2(x^{k}, {{\,\mathrm{zer}\,}}(S)), \end{aligned}$$

as desired. \(\square \)


About this article


Cite this article

Davis, D. Variance reduction for root-finding problems. Math. Program. 197, 375–410 (2023). https://doi.org/10.1007/s10107-021-01758-4

