
Variance reduction for root-finding problems

  • Full Length Paper
  • Series A
Mathematical Programming

Abstract

Minimizing finite sums of smooth and strongly convex functions is an important task in machine learning. Recent work has developed stochastic gradient methods that optimize these sums with less computation than methods that do not exploit the finite sum structure. This speedup results from using efficiently constructed stochastic gradient estimators, which have variance that diminishes as the algorithm progresses. In this work, we ask whether the benefits of variance reduction extend to fixed point and root-finding problems involving sums of nonlinear operators. Our main result shows that variance reduction offers a similar speedup when applied to a broad class of root-finding problems. We illustrate the result on three tasks involving sums of n nonlinear operators: averaged fixed point, monotone inclusions, and nonsmooth common minimizer problems. In certain “poorly conditioned regimes,” the proposed method offers an n-fold speedup over standard methods.


Notes

  1. All results in the paper extend to separable Hilbert spaces with minimal modifications, but for simplicity we state the results only in Euclidean space.

  2. We will show in Lemma 1.4 that \({{\,\mathrm{zer}\,}}(S)\) is closed and convex, which implies that \(\mathrm {proj}_{{{\,\mathrm{zer}\,}}(S)}\) is well-defined.

  3. See [8, 26, 41] for information on convex error bounds and Sect. 3 for further examples.

  4. Solutions to (1.11) may be recovered from any root \(x^*\) of S, since \(J_{\gamma A}(x^*)\) is a zero of \(A + \frac{1}{n}\sum _{i=1}^n B_i\) [1, Proposition 2.3(ii)].

References

  1. Attouch, H., Peypouquet, J., Redont, P.: Backward-forward algorithms for structured monotone inclusions in Hilbert spaces. J. Math. Anal. Appl. 457(2), 1095–1117 (2018)

  2. Baillon, J.B., Combettes, P.L., Cominetti, R.: Asymptotic behavior of compositions of under-relaxed nonexpansive operators. arXiv preprint arXiv:1304.7078 (2013)

  3. Balamurugan, P., Bach, F.: Stochastic variance reduction methods for saddle-point problems. In: Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain (2016)

  4. Bauschke, H.H., Combettes, P.L.: A weak-to-strong convergence principle for Fejér-monotone methods in Hilbert spaces. Math. Oper. Res. 26(2), 248–264 (2001)

  5. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)

  6. Bauschke, H.H., Combettes, P.L., Kruk, S.G.: Extrapolation algorithm for affine-convex feasibility problems. Numer. Algorithms 41(3), 239–274 (2006)

  7. Bianchi, P.: A stochastic proximal point algorithm: convergence and application to convex optimization. In: IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP) (2015)

  8. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165, 471–507 (2016)

  9. Combettes, P.L.: Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Trans. Image Process. 6(4), 493–506 (1997)

  10. Combettes, P.L.: Quasi-Fejérian analysis of some optimization algorithms. In: Butnariu, D., Censor, Y., Reich, S. (eds.) Inherently Parallel Algorithms in Feasibility and Optimization and their Applications. Studies in Computational Mathematics, vol. 8, pp. 115–152. Elsevier (2001)

  11. Combettes, P.L.: Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization 53(5–6), 475–504 (2004)

  12. Combettes, P.L., Eckstein, J.: Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions. Math. Program. 168, 645–672 (2016)

  13. Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)

  14. Combettes, P.L., Pesquet, J.C.: Stochastic approximations and perturbations in forward-backward splitting for monotone operators. Pure Appl. Funct. Anal. 1(1), 13–37 (2016)

  15. Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping II: mean-square and linear convergence. arXiv preprint arXiv:1704.08083 (2017)

  16. Combettes, P.L., Yamada, I.: Compositions and convex combinations of averaged nonexpansive operators. J. Math. Anal. Appl. 425(1), 55–70 (2015)

  17. Davis, D.: SMART: the stochastic monotone aggregated root-finding algorithm. arXiv preprint arXiv:1601.00698 (2016)

  18. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

  19. Defazio, A., Domke, J., Caetano, T.: Finito: a faster, permutable incremental gradient method for big data problems. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1125–1133 (2014)

  20. Durrett, R.: Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge (2010). https://doi.org/10.1017/CBO9780511779398

  21. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  22. Ioffe, A.D.: Variational Analysis of Regular Mappings. Springer Monographs in Mathematics, Springer, Cham (2017)

  23. Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 1145–1153. Curran Associates Inc, Red Hook (2016)

  24. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  25. Krasnosel’skii, M.A.: Two remarks on the method of successive approximations. Uspekhi Mat. Nauk 10(1), 123–127 (1955)

  26. Lai, M.J., Yin, W.: Augmented \(\ell _1\) and nuclear-norm models with a globally linearly convergent algorithm. SIAM J. Imag. Sci. 6(2), 1059–1091 (2013)

  27. Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin (2013)

  28. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16, 285–322 (2015)

  29. Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4(3), 506–510 (1953)

  30. Nemirovskii, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983). Translated by E.R. Dawson

  31. Peng, Z., Xu, Y., Yan, M., Yin, W.: ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput. 38(5), A2851–A2879 (2016)

  32. Polyak, B.T.: Minimization of unsmooth functionals. USSR Comput. Math. Math. Phys. 9(3), 14–29 (1969)

  33. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)

  34. Robbins, H., Siegmund, D.: A convergence theorem for nonnegative almost supermartingales and some applications. In: Herbert Robbins Selected Papers, pp. 111–135. Springer (1985)

  35. Rosasco, L., Villa, S., Vũ, B.C.: A stochastic forward-backward splitting method for solving monotone inclusions in Hilbert spaces. arXiv preprint arXiv:1403.7999 (2014)

  36. Ryu, E.K., Boyd, S.: A primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)

  37. Schmidt, M., Roux, N.L., Bach, F.: Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388 (2013)

  38. Shalev-Shwartz, S., Zhang, T.: Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155, 105–145 (2016)

  39. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(2), 567–599 (2013)

  40. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)

  41. Zhang, H.: The restricted strong convexity revisited: analysis of equivalence to error bound and quadratic growth. Optim. Lett. 11(4), 817–833 (2017)


Acknowledgements

We thank B. Edmunds, P. Combettes, W. Yin, and the anonymous reviewers for their insightful feedback.

Author information

Corresponding author

Correspondence to Damek Davis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This material is supported by an Alfred P. Sloan research fellowship and NSF DMS award 2047637.

Appendix

Convergence of candidate algorithm (1.5)

Lemma A.1

Suppose that \(S :{\mathcal {H}}\rightarrow {\mathcal {H}}\) satisfies conditions (1.3) and (1.4). Let \(x^0 \in {\mathcal {H}}\), let \(\lambda = \min _i\{\beta _i \}\), and consider the following iteration:

$$\begin{aligned} x^{k+1} = x^k - \lambda S(x^k). \end{aligned}$$

Then for all \(k \in {\mathbb {N}}\), we have

$$\begin{aligned} \mathrm {dist}^2(x^{k+1}, {{\,\mathrm{zer}\,}}(S)) \le \left( 1 - \min _i\{\beta _i\}\mu \right) \mathrm {dist}^2(x^k, {{\,\mathrm{zer}\,}}(S)). \end{aligned}$$

Consequently, after at most

$$\begin{aligned} \left\lceil \frac{\log (\mathrm {dist}^2(x^0, {{\,\mathrm{zer}\,}}(S))/\varepsilon )}{\mu \min _i\{\beta _i\}}\right\rceil \end{aligned}$$

iterations, the point \(x^k\) satisfies \(\mathrm {dist}^2(x^k, {{\,\mathrm{zer}\,}}(S)) \le \varepsilon \); indeed, iterating the contraction and using \(1 - t \le e^{-t}\) gives \(\mathrm {dist}^2(x^k, {{\,\mathrm{zer}\,}}(S)) \le e^{-k\mu \min _i\{\beta _i\}}\,\mathrm {dist}^2(x^0, {{\,\mathrm{zer}\,}}(S))\).

Proof

For all \(k \in {\mathbb {N}}\) define \({\bar{x}}^k := \mathrm {proj}_{{{\,\mathrm{zer}\,}}(S)}(x^k)\) and observe that

$$\begin{aligned} \Vert x^{k+1} - {\bar{x}}^k\Vert ^2&= \Vert x^k - {\bar{x}}^k\Vert ^2 - 2\lambda \langle S(x^k), x^{k} - {\bar{x}}^k\rangle + \lambda ^2\Vert S(x^k)\Vert ^2. \end{aligned}$$
(A.1)

By Jensen’s inequality and the fact that \(S({\bar{x}}^k) = 0\), we can upper bound the term \(\Vert S(x^k)\Vert ^2\):

$$\begin{aligned} \Vert S(x^k)\Vert ^2&\le \frac{1}{n}\sum _{i=1}^n \Vert S_i(x^k) - S_i({\bar{x}}^k)\Vert ^2 \\&\le \frac{1}{n\min _i\{\beta _i\} } \sum _{i=1}^n \beta _i \Vert S_i(x^k) - S_i({\bar{x}}^k)\Vert ^2\\&\le \frac{1}{\min _i\{\beta _i\} } \langle S(x^k), x^k - {\bar{x}}^k\rangle , \end{aligned}$$

where the final inequality follows from (1.3). Using this estimate in (A.1), we find that

$$\begin{aligned} \Vert x^{k+1} - {\bar{x}}^k\Vert ^2&\le \Vert x^k - {\bar{x}}^k\Vert ^2 - \lambda \left( 2 - \frac{\lambda }{\min _i\{\beta _i\}}\right) \langle S(x^k), x^{k} - {\bar{x}}^k\rangle \\&\le (1-\min _i\{\beta _i\}\mu )\Vert x^k - {\bar{x}}^k\Vert ^2 = (1-\min _i\{\beta _i\}\mu )\mathrm {dist}^2(x^{k}, {{\,\mathrm{zer}\,}}(S)), \end{aligned}$$

where the final inequality follows from (1.4). To conclude the proof, note that

$$\begin{aligned} \mathrm {dist}^2(x^{k+1}, {{\,\mathrm{zer}\,}}(S)) \le \Vert x^{k+1} - {\bar{x}}^k\Vert ^2 \le (1-\min _i\{\beta _i\}\mu )\mathrm {dist}^2(x^{k}, {{\,\mathrm{zer}\,}}(S)), \end{aligned}$$

as desired. \(\square \)


About this article


Cite this article

Davis, D. Variance reduction for root-finding problems. Math. Program. 197, 375–410 (2023). https://doi.org/10.1007/s10107-021-01758-4

