Abstract
Minimizing finite sums of smooth and strongly convex functions is an important task in machine learning. Recent work has developed stochastic gradient methods that optimize these sums with less computation than methods that do not exploit the finite sum structure. This speedup results from using efficiently constructed stochastic gradient estimators, which have variance that diminishes as the algorithm progresses. In this work, we ask whether the benefits of variance reduction extend to fixed point and root-finding problems involving sums of nonlinear operators. Our main result shows that variance reduction offers a similar speedup when applied to a broad class of root-finding problems. We illustrate the result on three tasks involving sums of n nonlinear operators: averaged fixed point, monotone inclusions, and nonsmooth common minimizer problems. In certain “poorly conditioned regimes,” the proposed method offers an n-fold speedup over standard methods.
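To make the variance-reduction idea concrete, the following is a minimal sketch of a SAGA-style estimator (in the spirit of the Defazio et al. reference below) applied to a finite sum of quadratics. The components, step size, and iteration count are illustrative choices for this sketch, not the setting of the paper: the key point is that the estimator `v` is unbiased for the full gradient and its variance vanishes as the iterates approach the solution.

```python
import numpy as np

# Hypothetical finite sum: f(x) = (1/n) sum_i f_i(x) with
# f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

x = np.zeros(d)
table = A * (A @ x - b)[:, None]      # stored component gradients g_i
avg = table.mean(axis=0)              # running average of the table
step = 1.0 / (3.0 * np.max(np.sum(A**2, axis=1)))

for k in range(4000):
    i = rng.integers(n)
    g_new = A[i] * (A[i] @ x - b[i])  # fresh gradient of f_i at x
    # Variance-reduced estimator: unbiased, and its variance shrinks
    # as both x and the stored table approach the solution.
    v = g_new - table[i] + avg
    x -= step * v
    avg += (g_new - table[i]) / n     # keep the table average in sync
    table[i] = g_new

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))
```

In contrast to plain stochastic gradient descent, no decreasing step size is needed here: the diminishing variance of `v` is what permits the constant step and the linear convergence the abstract alludes to.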
Notes
All results in the paper extend to separable Hilbert spaces with minimal modifications, but for simplicity we state the results only in Euclidean space.
We will show in Lemma 1.4 that \({{\,\mathrm{zer}\,}}(S)\) is closed and convex, which implies \(\mathrm {proj}_{{\mathrm {zer}(S)}}\) is well-defined.
References
Attouch, H., Peypouquet, J., Redont, P.: Backward-forward algorithms for structured monotone inclusions in Hilbert spaces. J. Math. Anal. Appl. 457(2), 1095–1117 (2018)
Baillon, J.B., Combettes, P.L., Cominetti, R.: Asymptotic behavior of compositions of under-relaxed nonexpansive operators. arXiv preprint arXiv:1304.7078 (2013)
Balamurugan, P., Bach, F.: Stochastic variance reduction methods for saddle-point problems. In: Advances in Neural Information Processing Systems, vol. 29. Barcelona, Spain (2016)
Bauschke, H.H., Combettes, P.L.: A weak-to-strong convergence principle for Fejér–Monotone methods in Hilbert spaces. Math. Oper. Res. 26(2), 248–264 (2001)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)
Bauschke, H.H., Combettes, P.L., Kruk, S.G.: Extrapolation algorithm for affine-convex feasibility problems. Numer. Algorithms 41(3), 239–274 (2006)
Bianchi, P.: A stochastic proximal point algorithm: convergence and application to convex optimization. In: IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP) (2015)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165, 471–507 (2016)
Combettes, P.L.: Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Trans. Image Process. 6(4), 493–506 (1997)
Combettes, P.L.: Quasi-Fejérian analysis of some optimization algorithms. In: Butnariu, D., Censor, Y., Reich, S. (eds.) Inherently Parallel Algorithms in Feasibility and Optimization and their Applications. Studies in Computational Mathematics, vol. 8, pp. 115–152. Elsevier (2001)
Combettes, P.L.: Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization 53(5–6), 475–504 (2004)
Combettes, P.L., Eckstein, J.: Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions. Math. Program. 168, 645–672 (2016)
Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)
Combettes, P.L., Pesquet, J.C.: Stochastic approximations and perturbations in forward-backward splitting for monotone operators. Pure Appl. Funct. Anal. 1(1), 13–37 (2016)
Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping II: mean-square and linear convergence. arXiv preprint arXiv:1704.08083 (2017)
Combettes, P.L., Yamada, I.: Compositions and convex combinations of averaged nonexpansive operators. J. Math. Anal. Appl. 425(1), 55–70 (2015)
Davis, D.: SMART: the stochastic monotone aggregated root-finding algorithm. arXiv preprint arXiv:1601.00698 (2016)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
Defazio, A., Domke, J., Caetano, T.: Finito: a faster, permutable incremental gradient method for big data problems. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1125–1133 (2014)
Durrett, R.: Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge (2010). https://doi.org/10.1017/CBO9780511779398
Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
Ioffe, A.D.: Variational Analysis of Regular Mappings. Springer Monographs in Mathematics, Springer, Cham (2017)
Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 1145–1153. Curran Associates Inc, Red Hook (2016)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
Krasnosel’skii, M.A.: Two remarks on the method of successive approximations. Uspekhi Mat. Nauk 10(1), 123–127 (1955)
Lai, M.J., Yin, W.: Augmented \(\ell _1\) and nuclear-norm models with a globally linearly convergent algorithm. SIAM J. Imag. Sci. 6(2), 1059–1091 (2013)
Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin (2013)
Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16, 285–322 (2015)
Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4(3), 506–510 (1953)
Nemirovskii, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Peng, Z., Xu, Y., Yan, M., Yin, W.: ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput. 38(5), A2851–A2879 (2016)
Polyak, B.T.: Minimization of unsmooth functionals. USSR Comput. Math. Math. Phys. 9(3), 14–29 (1969)
Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)
Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Herbert Robbins Selected Papers, pp. 111–135. Springer (1985)
Rosasco, L., Villa, S., Vũ, B.C.: A stochastic forward-backward splitting method for solving monotone inclusions in Hilbert spaces. arXiv preprint arXiv:1403.7999 (2014)
Ryu, E.K., Boyd, S.: A primer on monotone operator methods. Appl. Comput. Math 15(1), 3–43 (2016)
Schmidt, M., Roux, N.L., Bach, F.: Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388 (2013)
Shalev-Shwartz, S., Zhang, T.: Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155, 105–145 (2016)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(2), 567–599 (2013)
Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)
Zhang, H.: The restricted strong convexity revisited: analysis of equivalence to error bound and quadratic growth. Optim. Lett. 11(4), 817–833 (2017)
Acknowledgements
We thank B. Edmunds, P. Combettes, W. Yin, and the anonymous reviewers for their insightful feedback.
This material is supported by an Alfred P. Sloan research fellowship and NSF DMS award 2047637.
Appendices
Appendix
Convergence of candidate algorithm (1.5)
Lemma A.1
Suppose that \(S :{\mathcal {H}}\rightarrow {\mathcal {H}}\) satisfies conditions (1.3) and (1.4). Let \(x^0 \in {\mathcal {H}}\), let \(\lambda = n\min _i\{\beta _i \}\) and consider the following iteration:
Then for all \(k \in {\mathbb {N}}\), we have
Consequently, after at most
iterations, the point \(x^k\) satisfies \(\mathrm {dist}^2(x^k, {{\,\mathrm{zer}\,}}(S)) \le \varepsilon \).
Proof
For all \(k \in {\mathbb {N}}\) define \({\bar{x}}^k := \mathrm {proj}_{{{\,\mathrm{zer}\,}}(S)}(x^k)\) and observe that
By Jensen’s inequality, we can upper bound the term \(\Vert S(x^k)\Vert ^2\):
where the final inequality follows from (1.3). Using this estimate in (A.1), we find that
where the final inequality follows from (1.4). To conclude the proof, note that
as desired. \(\square \)
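The lemma's iteration can be illustrated numerically. The sketch below runs \(x^{k+1} = x^k - \lambda S(x^k)\) on a hypothetical operator: \(S\) is the average of linear residual operators from a consistent least-squares system, so \({{\,\mathrm{zer}\,}}(S)\) is nonempty. The data and the step size are stand-ins chosen to make this particular example converge; they are not derived from conditions (1.3) and (1.4).

```python
import numpy as np

# Hypothetical instance of the candidate iteration x^{k+1} = x^k - lam * S(x^k):
# S(x) = (1/n) sum_i S_i(x) with S_i(x) = a_i (a_i^T x - b_i), and b = A x_true,
# so the system is consistent and zer(S) = {x : Ax = b} = {x_true} here.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true

def S(x):
    return A.T @ (A @ x - b) / n

# Example-specific step size: this S is a gradient with Lipschitz constant
# ||A||_2^2 / n, hence (n / ||A||_2^2)-cocoercive, so any lam below
# 2n / ||A||_2^2 drives the residual to zero.
lam = n / np.linalg.norm(A, 2) ** 2

x = np.zeros(d)
for k in range(500):
    x = x - lam * S(x)

print(np.linalg.norm(x - x_true))
```

Since \(S\) here is linear with a unique zero, the iteration contracts the error at a linear rate, matching the geometric decrease of \(\mathrm {dist}^2(x^k, {{\,\mathrm{zer}\,}}(S))\) established in the lemma.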
Cite this article
Davis, D. Variance reduction for root-finding problems. Math. Program. 197, 375–410 (2023). https://doi.org/10.1007/s10107-021-01758-4
Keywords
- Stochastic algorithm
- Variance reduction
- Root-finding algorithm
- Operator splitting
- Monotone inclusions
- Saddle-point problems