Abstract
We introduce a new approach to developing stochastic optimization algorithms for a class of stochastic composite and possibly nonconvex optimization problems. The main idea is to combine a variance-reduced estimator and an unbiased stochastic estimator to create a new hybrid estimator that trades off variance against bias and possesses properties useful for developing new algorithms. We first introduce our hybrid estimator and investigate its fundamental properties to form a foundation for algorithmic development. Next, we apply the new estimator to develop several variants of stochastic gradient methods for both expectation and finite-sum composite optimization problems. Our first algorithm can be viewed as a variant of proximal stochastic gradient methods with a single loop and a single sample, yet it achieves the same best-known oracle complexity bound as state-of-the-art double-loop algorithms in the literature. We then consider two further variants of our method: adaptive step-size and restarting schemes, which enjoy theoretical guarantees similar to those of our first algorithm. We also study two mini-batch variants of the proposed methods. In all cases, we achieve the best-known complexity bounds under standard assumptions. We test our algorithms on several numerical examples with real datasets and compare them with many existing methods. Our numerical experiments show that the new algorithms are comparable to, and in many cases outperform, their competitors.
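The hybrid estimator at the heart of the paper combines a SARAH-type recursive (biased, variance-reduced) term with an unbiased SGD-type term through a convex combination controlled by a weight \(\beta \in (0,1)\). The following minimal sketch illustrates that update rule; the function name `hybrid_estimator` and its scalar arguments are our own illustrative choices, not the paper's notation, and the code works equally on NumPy arrays.

```python
def hybrid_estimator(v_prev, u_t, g_xi_t, g_xi_prev, beta):
    """One hybrid update (illustrative): blend a SARAH-type recursive term
    v_prev + (G_xi(x_t) - G_xi(x_{t-1})) with an unbiased estimator u_t.

    beta = 1 recovers the pure recursive (SARAH) estimator;
    beta = 0 recovers the plain unbiased (SGD-type) estimator.
    """
    return beta * (v_prev + (g_xi_t - g_xi_prev)) + (1.0 - beta) * u_t
```

The two extreme values of `beta` recover the two constituent estimators, which is why intermediate values trade off the bias of the recursive term against the variance of the unbiased one.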









References
Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 99, 1–1 (2010)
Agarwal, A., Bottou, L.: A lower bound for the optimization of finite sums. In: International Conference on Machine Learning, pp. 78–86 (2015)
Allen-Zhu, Z.: Katyusha: The first direct acceleration of stochastic gradient methods. Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 1200–1205 (2017). Montreal, Canada
Allen-Zhu, Z.: Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 89–97 (2017)
Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than SGD. In: Advances in neural information processing systems, pp. 2675–2686 (2018)
Allen-Zhu, Z., Li, Y.: NEON2: Finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems, pp. 3720–3730 (2018)
Allen-Zhu, Z., Yuan, Y.: Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In: ICML, pp. 1080–1089 (2016)
Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., Woodworth, B.: Lower bounds for non-convex stochastic optimization. arXiv:1912.02365, (2019)
Bertsekas, D.P.: Incremental proximal methods for large scale convex optimization. Math. Program. 129(2), 163–195 (2011)
Bollapragada, R., Byrd, R., Nocedal, J.: Exact and Inexact Subsampled Newton Methods for Optimization. IMA J. Numer. Anal. 39(2), 545–578 (2019)
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)
Bottou, L.: Online learning and stochastic approximations. In: David, S. (ed.) Online Learning in Neural Networks, pp. 9–42. Cambridge University Press, New York (1998)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)
Carmon, Y., Duchi, J., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Math. Program. 5, 1–50 (2017)
Chambolle, A., Ehrhardt, M.J., Richtárik, P., Schönlieb, C.-B.: Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim. 28(4), 2783–2808 (2018)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011)
Cutkosky, A., Orabona, F.: Momentum-based variance reduction in non-convex SGD. In: Advances in Neural Information Processing Systems, pp. 15210–15219 (2019)
Davis, D., Grimmer, B.: Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM J. Optim. 29(3), 1908–1930 (2019)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654 (2014)
Defazio, A., Caetano, T., Domke, J.: Finito: A faster, permutable incremental gradient method for big data problems. In: International Conference on Machine Learning, pp. 1125–1133 (2014)
Driggs, D., Liang, J., Schönlieb, C.-B.: On the bias-variance tradeoff in stochastic gradient methods. arXiv:1906.01133 (2019)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Erdogdu, M.A., Montanari, A.: Convergence rates of sub-sampled Newton methods. In: Advances in Neural Information Processing Systems, pp. 3052–3060 (2015)
Fang, C., Li, C. J., Lin, Z., Zhang, T.: SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In: Advances in Neural Information Processing Systems, pp. 689–699 (2018)
Fang, C., Lin, Z., Zhang, T.: Sharp Analysis for Nonconvex SGD Escaping from Saddle Points. In: Conference on Learning Theory, pp. 1192–1234 (2019)
Foster, D., Sekhari, A., Shamir, O., Srebro, N., Sridharan, K., Woodworth, B.: The complexity of making the gradient small in stochastic convex optimization. In: Conference on Learning Theory, pp. 1319–1345 (2019)
Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points–online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)
Ge, R., Li, Z., Wang, W., Wang, X.: Stabilized SVRG: Simple variance reduction for nonconvex optimization. In: Conference on Learning Theory, pp. 1394–1448 (2019)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)
Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, vol. 1. MIT Press, Cambridge (2016)
Hanzely, F., Mishchenko, K., Richtárik, P.: SEGA: variance reduction via gradient sketching. In: Advances in Neural Information Processing Systems, pp. 2082–2093 (2018)
Jofré, A., Thompson, P.: On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Program. 174(1–2), 253–292 (2019)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp. 315–323 (2013)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)
Kingma, D.P., Ba, J.: ADAM: A Method for Stochastic Optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR). arXiv:1412.6980 (2014)
Konečný, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10, 242–255 (2016)
Kovalev, D., Horvath, S., Richtarik, P.: Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In: Proceedings of the 31st International Conference on Algorithmic Learning Theory, vol. 117, pp. 1–17 (2020)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Li, Z.: SSRGD: simple stochastic recursive gradient descent for escaping saddle points. In: Advances in Neural Information Processing Systems, pp. 1521–1531 (2019)
Li, Z., Li, J.: A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 5564–5574 (2018)
Lei, L., Ju, C., Chen, J., Jordan, M.I.: Non-convex finite-sum optimization via SCSG methods. In: Advances in Neural Information Processing Systems, pp. 2348–2358 (2017)
Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems, pp. 3384–3392 (2015)
Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)
Metel, M., Takeda, A.: Simple stochastic gradient methods for non-smooth non-convex regularized optimization. In: International Conference on Machine Learning, pp. 4537–4545 (2019)
Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Escaping saddle points in constrained optimization. In: Advances in Neural Information Processing Systems, pp. 3629–3639 (2018)
Moulines, E., Bach, F.R.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, pp. 451–459 (2011)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nemirovskii, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Kluwer Academic Publishers, London (2004)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: ICML (2017)
Nguyen, L.M., Nguyen, N.H., Phan, D.T., Kalagnanam, J.R., Scheinberg, K.: When does stochastic gradient algorithm work well? arXiv:1801.06159 (2018)
Nguyen, L.M., Scheinberg, K., Takáč, M.: Inexact SARAH algorithm for stochastic optimization. Optim. Methods Softw. (online first) (2020)
Nguyen, L.M., van Dijk, M., Phan, D.T., Nguyen, P.H., Weng, T.-W., Kalagnanam, J.R.: Optimal finite-sum smooth non-convex optimization with SARAH. arXiv:1901.07648 (2019)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: Stochastic recursive gradient algorithm for nonconvex optimization. arXiv:1705.07261 (2017)
Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)
Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 1–48 (2020)
Pilanci, M., Wainwright, M.J.: Newton sketch: a linear-time optimization algorithm with linear-quadratic convergence. SIAM J. Optim. 27(1), 205–245 (2017)
Polyak, B., Juditsky, A.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)
Reddi, S.J., Sra, S., Póczos, B., Smola, A.: Stochastic Frank-Wolfe methods for nonconvex optimization. In: 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1244–1251. IEEE (2016)
Reddi, S.J., Sra, S., Póczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods I: globally convergent algorithms. Math. Program. 174, 293–326 (2019)
Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
Tran-Dinh, Q., Liu, D., Nguyen, L.M.: Hybrid variance-reduced SGD algorithms for minimax problems with nonconvex-linear function. In: Proceedings of the Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS) (2020)
Unser, M.: A representer theorem for deep neural networks. J. Mach. Learn. Res. 20, 1–30 (2019)
Wang, M., Fang, E., Liu, L.: Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program. 161(1–2), 419–449 (2017)
Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: SpiderBoost and Momentum: Faster variance reduction algorithms. Proc. of The Thirty-third Conference on Neural Information Processing Systems (NeurIPS) (2019)
Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Stochastic variance-reduced cubic regularization for nonconvex optimization. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2731–2740 (2019)
Woodworth, B.E., Srebro, N.: Tight complexity bounds for optimizing composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 3639–3647 (2016)
Zhang, J., Xiao, L., Zhang, S.: Adaptive stochastic variance reduction for subsampled Newton method with cubic regularization. arXiv:1811.11637 (2018)
Zhao, L., Mammadov, M., Yearwood, J.: From convex to nonconvex: a loss function analysis for binary classification. In: IEEE International Conference on Data Mining Workshops (ICDMW), pp. 1281–1288. IEEE (2010)
Zhao, P., Zhang, T.: Stochastic optimization with importance sampling for regularized loss minimization. In: International Conference on Machine Learning, pp. 1–9 (2015)
Zhou, D., Gu, Q.: Lower bounds for smooth nonconvex finite-sum optimization. In: International Conference on Machine Learning (2019)
Zhou, D., Gu, Q.: Stochastic recursive variance-reduced cubic regularization methods. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) (2020)
Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925–3936. Curran Associates Inc. (2018)
Zhou, K., Shang, F., Cheng, J.: A simple stochastic variance reduced algorithm with fast convergence rates. In: International Conference on Machine Learning, pp. 5975–5984 (2018)
Zhou, Y., Wang, Z., Ji, K., Liang, Y., Tarokh, V.: Momentum schemes with stochastic variance reduction for nonconvex composite optimization. arXiv:1902.02715 (2019)
Acknowledgements
This paper is based upon work partially supported by the National Science Foundation (NSF) grant no. DMS-1619884 and the Office of Naval Research (ONR) grant no. N00014-20-1-2088.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix: Properties of hybrid stochastic estimators
This appendix provides the full proofs of our theoretical results in Sect. 3. We also need the following lemma in the sequel, so we prove it here first.
Lemma 7
Given \(L > 0\), \(\delta > 0\), \(\epsilon > 0\), and \(\omega \in (0, 1)\), let \(\{\gamma _t\}_{t=0}^m\) be a sequence of positive real numbers updated by
for \(t=0,\cdots , m-1\). Then
Proof
First, from (49) it is straightforward to verify that \(0< \gamma _0< \cdots< \gamma _{m-1} = \frac{\delta }{L(1+\epsilon \omega )} < \gamma _m = \frac{\delta }{L}\). At the same time, since \(\omega \in (0, 1)\), we have \(1 \ge \omega \ge \omega ^2 \ge \cdots \ge \omega ^{m}\). By Chebyshev’s sum inequality, we have
From the update (49), we also have
Substituting (51) into (52), we get
Let us define \(\varSigma _m := \sum _{t=0}^m\gamma _t\) and \(S_m := \sum _{t=0}^m\gamma _t^2\). Summing up both sides of the above inequalities, we get
Using again Chebyshev’s sum inequality, we have
Note that \((m+1)S_m \ge \varSigma _m^2\) by the Cauchy–Schwarz inequality, which shows that \(S_m + \varSigma _m^2 \ge \big (\frac{m+2}{m+1}\big )\varSigma _m^2\). Combining the last three inequalities, we obtain the following quadratic inequality in \(\varSigma _m > 0\):
Solving this inequality with respect to \(\varSigma _m > 0\), we obtain
This proves (50). \(\square \)
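The proof of Lemma 7 twice invokes Chebyshev’s sum inequality. Since \(\{\gamma_t\}\) is increasing while \(\{\omega^t\}\) is decreasing, the relevant form is the one for oppositely ordered sequences: \(n\sum_t a_t b_t \le \big(\sum_t a_t\big)\big(\sum_t b_t\big)\). The following numeric illustration is our own; the function name and the sample sequences are illustrative.

```python
def chebyshev_opposite(a, b):
    """Chebyshev's sum inequality for oppositely ordered sequences:
    if a is nondecreasing and b is nonincreasing, then
    n * sum(a_t * b_t) <= (sum a_t) * (sum b_t)."""
    n = len(a)
    return n * sum(x * y for x, y in zip(a, b)) <= sum(a) * sum(b)

# illustrative sequences: gamma_t increasing, omega^t decreasing (omega = 0.5)
gamma = [0.1, 0.2, 0.3, 0.4]
omega_pows = [0.5 ** t for t in range(4)]
assert chebyshev_opposite(gamma, omega_pows)
```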
1.1 The proof of Lemma 3: Error estimate with mini-batch
The proof of the first expression of (19) is the same as in Lemma 1. We only prove the second one. Let \(\varDelta _{\mathcal {B}_t} := \frac{1}{b_t}\sum _{\xi _i\in \mathcal {B}_t}\left[ G_{\xi _i}(x_t) - G_{\xi _i}(x_{t-1})\right] \), \(\varDelta _t := G(x_t) - G(x_{t-1})\), \(\hat{\delta }_t := \hat{v}_t - G(x_t)\), and \(\delta {u}_t := u_t - G(x_t)\). Clearly, we have
Moreover, we can rewrite \(\hat{v}_t\) as
Therefore, using these two expressions, we can derive
Similar to the proof of [59, Lemma 2], for the finite-sum case (i.e., \(\vert \varOmega \vert = n\)), we can show that
For the expectation case, we have
Using the definition of \(\rho \) in Lemma 4, we can unify these two expressions as
Substituting the last expression into the previous one, we obtain the second expression of (19). \(\square \)
1.2 The proof of Lemma 4: Upper bound of mini-batch “variance”
From Lemma 3, taking the expectation with respect to \(\mathcal {F}_{t+1} := \sigma (x_0,\mathcal {B}_0, \hat{\mathcal {B}}_0, \cdots , \mathcal {B}_t, \hat{\mathcal {B}}_t)\), we have
In addition, from [59, Lemma 2], we have \(\mathbb {E}_{\hat{\mathcal {B}}_t}\left[ \Vert u_t - G(x_t)\Vert ^2\right] \le \hat{\rho }\mathbb {E}_{\xi }\left[ \Vert G_{\xi }(x_t) - G(x_t)\Vert ^2\right] = \hat{\rho }\sigma _t^2\), where \(\sigma _t^2 := \mathbb {E}_{\xi }\left[ \Vert G_{\xi }(x_t) - G(x_t)\Vert ^2\right] \).
Let \(A_t^2 := \mathbb {E}\left[ \Vert \hat{v}_t - G(x_t)\Vert ^2\right] \) and \(B_t^2 := \mathbb {E}\left[ \Vert x_{t +1}- x_{t}\Vert ^2\right] \). Then, the above estimate can be upper bounded as
By following the same inductive argument as in the proof of Lemma 2, we obtain from the last inequality that
Using the definition of \(\omega _{t}\), \(\omega _{i, t}\), and \(S_{t}\) from (16), the previous inequality becomes
which is exactly (20) after substituting the definitions of \(A_t\) and \(B_t\) above. \(\square \)
The proof of technical results in Sect. 4: The single sample case
We provide the full proof of the technical results in Sect. 4.
1.1 The proof of Lemma 5: Key estimate
From the update \(x_{t+1} := (1-\gamma _t)x_t + \gamma _t\widehat{x}_{t+1}\) at Step 8 of Algorithm 1, we have \(x_{t+1} - x_t = \gamma _t(\widehat{x}_{t+1} - x_t)\). From the L-average smoothness condition in Assumption 2, one can write
Using convexity of \(\psi \), we can show that
where \(\nabla \psi (\widehat{x}_{t+1}) \in \partial \psi (\widehat{x}_{t+1})\) is any subgradient of \(\psi \) at \(\widehat{x}_{t+1}\).
Utilizing the optimality condition of \(\widehat{x}_{t+1} = \mathrm {prox}_{\eta _t \psi }(x_{t} - \eta _t v_t)\), we can show that \(\nabla \psi (\widehat{x}_{t+1}) = -v_t - \frac{1}{\eta _t}(\widehat{x}_{t+1} - x_t)\) for some \(\nabla \psi (\widehat{x}_{t+1}) \in \partial \psi (\widehat{x}_{t+1})\). Substituting this relation into (54), we get
Combining (53) and (55), and using \(F(x) := f(x) + \psi (x)\) from (1), we obtain
For any \(c_t > 0\), we can always write
Utilizing this expression, we can rewrite the last estimate as
where \(\tilde{\sigma }_t^2 := \frac{\gamma _t}{c_t}\Vert \nabla {f}(x_t) - v_t - c_t(\widehat{x}_{t+1} - x_t)\Vert ^2 \ge 0\).
Taking the expectation of both sides of this inequality over the entire history \(\mathcal {F}_{t+1}\), we obtain
Next, from the definition of gradient mapping \(\mathcal {G}_{\eta }(x):= \frac{1}{\eta }\left( x - \mathrm {prox}_{\eta \psi }(x - \eta \nabla f(x))\right) \) in (9), we can see that
Using this expression, the triangle inequality, and the nonexpansive property \(\Vert \mathrm {prox}_{\eta \psi }(z) - \mathrm {prox}_{\eta \psi }(w)\Vert \le \Vert z - w\Vert \) of \(\mathrm {prox}_{\eta \psi }\), we can derive that
Now, for any \(r_t > 0\), the last estimate leads to
Multiplying this inequality by \(\frac{q_t}{2} > 0\) and adding the result to (57), we finally get
Using the definition of \(\theta _t\) and \(\kappa _t\) from (22), i.e.:
we can simplify the last estimate as follows:
which is exactly (21). \(\square \)
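The proof above rests on three ingredients of Algorithm 1: the proximal step \(\widehat{x}_{t+1} = \mathrm{prox}_{\eta_t\psi}(x_t - \eta_t v_t)\), the averaging step \(x_{t+1} = (1-\gamma_t)x_t + \gamma_t\widehat{x}_{t+1}\), and the gradient mapping \(\mathcal{G}_{\eta}\) from (9). A minimal sketch of these pieces, assuming for concreteness \(\psi(x) = \lambda\Vert x\Vert_1\) so that the proximal operator is soft-thresholding; all function names here are illustrative, not the paper's.

```python
import numpy as np

def prox_l1(z, t):
    """prox of psi(x) = lam*||x||_1 with parameter t = eta*lam
    (soft-thresholding, applied elementwise)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def one_step(x, v, eta, gamma, lam):
    """One iteration of the scheme analyzed above:
    x_hat = prox_{eta*psi}(x - eta*v); x_next = (1-gamma)*x + gamma*x_hat."""
    x_hat = prox_l1(x - eta * v, eta * lam)
    return (1.0 - gamma) * x + gamma * x_hat

def gradient_mapping(x, grad, eta, lam):
    """G_eta(x) = (x - prox_{eta*psi}(x - eta*grad))/eta, as in (9)."""
    return (x - prox_l1(x - eta * grad, eta * lam)) / eta
```

The nonexpansiveness \(\Vert \mathrm{prox}_{\eta\psi}(z) - \mathrm{prox}_{\eta\psi}(w)\Vert \le \Vert z - w\Vert\) used in the proof can be checked numerically on this instance.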
1.2 The proof of Lemma 6: Key estimate of Lyapunov function
From (14), by taking the full expectation on the history \(\mathcal {F}_{t+1}\) and using the L-average smoothness of f, we can show that
where \(\sigma _t^2 := \mathbb {E}\left[ \Vert \nabla {f}_{\zeta _t}(x_t) - \nabla {f}(x_t)\Vert ^2\right] \).
Let V be the Lyapunov function defined by (23). Then, by multiplying the last inequality by \(\frac{\alpha _{t+1}}{2} > 0\), adding the result to (21), and then using this Lyapunov function we can show that
Let us choose \(\gamma _t\), \(\eta _t\), and other parameters such that the conditions (24) hold, i.e.:
In this case, (58) can be simplified as follows:
which proves (25).
Finally, summing up this inequality from \(t := 0\) to \(t := m\), we obtain
Note that \(V(x_{m+1}) := \mathbb {E}\left[ F(x_{m+1})\right] + \frac{\alpha _{m+1}}{2}\mathbb {E}\left[ \Vert v_{m+1} - \nabla {f}(x_{m+1})\Vert ^2\right] \ge \mathbb {E}\left[ F(x_{m+1})\right] \ge F^{\star }\) by Assumption 1 and \(V(x_0) = F(x_0) + \frac{\alpha _0}{2}\mathbb {E}\left[ \Vert v_0 - \nabla {f}(x_0)\Vert ^2\right] \). Substituting these estimates into the last inequality, we obtain the key estimate (26). \(\square \)
1.3 The proof of Theorem 2: The adaptive step-size case
Let \(\{(x_t, \hat{x}_{t})\}\) be generated by Algorithm 1. Let us again choose \(c_t := L\), \(r_t := 1\) and \(q_t := \frac{L\gamma _t}{2}\) and fix \(\eta _t := \eta \in (0, \frac{1}{L})\) in Lemma 2 as done in Theorem 1. Then, from (22), we have
Substituting these parameters into (21), summing the result from \(t := 0\) to \(t := m\), and then using (15) from Lemma 2, we obtain
where \(\bar{\sigma }^2 := \mathbb {E}\left[ \Vert v_0 - \nabla f(x_0)\Vert ^2\right] \ge 0\), \(\tilde{\sigma }_t^2 := \dfrac{\gamma _t}{2}\Vert \nabla f(x_t) - v_t - L(\hat{x}_{t+1} - x_t)\Vert ^2 \ge 0\), and \(\omega _{i,t}\), \(\omega _t\), and \(S_t\) are defined in Lemma 2.
By ignoring the nonnegative term \(\mathbb {E}\left[ \tilde{\sigma }_t^2\right] \), and using the expression of \(\theta _t\) and \(\kappa _t\) above, we can further estimate the last inequality as follows:
where \(\mathcal {T}_m\) is defined as follows:
Now, with the choice of \(\beta _t = \beta := 1- \frac{1}{\sqrt{\tilde{b}(m+1)}} \in (0, 1)\), we can easily show that \(\omega _t = \beta ^{2t}\), \(\omega _{i,t} = \beta ^{2(t-i)}\), and \(s_t := \sum _{i=0}^{t-1}\big (\prod _{j=i+2}^{t}\beta _{j-1}^2\big )(1-\beta _i)^2 = (1-\beta )^2\Big [\frac{1-\beta ^{2t}}{1-\beta ^2}\Big ] < \frac{1-\beta }{1+\beta }\) due to Lemma 2.
Let \(w_i^2 := \mathbb {E}\left[ \Vert \widehat{x}_{i+1} - x_i\Vert ^2\right] \). To bound the quantity \(\mathcal {T}_m\) defined by (60), we note that
Using \(\delta := \frac{2}{\eta } - 2L\), we can write \(\mathcal {T}_m\) from (60) as
To guarantee \(\mathcal {T}_m \le 0\), from the last expression of \(\mathcal {T}_m\), we can impose the following condition:
It is straightforward to show that condition (61) leads to the following update of \(\gamma _t\):
which is exactly (33).
(a) Since \(\beta = 1 - \frac{1}{[\tilde{b}(m+1)]^{1/2}}\), we have
Moreover, since \(\eta \in (0, \frac{1}{L})\), with \(\epsilon := \frac{1+L^2\eta ^2}{L}\), \(\delta := \frac{2}{\eta }-2L\), and \(\omega := \beta ^2 \in (0,1)\), using the last inequalities, we can easily show that
Using (62), \(\sqrt{1-\omega } = \sqrt{1-\beta ^2} \ge \frac{1}{(\tilde{b}(m+1))^{1/4}}\), and \(\epsilon = \frac{1+L^2\eta ^2}{L}\) into (50) of Lemma 7, we can derive
Next, since \(\omega _t = \beta ^{2t}\), by Chebyshev’s sum inequality, we have
Substituting this estimate, \(\bar{\sigma }^2 := \mathbb {E}\left[ \Vert v_0 - \nabla {f}(x_0)\Vert ^2\right] \le \frac{\sigma ^2}{\tilde{b}}\), and \(S_t \le \sigma ^2 s_t \le \frac{(1-\beta )\sigma ^2}{1+\beta }\) into (59), and noting that \(\mathcal {T}_m \le 0\), we can further upper bound it as
By Assumption 1, we have \(\mathbb {E}\left[ F(x_{m+1})\right] \ge F^{\star }\). Substituting this bound into the last estimate and then multiplying the result by \(\frac{4}{L\eta ^2\varSigma _m}\) we obtain
Since \(\beta = 1- \frac{1}{\tilde{b}^{1/2}(m+1)^{1/2}}\), we have \(\frac{1}{\tilde{b}(m+1)(1-\beta )} + (1-\beta ) = \frac{2}{\tilde{b}^{1/2}(m+1)^{1/2}}\). Utilizing this expression, (63), \(1+\eta ^2L^2 \le 2\), and \(\beta \in [0, 1]\), we can further upper bound the last estimate as
In addition, due to the choice of \(\overline{x}_m \sim \mathbb {U}_{\mathbf {p}}\left( \{x_t\}_{t=0}^m\right) \), we have \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] = \displaystyle \frac{1}{\varSigma _m}\displaystyle \sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] \). Combining this expression and (64), we obtain (34).
(b) Let us choose \(\tilde{b} := \lceil c_1^2(m+1)^{1/3} \rceil \) for some constant \(c_1 > 0\). Since \(\beta = 1 - \frac{1}{[\tilde{b}(m+1)]^{1/2}}\), to guarantee \(\beta \ge 0\), we need to impose \(c_1 \ge \frac{1}{(m+1)^{2/3}}\). With this choice of \(\tilde{b}\), (34) reduces to
Let us denote by \(\varDelta _0 := \frac{8}{L^2\eta ^2}\left[ \frac{\sqrt{2}L( L + \sqrt{c_1L\delta })}{\delta }\big [F(x_0) - F^{\star }\big ] + \frac{\sigma ^2}{c_1}\right] \). Then, similar to the proof of Theorem 1, we can show that the number of iterations m is at most \(m := \left\lceil \frac{\varDelta _0^{3/2}}{\varepsilon ^3} \right\rceil \), and the total number \(\mathcal {T}_m\) of stochastic gradient evaluations \(\nabla {f}_{\xi }(x_t)\) is at most \(\mathcal {T}_m := \left\lceil \frac{c_1^2\varDelta _0^{1/2}}{\varepsilon } + \frac{3\varDelta _0^{3/2}}{\varepsilon ^3}\right\rceil \). \(\square \)
1.4 The proof of Theorem 3: The restarting variant
(a) Since we use the adaptive variant of Algorithm 1 as stated in Theorem 2 for the inner loop of Algorithm 2, from (64), we can see that at each stage s, the following estimate holds
where we use the superscript “\(^{(s)}\)” to indicate the stage s in Algorithm 2. Summing up this inequality from \(s := 1\) to \(s := S\), and then multiplying the result by \(\frac{1}{S}\) and using \(\mathbb {E}\left[ F(x_{m+1}^{(S)})\right] \ge F^{\star } > -\infty \), and \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right] = \dfrac{1}{S\varSigma _m}\displaystyle \sum _{s=1}^S\sum _{t=0}^m\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t^{(s)})\Vert ^2\right] \), we get (35), i.e.:
(b) Let \(\varDelta _F := F(\overline{x}^{(0)}) - F^{\star } > 0\) and choose \(\tilde{b} := c_1^2(m + 1)\) for some constant \(c_1 > 0\). Since \(\beta = 1 - \frac{1}{[\tilde{b}(m+1)]^{1/2}} \in (0, 1)\), we need to choose \(c_1\) such that \(c_1 \ge \frac{1}{m+1}\).
Now, for any tolerance \(\varepsilon > 0\), to guarantee \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right] \le \varepsilon ^2\), from (66), we require
Let us break this inequality into two equal parts as
Then, we have
Let us choose \(m+1 = \left\lceil \frac{16}{L^2\eta ^2c_1}\cdot \frac{\max \left\{ 1,\sigma ^2\right\} }{\varepsilon ^2} \right\rceil \). Then, \(m + 1 \ge \frac{16}{L^2\eta ^2c_1\varepsilon ^2}\), and we can set
This leads to (36). Moreover, we can also show that
Consequently, the total number of stochastic gradient evaluations \(\nabla {f}_{\xi }(x_t)\) is at most
Since we choose \(\tilde{b} := \left\lceil \frac{16c_1}{L^2\eta ^2}\cdot \frac{\max \left\{ 1,\sigma ^2\right\} }{\varepsilon ^2} \right\rceil \), the final complexity is \(\mathcal {O}\left( \frac{\max \left\{ 1,\sigma ^2\right\} }{\varepsilon ^2} + \frac{\max \left\{ 1,\sigma \right\} }{\varepsilon ^3}\right) \), where other constants independent of \(\sigma \) and \(\varepsilon \) are hidden. The total number of evaluations of the proximal operator \(\mathrm {prox}_{\eta \psi }\) is at most
The estimate (37) follows from the bound of \(\mathcal {T}_{\nabla {f}}\) above and the choice of \(\tilde{b}\). \(\square \)
The proof of technical results in Sect. 5: The mini-batch case
This appendix presents the full proofs of the results in Sect. 5 for the mini-batch case.
1.1 The proof of Theorem 4: The single-loop variant
Using (19) from Lemma 3 with \(G := \nabla {f}\), taking full expectation, and using a constant weight \(\beta _t := \beta \in (0, 1)\) and \(b_t := b \in \mathbb {N}_{+}\), we have
where \(\rho := \frac{1}{b}\) since we solve (1).
Since \(\mathbb {E}\left[ \Vert \nabla {f}_{\xi }(x_{t+1}) - \nabla {f}_{\xi }(x_t)\Vert ^2\right] \le L^2\mathbb {E}\left[ \Vert x_{t+1} - x_t\Vert ^2\right] \le L^2\gamma _{t}^2\mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] \) by Assumption 2 and \(\mathbb {E}\left[ \Vert u_{t+1} - \nabla {f}(x_{t+1})\Vert ^2\right] \le \frac{\sigma ^2}{\hat{b}}\) by Assumption 3 and [59, Lemma 2], the last estimate leads to
Next, let us choose \(\eta _t := \eta > 0\), \(\gamma _t := \gamma > 0\), \(c_t := L\), \(r_t := 1\), and \(q_t := \frac{L\gamma }{2} > 0\) in Lemma 2. Then, we have \(\theta _t = \theta = \frac{(1 + L^2\eta ^2)\gamma }{L} > 0\) and \(\kappa _t = \kappa = \left( \frac{2}{\eta } - L\gamma - 2L\right) \gamma > 0\). Substituting these values into (21), we obtain
Multiplying (67) by \(\frac{\alpha }{2}\) for some \(\alpha > 0\), and adding the result to the above estimate, we obtain
Using the Lyapunov function V defined by (23), the last estimate leads to
If we impose the following conditions
then we get from the last inequality that
The conditions (68) can be simplified as
Moreover, by induction, \(V(x_{m+1}) \ge F^{\star }\), and \(V(x_0) := F(x_0) + \frac{\alpha }{2}\mathbb {E}\left[ \Vert \hat{v}_0 - \nabla {f}(x_0)\Vert ^2\right] \le F(x_0) + \frac{\alpha \sigma ^2}{2\tilde{b}}\), we can further derive from (69) that
By minimizing the last term on the right-hand side of (71) w.r.t. \(\beta \in [0, 1]\), we get \(\beta := 1 - \frac{\hat{b}^{1/2}}{[\tilde{b}(m+1)]^{1/2}}\). Clearly, with this choice of \(\beta \) if \(1 \le \hat{b} \le \tilde{b}(m+1)\), then \(\beta \in [0, 1)\).
(a) Next, we update \(\eta := \frac{2}{L(3 + \gamma )}\). Then, since \(\gamma \in [0, 1]\) we have \(\frac{1}{2L} \le \eta \le \frac{2}{3L}\). Moreover, we have \(\frac{2}{\eta } - 2L - L\gamma = L\) and \(\frac{1 + L^2\eta ^2}{L} \le \frac{13}{9L}\). In addition, since \(\beta \in [0, 1)\) we have \(1-\beta ^2 \ge 1 - \beta = \frac{\hat{b}^{1/2}}{[\tilde{b}(m+1)]^{1/2}}\). Consequently, the second condition of (70) holds if we choose \(\gamma \) as
Since \(\beta \in [0, 1]\), the first condition of (70) holds if we choose \(0 < \gamma \le \bar{\gamma } := \frac{b}{L\alpha }\). Combining both conditions on \(\gamma \), we get \(\frac{b}{L\alpha } = \frac{9L\alpha \hat{b}^{1/2}}{13\tilde{b}^{1/2}(m+1)^{1/2}}\), leading to \(\alpha := \frac{\sqrt{13}\tilde{b}^{1/4}b^{1/2}(m+1)^{1/4}}{3L\hat{b}^{1/4}}\). Therefore, we can update \(\gamma \) as
for some \(c_0 > 0\). Since \(1 \le \hat{b} \le \tilde{b}(m+1)\), we have \(\gamma \le \frac{3c_0b^{1/2}}{\sqrt{13}}\). If we choose \(0 < c_0 \le \frac{\sqrt{13}}{3b^{1/2}}\), then \(\gamma \in (0, 1]\). Consequently, we obtain (41).
(b) Now, we note that the choice of \(\alpha \) and \(\gamma \) also implies that
In addition, since \(\overline{x}_m\sim \mathbb {U}\left( \left\{ x_t\right\} _{t=0}^m\right) \), we have \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] = \frac{1}{m+1}\sum _{t=0}^m\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] \). Using these expressions and \(L^2\eta ^2 \ge \frac{1}{4}\) into (71), we finally get
which proves (42).
Let us choose \(b = \hat{b} \in \mathbb {N}_{+}\) and \(\tilde{b} := \lceil c_1^2[b(m+1)]^{1/3} \rceil \) for some \(c_1 > 0\). Then (42) reduces to
Denote \(\varDelta _0 := \frac{16}{3}\left[ \frac{\sqrt{13 c_1}L}{c_0} \left[ F(x_0) - F^{\star }\right] + \frac{13\sigma ^2}{3c_1}\right] \). For any tolerance \(\varepsilon > 0\), to guarantee \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] \le \varepsilon ^2\), we need to impose \(\frac{\varDelta _0}{[b(m+1)]^{2/3}} \le \varepsilon ^2\). This implies \(b(m+1) \ge \frac{\varDelta _0^{3/2}}{\varepsilon ^3}\), which also leads to \(m+1 \ge \frac{\varDelta _0^{3/2}}{b\varepsilon ^3}\). Therefore, the number of iterations is at most \(m := \left\lceil \frac{\varDelta _0^{3/2}}{b\varepsilon ^3}\right\rceil \). This is also the number of proximal operations \(\mathrm {prox}_{\eta \psi }\).
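The rounding step above can be checked numerically. The sketch below uses hypothetical values \(\varDelta _0 = 5\), \(b = 4\), \(\varepsilon = 0.1\) (chosen only for illustration) and verifies that \(m + 1 := \lceil \varDelta _0^{3/2}/(b\varepsilon ^3)\rceil \) indeed forces \(\frac{\varDelta _0}{[b(m+1)]^{2/3}} \le \varepsilon ^2\):

```python
import math

# Hypothetical values, for illustration only; any positive choice works.
Delta0, b, eps = 5.0, 4, 0.1

# Iteration count from the proof: m + 1 = ceil(Delta0^{3/2} / (b * eps^3)).
m_plus_1 = math.ceil(Delta0**1.5 / (b * eps**3))

# Then b*(m+1) >= Delta0^{3/2}/eps^3, hence Delta0 / [b(m+1)]^{2/3} <= eps^2.
assert Delta0 / (b * m_plus_1)**(2.0 / 3.0) <= eps**2 + 1e-12
```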
The number of stochastic gradient evaluations \(\nabla {f_{\xi }}(x_t)\) is at most \(\mathcal {T}_m := \tilde{b} + 3(m+1)b = \left\lceil \frac{c_1^2\varDelta _0^{1/2}}{\varepsilon } + \frac{3\varDelta _0^{3/2}}{\varepsilon ^3} \right\rceil \). Finally, since \(1 \le b = \hat{b} \le \tilde{b}(m+1) = c_1^2b^{1/3}(m+1)^{4/3}\), we have \(b \le c_1^3(m+1)^2\), which is equivalent to \(c_1\ge \frac{b^{1/3}}{(m+1)^{2/3}}\). In addition, since \(\tilde{b} := \lceil c_1^2[b(m+1)]^{1/3} \rceil \) and \(b=\hat{b}\), we have \(\gamma := \frac{3c_0b^{2/3}}{\sqrt{13c_1}(m+1)^{1/3}}\). \(\square \)
1.2 The proof of Theorem 5: The restarting mini-batch variant
(a) Similar to the proof of Theorem 2, summing up (21) from \(t := 0\) to \(t := m\) and using (20) with \(\rho := \frac{1}{b}\) and \(\hat{\rho } := \frac{1}{\hat{b}}\) from Lemma 4, we obtain
where \(\gamma _t\), \(\eta \), \(\kappa _t\), \(\theta _t\), \(\omega _{i,t}\), \(\omega _t\), and \(S_t\) are defined in Lemma 2.
Let us fix \(c_t := L\), \(r_t := 1\), \(q_t := \frac{L\gamma _t}{2}\), and \(\beta _t := \beta \in [0, 1]\). Then \(\theta _t = \frac{(1 + L^2\eta ^2)}{L}\gamma _t\) and \(\kappa _t = \gamma _t\left( \frac{2}{\eta } - 2L - L\gamma _t\right) \) as before. Moreover, \(\omega _t = \beta ^{2t}\), \(\omega _{i,t} = \beta ^{2(t-i)}\), and \(S_t = (1-\beta )^2\Big [\frac{1-\beta ^{2t}}{1-\beta ^2}\Big ] < \frac{1-\beta }{1+\beta }\) due to Lemma 2, and \(\mathbb {E}\left[ \Vert \hat{v}_0^{(s)} - \nabla f(x_0^{(s)})\Vert ^2\right] \le \frac{\sigma ^2}{\tilde{b}}\).
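The geometric-sum bound \(S_t < \frac{1-\beta }{1+\beta }\) used here is easy to confirm numerically; the sketch below uses the hypothetical values \(\beta = 0.9\) and \(t = 50\) (illustrative only):

```python
# Hypothetical values, for illustration only.
beta, t = 0.9, 50

# S_t = (1 - beta)^2 * (1 - beta^{2t}) / (1 - beta^2).
S_t = (1.0 - beta)**2 * (1.0 - beta**(2 * t)) / (1.0 - beta**2)

# Since (1 - beta)^2 / (1 - beta^2) = (1 - beta)/(1 + beta) and 1 - beta^{2t} < 1,
# the bound is strict for any t >= 1.
assert S_t < (1.0 - beta) / (1.0 + beta)
```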
Using this configuration and noting that \(\overline{x}^{(s)} = x_{m+1}^{(s)}\) and \(\overline{x}^{(s-1)} = x^{(s)}_0\), following the same argument as in (64), (72) reduces to
where \(\widehat{\mathcal {T}}_m\) is defined as follows:
Similar to the proof of (33), if we choose \(\eta \in (0, \frac{1}{L})\), set \(\delta := \frac{2}{\eta } - 2L > 0\), and update \(\gamma \) as in (43):
then \(\widehat{\mathcal {T}}_m \le 0\). Moreover, since \(\beta \in [0, 1]\) and \(1+L^2\eta ^2 \le 2\), (73) can be simplified as
Summing up this inequality from \(s := 1\) to \(s := S\) and noting that \(F(\overline{x}^{(S)}) \ge F^{\star }\), we obtain
Let us first choose \(\beta := 1 - \frac{\hat{b}^{1/2}}{\tilde{b}^{1/2}(m+1)^{1/2}}\). Then, \(1 - \beta ^2 \le \frac{2\hat{b}^{1/2}}{\tilde{b}^{1/2}(m+1)^{1/2}}\) and \(\frac{(1+L^2\eta ^2)\beta ^2}{L} \le \frac{2}{L}\). Using these inequalities, similar to the proof of (62), we can upper bound
Using this bound, the update rule (43) of \(\gamma _t\), and \(\sqrt{1-\beta ^2} \ge \frac{\hat{b}^{1/4}}{[\tilde{b}(m+1)]^{1/4}}\), we apply Lemma 7 with \(\omega := \beta ^2\) and \(\epsilon := \frac{(1 + L^2\eta ^2)}{Lb}\) to obtain
Substituting this bound into (75) and noting that \(\overline{x}_T \sim \mathbb {U}_{\mathbf {p}}\left( \{x^{(s)}_t\}_{t=0\rightarrow m}^{s=1\rightarrow S}\right) \), we can upper bound it as
which is exactly (44).
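The two elementary inequalities used for this choice of \(\beta \) in the derivation above can also be confirmed numerically; the sketch below uses hypothetical batch sizes \(\hat{b} = 4\), \(\tilde{b} = 64\), and \(m = 99\) (illustrative only):

```python
# Hypothetical batch sizes and epoch length, for illustration only.
b_hat, b_tilde, m = 4, 64, 99

r = (b_hat / (b_tilde * (m + 1))) ** 0.5   # r = hat{b}^{1/2} / [tilde{b}(m+1)]^{1/2}
beta = 1.0 - r
assert 0.0 <= beta < 1.0

# 1 - beta^2 = r*(2 - r) <= 2r, i.e. 1 - beta^2 <= 2*hat{b}^{1/2}/[tilde{b}(m+1)]^{1/2}.
assert 1.0 - beta**2 <= 2.0 * r + 1e-12

# sqrt(1 - beta^2) >= sqrt(1 - beta) = hat{b}^{1/4}/[tilde{b}(m+1)]^{1/4},
# since 1 - beta^2 >= 1 - beta for beta in [0, 1].
assert (1.0 - beta**2) ** 0.5 >= r ** 0.5 - 1e-12
```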
(b) Now, let us choose \(\hat{b} = b \in \mathbb {N}_{+}\) and set \(\tilde{b} := \lceil c_1^2b(m+1) \rceil \) for some \(c_1 > 0\). In this case, the right-hand side of (44) can be upper bounded as
where \(\varDelta _F := F(\overline{x}^{(0)}) - F^{\star } > 0\).
For any \(\varepsilon > 0\), to guarantee \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right] \le \varepsilon ^2\), we impose \(\mathcal {R}_T \le \varepsilon ^2\). From the upper bound of \(\mathcal {R}_T\), we can split this condition into the following three parts:
Let us choose \(m + 1 := \big \lceil \frac{24}{c_1L^2\eta ^2 b\varepsilon ^2}\max \left\{ \sigma ^2,1\right\} \big \rceil \). Then, \(\tilde{b} = \big \lceil \frac{24c_1}{L^2\eta ^2 \varepsilon ^2}\max \left\{ \sigma ^2,1\right\} \big \rceil \). Moreover, the last condition of (76) holds and \(m+1 \ge \frac{24}{c_1L^2\eta ^2 b\varepsilon ^2}\). Hence, the second condition of (76) leads to
From the second condition of (76), we also have
From this expression, to guarantee the first condition of (76), we need to impose
which leads to \(1\le b \le \frac{2\sqrt{6\delta }}{L\sqrt{L}\eta \varepsilon }\).
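For completeness, and ignoring the inner rounding in \(m+1\), the expression for \(\tilde{b}\) stated above follows by substituting the choice of \(m+1\) into \(\tilde{b} := \lceil c_1^2 b(m+1)\rceil \):

```latex
\tilde{b} \;=\; \big\lceil c_1^2\, b\,(m+1) \big\rceil
\;\approx\; \Big\lceil c_1^2\, b \cdot \frac{24}{c_1 L^2\eta^2 b\,\varepsilon^2}\max\{\sigma^2, 1\} \Big\rceil
\;=\; \Big\lceil \frac{24\, c_1}{L^2\eta^2\varepsilon^2}\max\{\sigma^2, 1\} \Big\rceil .
```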
Finally, the total number of stochastic gradient evaluations \(\nabla {f}_{\xi }(x_t)\) is at most
The total number of proximal operations \(\mathrm {prox}_{\eta \psi }\) is at most \(\mathcal {T}_{\mathrm {prox}} = (m+1)S = \left\lceil \frac{96\sqrt{3}\varDelta _F}{\eta ^3L\sqrt{L\delta } b\varepsilon ^3}\max \left\{ 1,\sigma \right\} \right\rceil \). \(\square \)
Tran-Dinh, Q., Pham, N.H., Phan, D.T. et al. A hybrid stochastic optimization framework for composite nonconvex optimization. Math. Program. 191, 1005–1071 (2022). https://doi.org/10.1007/s10107-020-01583-1
Keywords
- Hybrid stochastic estimator
- Stochastic optimization algorithm
- Oracle complexity
- Variance reduction
- Composite nonconvex optimization