Abstract
The importance of the step size in stochastic optimization has been confirmed both theoretically and empirically over the past few decades, and it has been reconsidered in recent years, especially for large-scale learning. Different rules for selecting the step size have been discussed since the advent of stochastic approximation methods. The first part of this work reviews several representative techniques for setting the step size, covering heuristic rules, meta-learning procedures, adaptive step size techniques, and line search techniques. The second part proposes a novel class of accelerated stochastic optimization methods that resort to the Barzilai–Borwein (BB) technique with a diagonal selection rule for the metric, termed DBB. We first explore the theoretical and empirical properties of variance-reduced stochastic optimization algorithms with DBB; in particular, we study the theoretical and numerical properties of the resulting method in the strongly convex and non-convex cases, respectively. To further demonstrate the efficacy of the DBB step size schedule, we extend it to more general stochastic optimization methods, whose theoretical and empirical properties are also developed under different settings. Extensive numerical results on machine learning problems are offered, suggesting that the proposed algorithms show much promise.
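As background for the BB technique underlying DBB, the scalar BB step size and a diagonal per-coordinate analogue can be sketched as follows. This is a minimal illustration in Python: the function names, the safeguard `eps`, and the exact diagonal rule are assumptions for exposition, not the paper's precise construction.

```python
import numpy as np

def bb_step_size(s, y):
    # Classical BB1 step size: ||s||^2 / <s, y>,
    # with s = w_k - w_{k-1} and y = g_k - g_{k-1}.
    return float(s @ s) / float(s @ y)

def diagonal_bb(s, y, eps=1e-8):
    # Illustrative diagonal (per-coordinate) BB rule: coordinate i gets
    # step s_i^2 / (s_i * y_i), safeguarded against (near-)zero or
    # negative local curvature estimates.
    return s**2 / np.maximum(s * y, eps)
```

The diagonal variant yields a per-coordinate metric, which is the idea behind using a diagonal matrix \(U_s\) in place of a single scalar step size.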
Notes
The code is available at https://github.com/Zane4YZ/STCO.
rcv1, news20, covtype and ijcnn1 can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
sido0 and cina0 can be downloaded from http://www.causality.inf.ethz.ch/challenge.php?page=datasets.
CIFAR-10 can be accessed from http://www.cs.toronto.edu/~kriz/cifar.html.
MNIST can be downloaded from http://yann.lecun.com/exdb/mnist/.
References
Antoniadis, A., Gijbels, I., Nikolova, M.: Penalized likelihood regression for generalized linear models with non-quadratic penalties. Ann. Inst. Stat. Math. 63, 585–615 (2011)
Asi, H., Duchi, J.C.: Stochastic (approximate) proximal point methods: convergence, optimality, and adaptivity. SIAM J. Optim. 29, 2257–2290 (2019)
Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014)
Baker, J., Fearnhead, P., Fox, E.B., Nemeth, C.: Control variates for stochastic gradient MCMC. Stat. Comput. 29, 599–615 (2019)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8, 141–148 (1988)
Baydin, A.G., Cornish, R., Rubio, D.M., Schmidt, M., Wood, F.: Online learning rate adaptation with hypergradient descent. In: International Conference on Learning Representations (2018)
Benveniste, A., Métivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations, vol. 22. Springer, Berlin (2012)
Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., Anandkumar, A.: signsgd: compressed optimisation for non-convex problems. In: International Conference on Machine Learning, pp. 560–569 (2018)
Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, Berlin (2009)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26, 1008–1031 (2016)
Chowdhury, A.K.R., Chellappa, R.: Stochastic approximation and rate-distortion analysis for robust structure and motion estimation. Int. J. Comput. Vis. 55, 27–53 (2003)
Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 1647–1655 (2011)
Crisci, S., Porta, F., Ruggiero, V., Zanni, L.: Spectral properties of Barzilai–Borwein rules in solving singly linearly constrained optimization problems subject to lower and upper bounds. SIAM J. Optim. 30, 1300–1326 (2020)
Csiba, D., Qu, Z., Richtárik, P.: Stochastic dual coordinate ascent with adaptive probabilities. In: International Conference on Machine Learning, pp. 674–683 (2015)
Delyon, B., Juditsky, A.: Accelerated stochastic approximation. SIAM J. Optim. 3, 868–881 (1993)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Ekblom, J., Blomvall, J.: Importance sampling in stochastic optimization: an application to intertemporal portfolio choice. Eur. J. Oper. Res. 285, 106–119 (2020)
Ermoliev, Y.: Stochastic quasigradient methods. In: Numerical Techniques for Stochastic Optimization, pp. 141–185. Springer (1988)
Fang, C., Li, C.J., Lin, Z., Zhang, T.: SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 687–697 (2018)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23, 2341–2368 (2013)
Huang, Y., Liu, H.: Smoothing projected Barzilai–Borwein method for constrained non-lipschitz optimization. Comput. Optim. Appl. 65, 671–698 (2016)
Jacobs, R.A.: Increased rates of convergence through learning rate adaptation. Neural Netw. 1, 295–307 (1988)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
Karimireddy, S.P., Rebjock, Q., Stich, S., Jaggi, M.: Error feedback fixes signsgd and other gradient compression schemes. In: International Conference on Machine Learning, pp. 3252–3261 (2019)
Kesten, H.: Accelerated stochastic approximation. Ann. Math. Stat. 29(1), 41–59 (1958)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
Klein, S., Pluim, J.P., Staring, M., Viergever, M.A.: Adaptive stochastic gradient descent optimisation for image registration. Int. J. Comput. Vis. 81, 227 (2009)
Konečnỳ, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10, 242–255 (2016)
Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and katyusha are better without the outer loop. In: Proceedings of the 31st International Conference on Algorithmic Learning Theory, PMLR, vol. 117, pp. 451–467 (2020)
Lei, Y., Tang, K.: Learning rates for stochastic gradient descent with nonconvex objectives. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4505–4511 (2021)
Liang, J., Xu, Y., Bao, C., Quan, Y., Ji, H.: Barzilai–Borwein-based adaptive learning rate for deep learning. Pattern Recogn. Lett. 128, 197–203 (2019)
Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic polyak step-size for sgd: an adaptive learning rate for fast convergence. In: International Conference on Artificial Intelligence and Statistics, pp. 1306–1314. PMLR (2021)
Mahsereci, M., Hennig, P.: Probabilistic line searches for stochastic optimization. J. Mach. Learn. Res. 18, 4262–4320 (2017)
Mokhtari, A., Ribeiro, A.: Stochastic quasi-Newton methods. In: Proceedings of the IEEE (2020)
Nesterov, Y.: Introductory Lectures on Convex Optimization: Basic Course. Kluwer Academic, Dordrecht (2004)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning-Volume 70, pp. 2613–2621 (2017)
Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22, 9397–9440 (2021)
Nitanda, A.: Stochastic proximal gradient descent with acceleration techniques. In: Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 1, pp. 1574–1582 (2014)
Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30, 349–376 (2020)
Park, Y., Dhar, S., Boyd, S., Shah, M.: Variable metric proximal gradient method with diagonal Barzilai–Borwein stepsize. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3597–3601. IEEE (2020)
Plagianakos, V.P., Magoulas, G.D., Vrahatis, M.N.: Learning rate adaptation in stochastic gradient descent. In: Advances in Convex Analysis and Global Optimization: Honoring the Memory of C. Caratheodory (1873–1950), pp. 433–444 (2001)
Reddi, S.J., Kale, S., Kumar, S.: On the convergence of ADAM and beyond. In: International Conference on Learning Representations (2018)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Roux, N. L., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 2, pp. 2663–2671 (2012)
Saridis, G.N.: Learning applied to successive approximation algorithms. IEEE Trans. Syst. Sci. Cybern. 6, 97–103 (1970)
Schaul, T., Zhang, S., LeCun, Y.: No more pesky learning rates. In: International Conference on Machine Learning, pp. 343–351 (2013)
Schmidt, M., Babanezhad, R., Ahemd, M., Clifton, A., Sarkar, A.: Non-uniform stochastic average gradient method for training conditional random fields. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR, vol. 38, pp. 819–828 (2015)
Schraudolph, N.: Local gain adaptation in stochastic gradient descent. In: Proceedings of ICANN, pp. 569–574. IEE (1999)
Sebag, A., Schoenauer, M., Sebag, M.: Stochastic gradient descent: going as fast as possible but not faster. In: OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning, pp. 1–8 (2017)
Shao, S., Yip, P.P.: Rates of convergence of adaptive step-size of stochastic approximation algorithms. J. Math. Anal. Appl. 244, 333–347 (2000)
Sopyła, K., Drozda, P.: Stochastic gradient descent with Barzilai–Borwein update step for SVM. Inf. Sci. 316, 218–233 (2015)
Tan, C., Ma, S., Dai, Y.-H., Qian, Y.: Barzilai–Borwein step size for stochastic gradient descent. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 685–693 (2016)
Tieleman, T., Hinton, G.: RMSProp: divide the gradient by a running average of its recent magnitude. In: COURSERA: Neural Networks for Machine Learning, vol. 4(2), pp. 26–31 (2012)
Toulis, P., Airoldi, E.M.: Scalable estimation strategies based on stochastic approximations: classical results and new insights. Stat. Comput. 25, 781–795 (2015)
Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3732–3745 (2019)
Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Cubic regularization with momentum for nonconvex optimization. In: Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, PMLR, vol. 115, pp. 313–322 (2020)
Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning, pp. 6677–6686. PMLR (2019)
Wei, F., Bao, C., Liu, Y.: Stochastic Anderson mixing for nonconvex stochastic optimization. Adv. Neural. Inf. Process. Syst. 34, 22995–23008 (2021)
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4151–4161 (2017)
Yang, Z.: On the step size selection in variance-reduced algorithm for nonconvex optimization. Expert Syst. Appl. 169, 114336 (2020)
Yang, Z., Wang, C., Zang, Y., Li, J.: Mini-batch algorithms with Barzilai–Borwein update step. Neurocomputing 314, 177–185 (2018a)
Yang, Z., Wang, C., Zhang, Z., Li, J.: Random Barzilai–Borwein step size for mini-batch algorithms. Eng. Appl. Artif. Intell. 72, 124–135 (2018b)
Yang, Z., Wang, C., Zhang, Z., Li, J.: Accelerated stochastic gradient descent with step size selection rules. Signal Process. 159, 171–186 (2019a)
Yang, Z., Wang, C., Zhang, Z., Li, J.: Mini-batch algorithms with online step size. Knowl. Based Syst. 165, 228–240 (2019b)
Yu, T., Liu, X.-W., Dai, Y.-H., Sun, J.: A variable metric mini-batch proximal stochastic recursive gradient algorithm with diagonal Barzilai–Borwein stepsize (2020). arXiv:2010.00817
Zeiler, M.D.: Adadelta: an adaptive learning rate method (2012). arXiv:1212.5701
Zhao, P., Zhang, T.: Stochastic optimization with importance sampling for regularized loss minimization. In: International Conference on Machine Learning, pp. 1–9. PMLR (2015)
Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925–3936 (2018)
Funding
This work was supported by grants from the China Postdoctoral Science Foundation under Grant 2019M663238. Also, this work was partially supported by Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Author information
Authors and Affiliations
Contributions
Zhuang Yang: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing - original draft, Funding acquisition. Li Ma: Formal analysis, Investigation.
Corresponding author
Ethics declarations
Conflict of interest
Author Zhuang Yang declares that he has no conflict of interest. Author Li Ma declares that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Additional informed consent is obtained from all individual participants for whom identifying information is included in this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Proofs for mS2GD-DBB
1.1 A.1 Proof of Lemma 2
For clarity, we provide a brief proof of Lemma 2.
Proof
From (17), i.e., \(\widetilde{\eta }_k^{SBB2}=\frac{\theta }{m}\cdot \frac{\langle s_{k}, y_{k}\rangle }{\Vert y_k\Vert ^2}\), we have
where the first inequality uses Lemma 1 (i.e., (24)).
In addition, from (16), i.e., \(\widetilde{\eta }_k^{SBB1}=\frac{\theta }{m}\cdot \frac{\Vert s_{k}\Vert ^{2}}{\langle s_{k}, y_{k}\rangle }\), we have
where the first inequality uses Assumption 2.
Further, by the Cauchy–Schwarz inequality, we conclude that \(\widetilde{\eta }_k^{SBB1} > \widetilde{\eta }_k^{SBB2}\).
This completes the proof of Lemma 2. \(\square \)
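As a numeric companion to Lemma 2, the sketch below compares the two step sizes from (16) and (17); the values of \(\theta \) and \(m\) are arbitrary illustrative defaults, not the paper's experimental settings.

```python
import numpy as np

def sbb_steps(s, y, theta=0.5, m=10):
    # Two stochastic BB step sizes, following (16) and (17):
    #   SBB1 = (theta/m) * ||s||^2 / <s, y>
    #   SBB2 = (theta/m) * <s, y>  / ||y||^2
    sbb1 = theta / m * float(s @ s) / float(s @ y)
    sbb2 = theta / m * float(s @ y) / float(y @ y)
    return sbb1, sbb2

# By Cauchy-Schwarz, <s,y>^2 <= ||s||^2 ||y||^2, so whenever
# <s, y> > 0 we have SBB1 >= SBB2, matching Lemma 2.
```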
1.2 A.2 Proof of Theorem 1
Proof
According to the definition of \(w_{k+1}\), i.e., \(w_{k+1}=w_k-(U_s)^{-1} v_k\), in Algorithm 1, we obtain
where the second equality holds due to \(\mathbb {E}[v_k]=\nabla F(w_k)\). In the last equality, we set \(W=[(w_k-w_*)^1, \ldots , (w_k-w_*)^d]^T\), and \(\bar{U}=[(U_s)^{-1}_{1, 1}, \ldots , (U_s)^{-1}_{d, d}]^T\).
Utilizing the Cauchy–Schwarz inequality, we have
where the last inequality holds due to Lemma 3.
Further, combining the convexity of the objective function, F(w), that is \(F(v)\ge F(w)+\langle \nabla F(w), v-w\rangle \), and the result (23), shown in Lemma 1, we have
where we set \(Z=[F(w_k)-F(w_*), \ldots , F(w_k)-F(w_*)]^T\) \(\in \mathbb {R}^d\), and in the last inequality, we use the fact \(\frac{\theta }{mL}\textrm{I}\le U^{-1} \le \frac{\theta }{m\mu }\textrm{I}\).
For inequality (A.1) to hold, it suffices that the following condition is satisfied:
In addition, according to the definition of \(\widetilde{w}_s\) in Algorithm 1, we have
Summing the previous inequality over \(k=0, \ldots , m-1\), taking expectation with respect to the full history, and setting \(\widetilde{w}_{s}=w_{m}\) as in Algorithm 1, we obtain
where the last inequality uses inequality (22) in Assumption 2.
Further, we have
Setting \(\rho =\frac{\mu mb+2\theta ^2Ld}{d\mu \theta mb-4\theta ^2Ld}\) yields the desired bound. \(\square \)
1.3 A.3 Proof of Theorem 2
Proof
Combining \(w_{k+1}=w_k-(U_s)^{-1} v_k\) and Lemma 4, we reach the following conclusion
where the second inequality uses the Cauchy–Schwarz inequality and the second equality uses the definition \(v_k=\nabla F_S(w_k)-\nabla F_S(\tilde{w})+\nabla F(\tilde{w})\).
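The estimator \(v_k\) above is the standard SVRG-type variance-reduced gradient. The toy least-squares sketch below (an illustrative assumption, not the paper's code) shows that averaging \(v_k\) over single-sample batches recovers the full gradient, i.e., \(\mathbb {E}[v_k]=\nabla F(w_k)\).

```python
import numpy as np

def vr_gradient(A, b, w, w_tilde, batch):
    # SVRG-type estimator v = grad_S(w) - grad_S(w_tilde) + full_grad(w_tilde)
    # for the toy least-squares loss F(w) = 1/(2n) ||A w - b||^2.
    n = A.shape[0]
    AS, bS = A[batch], b[batch]
    g_S = AS.T @ (AS @ w - bS) / len(batch)
    g_S_tilde = AS.T @ (AS @ w_tilde - bS) / len(batch)
    full_tilde = A.T @ (A @ w_tilde - b) / n
    return g_S - g_S_tilde + full_tilde
```

The two stochastic terms share the same batch, so their sampling noise cancels in expectation while the estimator stays unbiased.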
Additionally, according to the fact \((a_1+a_2+ \cdots +a_n)^2\le n(a_1^2+a_2^2+ \cdots + a_n^2)\), we have
where the first inequality holds due to Assumption 3.
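The elementary inequality \((a_1+a_2+\cdots +a_n)^2\le n(a_1^2+a_2^2+\cdots +a_n^2)\) invoked above follows from the Cauchy–Schwarz inequality applied to \((a_1,\ldots ,a_n)\) and the all-ones vector; a quick numeric check:

```python
import numpy as np

def sum_sq_bound_holds(a, tol=1e-9):
    # Checks (a_1 + ... + a_n)^2 <= n * (a_1^2 + ... + a_n^2),
    # a consequence of Cauchy-Schwarz with the all-ones vector.
    a = np.asarray(a, dtype=float)
    return float(a.sum())**2 <= len(a) * float((a**2).sum()) + tol
```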
Summing inequality (A.2), we readily arrive at
Further, we obtain
Since \(\mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]=\frac{1}{m}\sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2\), we deduce
where this inequality uses the condition \(w_*=\mathop {\mathrm {arg\,min}} F(w)\).
Note that, according to the inequality (4)
we observe that the right-hand side of the above inequality defines a quadratic function of \(w\), which attains its minimum at the point \(\bar{w}=v-\frac{\nabla f(v)}{L}\).
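As a quick sanity check of this step, the quadratic upper bound \(f(v)+\langle \nabla f(v), w-v\rangle +\frac{L}{2}\Vert w-v\Vert ^2\) from inequality (4) is indeed minimized at \(\bar{w}=v-\frac{\nabla f(v)}{L}\); the sketch below evaluates the bound directly (the toy data are illustrative assumptions).

```python
import numpy as np

def quad_upper_bound(f_v, g_v, v, w, L):
    # Right-hand side of inequality (4): the quadratic model of f around v,
    # f(v) + <grad f(v), w - v> + (L/2) ||w - v||^2.
    d = w - v
    return f_v + float(g_v @ d) + 0.5 * L * float(d @ d)
```

Setting the gradient of this model to zero, \(\nabla f(v)+L(w-v)=0\), gives exactly \(\bar{w}=v-\nabla f(v)/L\).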
Practically, to satisfy the above inequality, it suffices to satisfy the following condition:
Taking \(w=w_*\) and \(v=w_0\) in the above inequality, we obtain
Therefore, for inequality (A.3) to hold, it suffices that the following condition is satisfied:
According to the definitions \(\tilde{w}_{s+1}=w_m\) and \(w_0=\tilde{w}_s\) in Algorithm 1, we obtain
Summing the above inequality over \(s=0, 1, \ldots , \hat{S}-1\) yields the result.
\(\square \)
Appendix B Proofs for SGD-DBB
1.1 B.1 Proof of Theorem 3
Proof
According to \(w_{k+1}=w_k-(U_s)^{-1}\nabla F_S(w_k)\), defined in Algorithm 2, we deduce
where the first inequality holds due to \(\mathbb {E}[\nabla F_S(w_k)]=\nabla F(w_k)\) and the Cauchy–Schwarz inequality, and the second inequality uses (23) from Lemma 1.
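As a concrete (and deliberately simplified) illustration of the update \(w_{k+1}=w_k-(U_s)^{-1}\nabla F_S(w_k)\) with a diagonal metric, the sketch below runs mini-batch SGD with a fixed diagonal matrix standing in for \(U_s\) on a toy least-squares problem; the problem, step scaling, and function name are assumptions, not the paper's experimental setup.

```python
import numpy as np

def sgd_diag_metric(A, b, U_diag, m=50, batch=2, seed=0):
    # Mini-batch SGD with a fixed diagonal metric U (a stand-in for the
    # per-epoch DBB matrix U_s of Algorithm 2), applied to the toy
    # least-squares loss F(w) = 1/(2n) ||A w - b||^2.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(m):
        S = rng.choice(n, size=batch, replace=False)
        g = A[S].T @ (A[S] @ w - b[S]) / batch  # mini-batch gradient
        w = w - g / U_diag                      # (U_s)^{-1} g, U_s diagonal
    return w
```

With a diagonal \(U_s\), the inverse-metric step reduces to an elementwise division, which is the computational appeal of the diagonal selection rule.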
For inequality (B.1) to hold, it suffices that
where the first inequality uses the convexity of the objective function, i.e., \(F(v)\ge F(w)+\langle \nabla F(w), v-w\rangle \), together with \(w_*=\mathop {\mathrm {arg\,min}}\nolimits _{w\in \mathbb {R}^d} F(w)\) and \(U^{-1}\le \frac{\theta }{m\mu }\textrm{I}\).
By summing the above inequality over \(k=0, 1, \ldots , m-1\), we obtain
where this inequality uses \(\tilde{w}_{s+1}=w_m\), as defined in Algorithm 2.
Since \(\mathbb {E}[F(\tilde{w}_s)]=\frac{1}{m}\sum _{k=0}^{m-1}\mathbb {E}[F(w_k)]\), taking expectation on both sides, we obtain
Further, we have
where the first inequality holds due to the definition \(w_0=\tilde{w}_s\) and the second inequality holds because of Assumption 2 (i.e., (22)).
Finally, summing the above inequality, we obtain
where we set \(\tau =\frac{m\mu }{md\mu \theta -Ld\theta ^2}\). \(\square \)
1.2 B.2 Proof of Theorem 4
Proof
Combining the fact \(w_{k+1}=w_k-(U_s)^{-1}\nabla F_S(w_k)\) and Lemma 4, we have
where the second inequality holds due to the Cauchy–Schwarz inequality and the last inequality holds because \(\Vert a+b\Vert ^2\le 2(\Vert a\Vert ^2+\Vert b\Vert ^2)\).
By virtue of Assumption 3 and taking expectation on both sides of (B.2), we obtain
where the second inequality uses the condition \(\mathbb {E}[\nabla F_S(w_k)]=\nabla F(w_k)\).
Summing inequality (B.3) over \(k=0, 1, \ldots , m-1\), we readily obtain the following result:
Further, according to (B.4), we arrive at
Since \(\mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]=\frac{1}{m}\sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2\), we further derive
where the second inequality uses the condition \(w_*=\mathop {\mathrm {arg\,min}} F(w)\).
In addition, according to \(w_0=\tilde{w}_s\) and \(\tilde{w}_{s+1}=w_m\), we deduce
For (B.5) to hold, it suffices that the following condition is satisfied:
where this inequality uses the conclusion in (A.4).
Finally, summing inequality (B.6) over \(s=0, 1, \ldots , \hat{S}-1\), we obtain the desired result. \(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, Z., Ma, L. Adaptive step size rules for stochastic optimization in large-scale learning. Stat Comput 33, 45 (2023). https://doi.org/10.1007/s11222-023-10218-2