Abstract
The importance of the step size in stochastic optimization has been confirmed both theoretically and empirically over the past few decades, and it has been reconsidered in recent years, especially for large-scale learning. Different rules for selecting the step size have been discussed since the advent of stochastic approximation methods. The first part of this work reviews several representative techniques for setting the step size, covering heuristic rules, meta-learning procedures, adaptive step size techniques, and line search techniques. The second part proposes a novel class of accelerated stochastic optimization methods that resort to the Barzilai–Borwein (BB) technique with a diagonal selection rule for the metric, termed DBB. We first explore the theoretical and empirical properties of variance-reduced stochastic optimization algorithms with DBB; in particular, we study the theoretical and numerical properties of the resulting method in the strongly convex and non-convex cases, respectively. To further demonstrate the efficacy of the DBB step size schedule, we extend it to more general stochastic optimization methods, whose theoretical and empirical properties are also developed under different settings. Extensive numerical results on machine learning problems are offered, suggesting that the proposed algorithms show much promise.
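As background for the BB technique underlying DBB, the scalar BB step size and a diagonal per-coordinate analogue can be sketched as follows. This is a minimal illustration in Python: the function names, the safeguard `eps`, and the exact diagonal rule are assumptions for exposition, not the paper's precise construction.

```python
import numpy as np

def bb_step_size(s, y):
    # Classical BB1 step size: ||s||^2 / <s, y>,
    # with s = w_k - w_{k-1} and y = g_k - g_{k-1}.
    return float(s @ s) / float(s @ y)

def diagonal_bb(s, y, eps=1e-8):
    # Illustrative diagonal (per-coordinate) BB rule: coordinate i gets
    # step s_i^2 / (s_i * y_i), safeguarded against (near-)zero or
    # negative local curvature estimates.
    return s**2 / np.maximum(s * y, eps)
```

The diagonal variant yields a per-coordinate metric, which is the idea behind using a diagonal matrix \(U_s\) in place of a single scalar step size.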
Notes
The code is available at https://github.com/Zane4YZ/STCO.
rcv1, news20, covtype and ijcnn1 can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
sido0 and cina0 can be downloaded from http://www.causality.inf.ethz.ch/challenge.php?page=datasets.
CIFAR-10 can be accessed from http://www.cs.toronto.edu/~kriz/cifar.html.
MNIST can be downloaded from http://yann.lecun.com/exdb/mnist/.
References
Antoniadis, A., Gijbels, I., Nikolova, M.: Penalized likelihood regression for generalized linear models with non-quadratic penalties. Ann. Inst. Stat. Math. 63, 585–615 (2011)
Asi, H., Duchi, J.C.: Stochastic (approximate) proximal point methods: convergence, optimality, and adaptivity. SIAM J. Optim. 29, 2257–2290 (2019)
Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014)
Baker, J., Fearnhead, P., Fox, E.B., Nemeth, C.: Control variates for stochastic gradient MCMC. Stat. Comput. 29, 599–615 (2019)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8, 141–148 (1988)
Baydin, A.G., Cornish, R., Rubio, D.M., Schmidt, M., Wood, F.: Online learning rate adaptation with hypergradient descent. In: International Conference on Learning Representations (2018)
Benveniste, A., Métivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations, vol. 22. Springer, Berlin (2012)
Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., Anandkumar, A.: signsgd: compressed optimisation for non-convex problems. In: International Conference on Machine Learning, pp. 560–569 (2018)
Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, Berlin (2009)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26, 1008–1031 (2016)
Chowdhury, A.K.R., Chellappa, R.: Stochastic approximation and rate-distortion analysis for robust structure and motion estimation. Int. J. Comput. Vis. 55, 27–53 (2003)
Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 1647–1655 (2011)
Crisci, S., Porta, F., Ruggiero, V., Zanni, L.: Spectral properties of Barzilai–Borwein rules in solving singly linearly constrained optimization problems subject to lower and upper bounds. SIAM J. Optim. 30, 1300–1326 (2020)
Csiba, D., Qu, Z., Richtárik, P.: Stochastic dual coordinate ascent with adaptive probabilities. In: International Conference on Machine Learning, pp. 674–683 (2015)
Delyon, B., Juditsky, A.: Accelerated stochastic approximation. SIAM J. Optim. 3, 868–881 (1993)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Ekblom, J., Blomvall, J.: Importance sampling in stochastic optimization: an application to intertemporal portfolio choice. Eur. J. Oper. Res. 285, 106–119 (2020)
Ermoliev, Y.: Stochastic quasigradient methods. In: Numerical Techniques for Stochastic Optimization, pp. 141–185. Springer (1988)
Fang, C., Li, C.J., Lin, Z., Zhang, T.: SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 687–697 (2018)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23, 2341–2368 (2013)
Huang, Y., Liu, H.: Smoothing projected Barzilai–Borwein method for constrained non-lipschitz optimization. Comput. Optim. Appl. 65, 671–698 (2016)
Jacobs, R.A.: Increased rates of convergence through learning rate adaptation. Neural Netw. 1, 295–307 (1988)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
Karimireddy, S.P., Rebjock, Q., Stich, S., Jaggi, M.: Error feedback fixes signsgd and other gradient compression schemes. In: International Conference on Machine Learning, pp. 3252–3261 (2019)
Kesten, H.: Accelerated stochastic approximation. Ann. Math. Stat. 29(1), 41–59 (1958)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
Klein, S., Pluim, J.P., Staring, M., Viergever, M.A.: Adaptive stochastic gradient descent optimisation for image registration. Int. J. Comput. Vis. 81, 227 (2009)
Konečnỳ, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10, 242–255 (2016)
Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and katyusha are better without the outer loop. In: Proceedings of the 31st International Conference on Algorithmic Learning Theory, PMLR, vol. 117, pp. 451–467 (2020)
Lei, Y., Tang, K.: Learning rates for stochastic gradient descent with nonconvex objectives. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4505–4511 (2021)
Liang, J., Xu, Y., Bao, C., Quan, Y., Ji, H.: Barzilai–Borwein-based adaptive learning rate for deep learning. Pattern Recogn. Lett. 128, 197–203 (2019)
Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic polyak step-size for sgd: an adaptive learning rate for fast convergence. In: International Conference on Artificial Intelligence and Statistics, pp. 1306–1314. PMLR (2021)
Mahsereci, M., Hennig, P.: Probabilistic line searches for stochastic optimization. J. Mach. Learn. Res. 18, 4262–4320 (2017)
Mokhtari, A., Ribeiro, A.: Stochastic quasi-Newton methods. In: Proceedings of the IEEE (2020)
Nesterov, Y.: Introductory Lectures on Convex Optimization: Basic Course. Kluwer Academic, Dordrecht (2004)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning-Volume 70, pp. 2613–2621 (2017)
Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22, 9397–9440 (2021)
Nitanda, A.: Stochastic proximal gradient descent with acceleration techniques. In: Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 1, pp. 1574–1582 (2014)
Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30, 349–376 (2020)
Park, Y., Dhar, S., Boyd, S., Shah, M.: Variable metric proximal gradient method with diagonal Barzilai–Borwein stepsize. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3597–3601. IEEE (2020)
Plagianakos, V.P., Magoulas, G.D., Vrahatis, M.N.: Learning rate adaptation in stochastic gradient descent. In: Advances in Convex Analysis and Global Optimization: Honoring the Memory of C. Caratheodory (1873–1950), pp. 433–444 (2001)
Reddi, S.J., Kale, S., Kumar, S.: On the convergence of ADAM and beyond. In: International Conference on Learning Representations (2018)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Roux, N. L., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 2, pp. 2663–2671 (2012)
Saridis, G.N.: Learning applied to successive approximation algorithms. IEEE Trans. Syst. Sci. Cybern. 6, 97–103 (1970)
Schaul, T., Zhang, S., LeCun, Y.: No more pesky learning rates. In: International Conference on Machine Learning, pp. 343–351 (2013)
Schmidt, M., Babanezhad, R., Ahemd, M., Clifton, A., Sarkar, A.: Non-uniform stochastic average gradient method for training conditional random fields. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR, vol. 38, pp. 819–828 (2015)
Schraudolph, N.: Local gain adaptation in stochastic gradient descent. In: Proceedings of ICANN, pp. 569–574. IEE (1999)
Sebag, A., Schoenauer, M., Sebag, M.: Stochastic gradient descent: going as fast as possible but not faster. In: OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning, pp. 1–8 (2017)
Shao, S., Yip, P.P.: Rates of convergence of adaptive step-size of stochastic approximation algorithms. J. Math. Anal. Appl. 244, 333–347 (2000)
Sopyła, K., Drozda, P.: Stochastic gradient descent with Barzilai–Borwein update step for SVM. Inf. Sci. 316, 218–233 (2015)
Tan, C., Ma, S., Dai, Y.-H., Qian, Y.: Barzilai–Borwein step size for stochastic gradient descent. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 685–693 (2016)
Tieleman, T., Hinton, G.: RMSProp: divide the gradient by a running average of its recent magnitude. In: COURSERA: Neural Networks for Machine Learning, vol. 4(2), pp. 26–31 (2012)
Toulis, P., Airoldi, E.M.: Scalable estimation strategies based on stochastic approximations: classical results and new insights. Stat. Comput. 25, 781–795 (2015)
Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3732–3745 (2019)
Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Cubic regularization with momentum for nonconvex optimization. In: Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, PMLR, vol. 115, pp. 313–322 (2020)
Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning, pp. 6677–6686. PMLR (2019)
Wei, F., Bao, C., Liu, Y.: Stochastic Anderson mixing for nonconvex stochastic optimization. Adv. Neural. Inf. Process. Syst. 34, 22995–23008 (2021)
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4151–4161 (2017)
Yang, Z.: On the step size selection in variance-reduced algorithm for nonconvex optimization. Expert Syst. Appl. 169, 114336 (2020)
Yang, Z., Wang, C., Zang, Y., Li, J.: Mini-batch algorithms with Barzilai–Borwein update step. Neurocomputing 314, 177–185 (2018a)
Yang, Z., Wang, C., Zhang, Z., Li, J.: Random Barzilai–Borwein step size for mini-batch algorithms. Eng. Appl. Artif. Intell. 72, 124–135 (2018b)
Yang, Z., Wang, C., Zhang, Z., Li, J.: Accelerated stochastic gradient descent with step size selection rules. Signal Process. 159, 171–186 (2019a)
Yang, Z., Wang, C., Zhang, Z., Li, J.: Mini-batch algorithms with online step size. Knowl. Based Syst. 165, 228–240 (2019b)
Yu, T., Liu, X.-W., Dai, Y.-H., Sun, J.: A variable metric mini-batch proximal stochastic recursive gradient algorithm with diagonal Barzilai–Borwein stepsize (2020). arXiv:2010.00817
Zeiler, M.D.: Adadelta: an adaptive learning rate method (2012). arXiv:1212.5701
Zhao, P., Zhang, T.: Stochastic optimization with importance sampling for regularized loss minimization. In: International Conference on Machine Learning, pp. 1–9. PMLR (2015)
Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925–3936 (2018)
Funding
This work was supported by grants from the China Postdoctoral Science Foundation under Grant 2019M663238. Also, this work was partially supported by Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Author information
Authors and Affiliations
Contributions
Zhuang Yang: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing - original draft, Funding acquisition. Li Ma: Formal analysis, Investigation.
Corresponding author
Ethics declarations
Conflict of interest
Author Zhuang Yang declares that he has no conflict of interest. Author Li Ma declares that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Additional informed consent is obtained from all individual participants for whom identifying information is included in this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Proofs for mS2GD-DBB
1.1 A.1 Proof of Lemma 2
For clarity, we provide a brief proof of Lemma 2.
Proof
From (17), i.e., \(\widetilde{\eta }_k^{SBB2}=\frac{\theta }{m}\cdot \frac{\langle s_{k}, y_{k}\rangle }{\Vert y_k\Vert ^2}\), we have
where the first inequality uses Lemma 1 (i.e., (24)).
In addition, from (16), i.e., \(\widetilde{\eta }_k^{SBB1}=\frac{\theta }{m}\cdot \frac{\Vert s_{k}\Vert ^{2}}{\langle s_{k}, y_{k}\rangle }\), we have
where the first inequality uses Assumption 2.
Further, by the Cauchy–Schwarz inequality, we conclude that \(\widetilde{\eta }_k^{SBB1} > \widetilde{\eta }_k^{SBB2}\).
This completes the proof of Lemma 2. \(\square \)
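As a numeric companion to Lemma 2, the sketch below compares the two step sizes from (16) and (17); the values of \(\theta \) and \(m\) are arbitrary illustrative defaults, not the paper's experimental settings.

```python
import numpy as np

def sbb_steps(s, y, theta=0.5, m=10):
    # Two stochastic BB step sizes, following (16) and (17):
    #   SBB1 = (theta/m) * ||s||^2 / <s, y>
    #   SBB2 = (theta/m) * <s, y>  / ||y||^2
    sbb1 = theta / m * float(s @ s) / float(s @ y)
    sbb2 = theta / m * float(s @ y) / float(y @ y)
    return sbb1, sbb2

# By Cauchy-Schwarz, <s,y>^2 <= ||s||^2 ||y||^2, so whenever
# <s, y> > 0 we have SBB1 >= SBB2, matching Lemma 2.
```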
1.2 A.2 Proof of Theorem 1
Proof
According to the definition of \(w_{k+1}\), i.e., \(w_{k+1}=w_k-(U_s)^{-1} v_k\), in Algorithm 1, we obtain
where the second equality holds due to \(\mathbb {E}[v_k]=\nabla F(w_k)\). In the last equality, we set \(W=[(w_k-w_*)^1, \ldots , (w_k-w_*)^d]^T\), and \(\bar{U}=[(U_s)^{-1}_{1, 1}, \ldots , (U_s)^{-1}_{d, d}]^T\).
Utilizing the Cauchy–Schwarz inequality, we have
where the last inequality holds due to Lemma 3.
Further, combining the convexity of the objective function, F(w), that is \(F(v)\ge F(w)+\langle \nabla F(w), v-w\rangle \), and the result (23), shown in Lemma 1, we have
where we set \(Z=[F(w_k)-F(w_*), \ldots , F(w_k)-F(w_*)]^T\) \(\in \mathbb {R}^d\), and in the last inequality, we use the fact \(\frac{\theta }{mL}\textrm{I}\le U^{-1} \le \frac{\theta }{m\mu }\textrm{I}\).
For inequality (A.1) to hold, it suffices that the following condition is satisfied:
In addition, according to the definition of \(\widetilde{w}_s\) in Algorithm 1, we have
Summing the previous inequality over \(k=0, \ldots , m-1\), taking expectation with respect to the full history, and setting \(\widetilde{w}_{s}=w_{m}\) as in Algorithm 1, we obtain
where the last inequality uses inequality (22) in Assumption 2.
Further, we have
Setting \(\rho =\frac{\mu mb+2\theta ^2Ld}{d\mu \theta mb-4\theta ^2Ld}\) yields the desired bound. \(\square \)
1.3 A.3 Proof of Theorem 2
Proof
Combining \(w_{k+1}=w_k-(U_s)^{-1} v_k\) and Lemma 4, we reach the following conclusion
where the second inequality uses the Cauchy–Schwarz inequality and the second equality uses the definition \(v_k=\nabla F_S(w_k)-\nabla F_S(\tilde{w})+\nabla F(\tilde{w})\).
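The estimator \(v_k\) above is the standard SVRG-type variance-reduced gradient. The toy least-squares sketch below (an illustrative assumption, not the paper's code) shows that averaging \(v_k\) over single-sample batches recovers the full gradient, i.e., \(\mathbb {E}[v_k]=\nabla F(w_k)\).

```python
import numpy as np

def vr_gradient(A, b, w, w_tilde, batch):
    # SVRG-type estimator v = grad_S(w) - grad_S(w_tilde) + full_grad(w_tilde)
    # for the toy least-squares loss F(w) = 1/(2n) ||A w - b||^2.
    n = A.shape[0]
    AS, bS = A[batch], b[batch]
    g_S = AS.T @ (AS @ w - bS) / len(batch)
    g_S_tilde = AS.T @ (AS @ w_tilde - bS) / len(batch)
    full_tilde = A.T @ (A @ w_tilde - b) / n
    return g_S - g_S_tilde + full_tilde
```

The two stochastic terms share the same batch, so their sampling noise cancels in expectation while the estimator stays unbiased.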
Additionally, according to the fact \((a_1+a_2+ \cdots +a_n)^2\le n(a_1^2+a_2^2+ \cdots + a_n^2)\), we have
where the first inequality holds due to Assumption 3.
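The elementary inequality \((a_1+a_2+\cdots +a_n)^2\le n(a_1^2+a_2^2+\cdots +a_n^2)\) invoked above follows from the Cauchy–Schwarz inequality applied to \((a_1,\ldots ,a_n)\) and the all-ones vector; a quick numeric check:

```python
import numpy as np

def sum_sq_bound_holds(a, tol=1e-9):
    # Checks (a_1 + ... + a_n)^2 <= n * (a_1^2 + ... + a_n^2),
    # a consequence of Cauchy-Schwarz with the all-ones vector.
    a = np.asarray(a, dtype=float)
    return float(a.sum())**2 <= len(a) * float((a**2).sum()) + tol
```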
Summing inequality (A.2), we readily arrive at
Further, we obtain
Since \(\mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]=\frac{1}{m}\sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2\), we deduce
where this inequality uses the condition \(w_*=\mathop {\mathrm {arg\,min}} F(w)\).
Note that, according to the inequality (4)
we observe that the right-hand side of the above inequality defines a quadratic function of \(w\), which attains its minimum at the point \(\bar{w}=v-\frac{\nabla f(v)}{L}\).
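As a quick sanity check of this step, the quadratic upper bound \(f(v)+\langle \nabla f(v), w-v\rangle +\frac{L}{2}\Vert w-v\Vert ^2\) from inequality (4) is indeed minimized at \(\bar{w}=v-\frac{\nabla f(v)}{L}\); the sketch below evaluates the bound directly (the toy data are illustrative assumptions).

```python
import numpy as np

def quad_upper_bound(f_v, g_v, v, w, L):
    # Right-hand side of inequality (4): the quadratic model of f around v,
    # f(v) + <grad f(v), w - v> + (L/2) ||w - v||^2.
    d = w - v
    return f_v + float(g_v @ d) + 0.5 * L * float(d @ d)
```

Setting the gradient of this model to zero, \(\nabla f(v)+L(w-v)=0\), gives exactly \(\bar{w}=v-\nabla f(v)/L\).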
Practically, to satisfy the above inequality, it suffices to satisfy the following condition:
Taking \(w=w_*\) and \(v=w_0\) in the above inequality, we obtain
Therefore, for inequality (A.3) to hold, it suffices that the following condition is satisfied:
According to the definitions \(\tilde{w}_{s+1}=w_m\) and \(w_0=\tilde{w}_s\) in Algorithm 1, we obtain
Summing the above inequality over \(s=0, 1, \ldots , \hat{S}-1\) yields the result.
\(\square \)
Appendix B Proofs for SGD-DBB
1.1 B.1 Proof of Theorem 3
Proof
According to \(w_{k+1}=w_k-(U_s)^{-1}\nabla F_S(w_k)\), defined in Algorithm 2, we deduce
where the first inequality holds due to \(\mathbb {E}[\nabla F_S(w_k)]=\nabla F(w_k)\) and the Cauchy–Schwarz inequality, and the second inequality uses (23) from Lemma 1.
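As a concrete (and deliberately simplified) illustration of the update \(w_{k+1}=w_k-(U_s)^{-1}\nabla F_S(w_k)\) with a diagonal metric, the sketch below runs mini-batch SGD with a fixed diagonal matrix standing in for \(U_s\) on a toy least-squares problem; the problem, step scaling, and function name are assumptions, not the paper's experimental setup.

```python
import numpy as np

def sgd_diag_metric(A, b, U_diag, m=50, batch=2, seed=0):
    # Mini-batch SGD with a fixed diagonal metric U (a stand-in for the
    # per-epoch DBB matrix U_s of Algorithm 2), applied to the toy
    # least-squares loss F(w) = 1/(2n) ||A w - b||^2.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(m):
        S = rng.choice(n, size=batch, replace=False)
        g = A[S].T @ (A[S] @ w - b[S]) / batch  # mini-batch gradient
        w = w - g / U_diag                      # (U_s)^{-1} g, U_s diagonal
    return w
```

With a diagonal \(U_s\), the inverse-metric step reduces to an elementwise division, which is the computational appeal of the diagonal selection rule.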
For inequality (B.1) to hold, it suffices that
where the first inequality uses the convexity of the objective function, i.e., \(F(v)\ge F(w)+\langle \nabla F(w), v-w\rangle \), together with \(w_*=\mathop {\mathrm {arg\,min}}\nolimits _{w\in \mathbb {R}^d} F(w)\) and \(U^{-1}\le \frac{\theta }{m\mu }\textrm{I}\).
By summing the above inequality over \(k=0, 1, \ldots , m-1\), we obtain
where this inequality uses \(\tilde{w}_{s+1}=w_m\), as defined in Algorithm 2.
Since \(\mathbb {E}[F(\tilde{w}_s)]=\frac{1}{m}\sum _{k=0}^{m-1}\mathbb {E}[F(w_k)]\), taking expectation on both sides, we obtain
Further, we have
where the first inequality holds due to the definition \(w_0=\tilde{w}_s\) and the second inequality holds because of Assumption 2 (i.e., (22)).
Finally, summing the above inequality, we obtain
where we set \(\tau =\frac{m\mu }{md\mu \theta -Ld\theta ^2}\). \(\square \)
1.2 B.2 Proof of Theorem 4
Proof
Combining the fact \(w_{k+1}=w_k-(U_s)^{-1}\nabla F_S(w_k)\) and Lemma 4, we have
where the second inequality holds due to the Cauchy–Schwarz inequality and the last inequality holds because \(\Vert a+b\Vert ^2\le 2(\Vert a\Vert ^2+\Vert b\Vert ^2)\).
By virtue of Assumption 3 and taking expectation on both sides of (B.2), we obtain
where the second inequality uses the condition \(\mathbb {E}[\nabla F_S(w_k)]=\nabla F(w_k)\).
Summing inequality (B.3) over \(k=0, 1, \ldots , m-1\), we readily obtain the following result:
Further, according to (B.4), we arrive at
Since \(\mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]=\frac{1}{m}\sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2\), we further derive
where the second inequality uses the condition \(w_*=\mathop {\mathrm {arg\,min}} F(w)\).
In addition, according to \(w_0=\tilde{w}_s\) and \(\tilde{w}_{s+1}=w_m\), we deduce
For (B.5) to hold, it suffices that the following condition is satisfied:
where this inequality uses the conclusion in (A.4).
Finally, summing inequality (B.6) over \(s=0, 1, \ldots , \hat{S}-1\), we obtain the desired result. \(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, Z., Ma, L. Adaptive step size rules for stochastic optimization in large-scale learning. Stat Comput 33, 45 (2023). https://doi.org/10.1007/s11222-023-10218-2