Adaptive step size rules for stochastic optimization in large-scale learning

  • Original Paper
Statistics and Computing

Abstract

The importance of the step size in stochastic optimization has been confirmed both theoretically and empirically during the past few decades and reconsidered in recent years, especially for large-scale learning. Different rules for selecting the step size have been discussed since the advent of stochastic approximation methods. The first part of this work reviews several representative techniques for setting the step size, covering heuristic rules, meta-learning procedures, adaptive step size techniques and line search techniques. The second part proposes a novel class of accelerated stochastic optimization methods based on the Barzilai–Borwein (BB) technique with a diagonal selection rule for the metric, termed DBB. We first explore the theoretical and empirical properties of variance-reduced stochastic optimization algorithms with DBB. In particular, we study the theoretical and numerical properties of the resulting method in the strongly convex and non-convex cases, respectively. To further demonstrate the efficacy of the DBB step size schedule, we extend it to more general stochastic optimization methods, and develop the theoretical and empirical properties of this extension under different settings. Extensive numerical results on machine learning problems are provided, suggesting that the proposed algorithms show much promise.
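
As a rough illustration of the ingredients behind DBB (the paper's exact selection rule for the diagonal metric is given in the body of the article and is not reproduced here), the sketch below computes the two classical BB step sizes from the iterate difference \(s_k\) and gradient difference \(y_k\), together with a naive per-coordinate (diagonal) BB-type ratio; the per-coordinate rule and all variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
a = rng.uniform(0.5, 5.0, d)           # per-coordinate curvatures of a toy quadratic
s_k = rng.standard_normal(d)           # iterate difference  w_k - w_{k-1}
y_k = a * s_k                          # gradient difference for F(w) = 0.5 * sum(a * w**2)

bb1 = (s_k @ s_k) / (s_k @ y_k)        # long (BB1) step size
bb2 = (s_k @ y_k) / (y_k @ y_k)        # short (BB2) step size
dbb_diag = (s_k * s_k) / (s_k * y_k)   # naive per-coordinate BB-type ratio (illustrative only)

print(bb1, bb2)                        # scalar steps; both lie in [1/max(a), 1/min(a)]
print(dbb_diag)                        # equals 1/a for this separable quadratic
```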

Notes

  1. The code can be found at https://github.com/Zane4YZ/STCO.

  2. rcv1, news20, covtype and ijcnn1 can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (a minimal loading example is given after these notes).

  3. sido0 and cina0 can be downloaded from http://www.causality.inf.ethz.ch/challenge.php?page=datasets.

  4. CIFAR-10 can be accessed from http://www.cs.toronto.edu/~kriz/cifar.html.

  5. MNIST can be downloaded from http://yann.lecun.com/exdb/mnist/.
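
As referenced in notes 2 and 3, several of the benchmark datasets are distributed in LIBSVM format. A minimal loading sketch is given below; it assumes scikit-learn is available and that the file has already been downloaded to a local path (the paper does not specify its data-loading pipeline).

```python
from sklearn.datasets import load_svmlight_file

# Load a LIBSVM-format file, e.g. ijcnn1 downloaded from the URL in note 2.
# The local file name is a placeholder.
X, y = load_svmlight_file("ijcnn1")
print(X.shape, y.shape)   # sparse feature matrix and label vector
```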

References

  • Antoniadis, A., Gijbels, I., Nikolova, M.: Penalized likelihood regression for generalized linear models with non-quadratic penalties. Ann. Inst. Stat. Math. 63, 585–615 (2011)

  • Asi, H., Duchi, J.C.: Stochastic (approximate) proximal point methods: convergence, optimality, and adaptivity. SIAM J. Optim. 29, 2257–2290 (2019)

  • Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014)

  • Baker, J., Fearnhead, P., Fox, E.B., Nemeth, C.: Control variates for stochastic gradient MCMC. Stat. Comput. 29, 599–615 (2019)

  • Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8, 141–148 (1988)

  • Baydin, A.G., Cornish, R., Rubio, D.M., Schmidt, M., Wood, F.: Online learning rate adaptation with hypergradient descent. In: International Conference on Learning Representations (2018)

  • Benveniste, A., Métivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations, vol. 22. Springer, Berlin (2012)

  • Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., Anandkumar, A.: signsgd: compressed optimisation for non-convex problems. In: International Conference on Machine Learning, pp. 560–569 (2018)

  • Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, Berlin (2009)

  • Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26, 1008–1031 (2016)

  • Chowdhury, A.K.R., Chellappa, R.: Stochastic approximation and rate-distortion analysis for robust structure and motion estimation. Int. J. Comput. Vis. 55, 27–53 (2003)

  • Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 1647–1655 (2011)

  • Crisci, S., Porta, F., Ruggiero, V., Zanni, L.: Spectral properties of Barzilai–Borwein rules in solving singly linearly constrained optimization problems subject to lower and upper bounds. SIAM J. Optim. 30, 1300–1326 (2020)

  • Csiba, D., Qu, Z., Richtárik, P.: Stochastic dual coordinate ascent with adaptive probabilities. In: International Conference on Machine Learning, pp. 674–683 (2015)

  • Delyon, B., Juditsky, A.: Accelerated stochastic approximation. SIAM J. Optim. 3, 868–881 (1993)

  • Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  • Ekblom, J., Blomvall, J.: Importance sampling in stochastic optimization: an application to intertemporal portfolio choice. Eur. J. Oper. Res. 285, 106–119 (2020)

  • Ermoliev, Y.: Stochastic quasigradient methods. In: Numerical Techniques for Stochastic Optimization, pp. 141–185. Springer (1988)

  • Fang, C., Li, C.J., Lin, Z., Zhang, T.: SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 687–697 (2018)

  • Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23, 2341–2368 (2013)

  • Huang, Y., Liu, H.: Smoothing projected Barzilai–Borwein method for constrained non-lipschitz optimization. Comput. Optim. Appl. 65, 671–698 (2016)

  • Jacobs, R.A.: Increased rates of convergence through learning rate adaptation. Neural Netw. 1, 295–307 (1988)

  • Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  • Karimireddy, S.P., Rebjock, Q., Stich, S., Jaggi, M.: Error feedback fixes signsgd and other gradient compression schemes. In: International Conference on Machine Learning, pp. 3252–3261 (2019)

  • Kesten, H.: Accelerated stochastic approximation. Ann. Math. Stat. 29(1), 41–59 (1958)

  • Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  • Klein, S., Pluim, J.P., Staring, M., Viergever, M.A.: Adaptive stochastic gradient descent optimisation for image registration. Int. J. Comput. Vis. 81, 227 (2009)

  • Konečnỳ, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10, 242–255 (2016)

  • Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and katyusha are better without the outer loop. In: Proceedings of the 31st International Conference on Algorithmic Learning Theory, PMLR, vol. 117, pp. 451–467 (2020)

  • Lei, Y., Tang, K.: Learning rates for stochastic gradient descent with nonconvex objectives. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4505–4511 (2021)

  • Liang, J., Xu, Y., Bao, C., Quan, Y., Ji, H.: Barzilai–Borwein-based adaptive learning rate for deep learning. Pattern Recogn. Lett. 128, 197–203 (2019)

  • Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: an adaptive learning rate for fast convergence. In: International Conference on Artificial Intelligence and Statistics, pp. 1306–1314. PMLR (2021)

  • Mahsereci, M., Hennig, P.: Probabilistic line searches for stochastic optimization. J. Mach. Learn. Res. 18, 4262–4320 (2017)

  • Mokhtari, A., Ribeiro, A.: Stochastic quasi-Newton methods. In: Proceedings of the IEEE (2020)

  • Nesterov, Y.: Introductory Lectures on Convex Optimization: Basic Course. Kluwer Academic, Dordrecht (2004)

  • Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning-Volume 70, pp. 2613–2621 (2017)

  • Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22, 9397–9440 (2021)

  • Nitanda, A.: Stochastic proximal gradient descent with acceleration techniques. In: Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 1, pp. 1574–1582 (2014)

  • Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30, 349–376 (2020)

  • Park, Y., Dhar, S., Boyd, S., Shah, M.: Variable metric proximal gradient method with diagonal Barzilai–Borwein stepsize. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3597–3601. IEEE (2020)

  • Plagianakos, V.P., Magoulas, G.D., Vrahatis, M.N.: Learning rate adaptation in stochastic gradient descent. In: Advances in Convex Analysis and Global Optimization: Honoring the Memory of C. Caratheodory (1873–1950), pp. 433–444 (2001)

  • Reddi, S.J., Kale, S., Kumar, S.: On the convergence of ADAM and beyond. In: International Conference on Learning Representations (2018)

  • Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  • Roux, N. L., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 2, pp. 2663–2671 (2012)

  • Saridis, G.N.: Learning applied to successive approximation algorithms. IEEE Trans. Syst. Sci. Cybern. 6, 97–103 (1970)

  • Schaul, T., Zhang, S., LeCun, Y.: No more pesky learning rates. In: International Conference on Machine Learning, pp. 343–351 (2013)

  • Schmidt, M., Babanezhad, R., Ahemd, M., Clifton, A., Sarkar, A.: Non-uniform stochastic average gradient method for training conditional random fields. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR, vol. 38, pp. 819–828 (2015)

  • Schraudolph, N.: Local gain adaptation in stochastic gradient descent. In: Proceedings of ICANN, pp. 569–574. IEE (1999)

  • Sebag, A., Schoenauer, M., Sebag, M.: Stochastic gradient descent: going as fast as possible but not faster. In: OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning, pp. 1–8 (2017)

  • Shao, S., Yip, P.P.: Rates of convergence of adaptive step-size of stochastic approximation algorithms. J. Math. Anal. Appl. 244, 333–347 (2000)

  • Sopyła, K., Drozda, P.: Stochastic gradient descent with Barzilai–Borwein update step for SVM. Inf. Sci. 316, 218–233 (2015)

  • Tan, C., Ma, S., Dai, Y.H., Qian, Y.: Barzilai–Borwein step size for stochastic gradient descent. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 685–693 (2016)

  • Tieleman, T., Hinton, G.: Rmsprop: Divide the gradient by a running average of its recent magnitude. In: COURSERA: Neural Networks for Machine Learning, vol. 4(2), pp. 26–31 (2012)

  • Toulis, P., Airoldi, E.M.: Scalable estimation strategies based on stochastic approximations: classical results and new insights. Stat. Comput. 25, 781–795 (2015)

  • Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3732–3745 (2019)

  • Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Cubic regularization with momentum for nonconvex optimization. In: Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, PMLR, vol. 115, pp. 313–322 (2020)

  • Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning, pp. 6677–6686. PMLR (2019)

  • Wei, F., Bao, C., Liu, Y.: Stochastic Anderson mixing for nonconvex stochastic optimization. Adv. Neural. Inf. Process. Syst. 34, 22995–23008 (2021)

  • Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4151–4161 (2017)

  • Yang, Z.: On the step size selection in variance-reduced algorithm for nonconvex optimization. Expert Syst. Appl. 169, 114336 (2020)

  • Yang, Z., Wang, C., Zang, Y., Li, J.: Mini-batch algorithms with Barzilai–Borwein update step. Neurocomputing 314, 177–185 (2018a)

  • Yang, Z., Wang, C., Zhang, Z., Li, J.: Random Barzilai–Borwein step size for mini-batch algorithms. Eng. Appl. Artif. Intell. 72, 124–135 (2018b)

  • Yang, Z., Wang, C., Zhang, Z., Li, J.: Accelerated stochastic gradient descent with step size selection rules. Signal Process. 159, 171–186 (2019a)

  • Yang, Z., Wang, C., Zhang, Z., Li, J.: Mini-batch algorithms with online step size. Knowl. Based Syst. 165, 228–240 (2019b)

  • Yu, T., Liu, X.-W., Dai, Y.-H., Sun, J.: A variable metric mini-batch proximal stochastic recursive gradient algorithm with diagonal Barzilai–Borwein stepsize (2020). arXiv:2010.00817

  • Zeiler, M.D.: Adadelta: an adaptive learning rate method (2012). arXiv:1212.5701

  • Zhao, P., Zhang, T.: Stochastic optimization with importance sampling for regularized loss minimization. In: International Conference on Machine Learning, pp. 1–9. PMLR (2015)

  • Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925–3936 (2018)

Funding

This work was supported by the China Postdoctoral Science Foundation under Grant 2019M663238, and partially supported by a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Author information

Contributions

Zhuang Yang: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing - original draft, Funding acquisition. Li Ma: Formal analysis, Investigation.

Corresponding author

Correspondence to Zhuang Yang.

Ethics declarations

Conflict of interest

Author Zhuang Yang declares that he has no conflict of interest. Author Li Ma declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Additional informed consent was obtained from all individual participants for whom identifying information is included in this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Proofs for mS2GD-DBB

A.1 Proof of Lemma 2

For clarity, we provide a brief proof of Lemma 2.

Proof

From (17), i.e., \(\widetilde{\eta }_k^{SBB2}=\frac{\theta }{m}\cdot \frac{\langle s_{k}, y_{k}\rangle }{\Vert y_k\Vert ^2}\), we have

$$\begin{aligned} \widetilde{\eta }_k^{SBB2}\ge \frac{\theta }{m}\cdot \frac{1}{L}\cdot \frac{\Vert y_k\Vert ^2}{\Vert y_k\Vert ^2}=\frac{\theta }{mL}, \end{aligned}$$

where the inequality follows from Lemma 1 (i.e., inequality (24)).

In addition, from (16), i.e., \(\widetilde{\eta }_k^{SBB1}=\frac{\theta }{m}\cdot \frac{\Vert s_{k}\Vert ^{2}}{\langle s_{k}, y_{k}\rangle }\), we have

$$\begin{aligned} \widetilde{\eta }_k^{SBB1} \le \frac{\theta }{m}\cdot \frac{\Vert s_k\Vert ^2}{\mu \Vert s_k\Vert ^2}=\frac{\theta }{\mu m}, \end{aligned}$$

where the inequality follows from Assumption 2.

Further, by the Cauchy–Schwarz inequality, \(\langle s_{k}, y_{k}\rangle ^2\le \Vert s_k\Vert ^{2}\Vert y_k\Vert ^{2}\), and hence \(\widetilde{\eta }_k^{SBB1}=\frac{\theta }{m}\cdot \frac{\Vert s_{k}\Vert ^{2}}{\langle s_{k}, y_{k}\rangle }\ge \frac{\theta }{m}\cdot \frac{\langle s_{k}, y_{k}\rangle }{\Vert y_k\Vert ^{2}}=\widetilde{\eta }_k^{SBB2}\).

This completes the proof of Lemma 2. \(\square \)
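
As a quick numerical sanity check of Lemma 2 (not part of the original proof), the snippet below builds a quadratic objective whose Hessian has spectrum in \([\mu , L]\), so that \(y_k=A s_k\), and verifies the bounds \(\widetilde{\eta }_k^{SBB2}\ge \frac{\theta }{mL}\), \(\widetilde{\eta }_k^{SBB1}\le \frac{\theta }{\mu m}\) and the ordering \(\widetilde{\eta }_k^{SBB1}\ge \widetilde{\eta }_k^{SBB2}\); all parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, theta = 20, 50, 1.0
mu, L = 0.5, 10.0                          # strong-convexity and smoothness constants

# Quadratic F(w) = 0.5 * w^T A w with spectrum(A) contained in [mu, L].
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(rng.uniform(mu, L, d)) @ Q.T

s_k = rng.standard_normal(d)               # iterate difference
y_k = A @ s_k                              # gradient difference

eta_sbb1 = (theta / m) * (s_k @ s_k) / (s_k @ y_k)   # Eq. (16)
eta_sbb2 = (theta / m) * (s_k @ y_k) / (y_k @ y_k)   # Eq. (17)

assert eta_sbb2 >= theta / (m * L) - 1e-12           # lower bound in Lemma 2
assert eta_sbb1 <= theta / (mu * m) + 1e-12          # upper bound in Lemma 2
assert eta_sbb1 >= eta_sbb2 - 1e-12                  # Cauchy-Schwarz ordering
print(eta_sbb1, eta_sbb2)
```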

A.2 Proof of Theorem 1

Proof

According to the definition of \(w_{k+1}\), i.e., \(w_{k+1}=w_k-(U_s)^{-1} v_k\), in Algorithm 1, we obtain

$$\begin{aligned} \mathbb {E}[\Vert w_{k+1}-w_*\Vert ^2]&= \mathbb {E}[\Vert w_k-(U_s)^{-1}v_k-w_*\Vert ^2] \\&= \Vert w_k-w_*\Vert ^2-2\langle w_k-w_*, (U_s)^{-1}\nabla F(w_k)\rangle +\Vert (U_s)^{-1}v_k\Vert ^2\\&=\Vert w_k-w_*\Vert ^2-2\langle W, Y\rangle \bar{U}+\Vert (U_s)^{-1}v_k\Vert ^2, \end{aligned}$$

where the second equality holds due to \(\mathbb {E}[v_k]=\nabla F(w_k)\). In the last equality, we set \(W=[(w_k-w_*)^1, \ldots , (w_k-w_*)^d]^T\), and \(\bar{U}=[(U_s)^{-1}_{1, 1}, \ldots , (U_s)^{-1}_{d, d}]^T\).

Utilizing the Cauchy–Schwarz inequality, we have

$$\begin{aligned} \mathbb {E}[\Vert w_{k+1}-w_*\Vert ^2]&\le \Vert w_k-w_*\Vert ^2-2\langle W, Y\rangle \bar{U}+\Vert (U_s)^{-1}\Vert ^2\Vert v_k\Vert ^2 \\&\le \Vert w_k-w_*\Vert ^2-2\langle W, Y\rangle \bar{U}\\&\quad +\Vert (U_s)^{-1}\Vert ^2\biggl [\frac{4L}{b} \bigl [ F(w_{k})-F(w_{*})+F(\widetilde{w})-F(w_*)\bigr ]+\frac{2}{b}\Vert \nabla F(w_{k-1})\Vert ^2\biggr ], \end{aligned}$$

where the last inequality holds due to Lemma 3.

Further, combining the convexity of the objective function F(w), i.e., \(F(v)\ge F(w)+\langle \nabla F(w), v-w\rangle \), with the result (23) in Lemma 1, we have

$$\begin{aligned} \mathbb {E}[\Vert w_{k+1}-w_*\Vert ^2]&\le \Vert w_k-w_*\Vert ^2-2\langle Z, \bar{U}\rangle \\&\quad +\Vert (U_s)^{-1}\Vert ^2\frac{4L}{b} \bigl [ F(w_{k})-F(w_{*})+F(\widetilde{w})-F(w_*)\bigr ]\\&\quad +\frac{4L\Vert (U_s)^{-1}\Vert ^2}{b}[F(w_k)-F(w_*)]\\&\le \Vert w_k-w_*\Vert ^2-2\langle Z, \bar{U}\rangle \\&\quad +\frac{4\theta ^2Ld}{m^2\mu ^2b}[ F(w_{k})-F(w_{*})+F(\widetilde{w})-F(w_*)]\\&\quad +\frac{4\theta ^2Ld}{m^2\mu ^2b}[F(w_k)-F(w_*)], \end{aligned}$$
(A.1)

where we set \(Z=[F(w_k)-F(w_*), \ldots , F(w_k)-F(w_*)]^T\) \(\in \mathbb {R}^d\), and in the last inequality, we use the fact \(\frac{\theta }{mL}\textrm{I}\le U^{-1} \le \frac{\theta }{m\mu }\textrm{I}\).

For the inequality (A.1) to hold, it suffices that the following condition is satisfied:

$$\begin{aligned} \mathbb {E}[\Vert w_{k+1}-w_*\Vert ^2]&\le \Vert w_k-w_*\Vert ^2-\frac{2d\theta }{m\mu }[F(w_k)-F(w_*)]\\&\quad +\frac{4\theta ^2Ld}{m^2\mu ^2b}[ F(w_{k})-F(w_{*})+F(\widetilde{w})-F(w_*)]\\&\quad +\frac{4\theta ^2Ld}{m^2\mu ^2b}[F(w_k)-F(w_*)]\\&= \Vert w_k-w_*\Vert ^2-\biggl (\frac{2d\theta }{m\mu }-\frac{8\theta ^2Ld}{m^2\mu ^2b}\biggr )[F(w_k)-F(w_*)]\\&\quad +\frac{4\theta ^2Ld}{m^2\mu ^2b}[F(\widetilde{w})-F(w_*)]. \end{aligned}$$

In addition, according to the definition of \(\widetilde{w}_s\) in Algorithm 1, we have

$$\begin{aligned} \mathbb {E}[F(\widetilde{w}_s)]=\frac{1}{m} \sum _{k=0}^{m-1}\mathbb {E}[F(w_{k+1})]. \end{aligned}$$

Summing the previous inequality over \(k=0, \ldots , m-1\), taking expectation with respect to all the history, and using \(\widetilde{w}_{s}=w_{m}\) in Algorithm 1, we obtain

$$\begin{aligned}&\mathbb {E}[\Vert w_m-w_*\Vert ^2] + \biggl (\frac{2d\theta }{\mu }-\frac{8\theta ^2Ld}{\mu ^2mb}\biggr )\mathbb {E}[F(\widetilde{w}_s)-F(w_*)]\\&\quad \le \mathbb {E}[\Vert w_0-w_*\Vert ^2]+\frac{4\theta ^2Ld}{\mu ^2mb}[F(\widetilde{w})-F(w_*)]\\&\quad = \mathbb {E}[\Vert \widetilde{w}-w_*\Vert ^2]+\frac{4\theta ^2Ld}{\mu ^2mb}[F(\widetilde{w})-F(w_*)]\\&\quad \le \frac{2}{\mu }[F(\widetilde{w})-F(w_*)]+\frac{4\theta ^2Ld}{\mu ^2mb}[F(\widetilde{w})-F(w_*)]\\&\quad = \biggl (\frac{2}{\mu }+\frac{4\theta ^2Ld}{\mu ^2mb}\biggr )[F(\widetilde{w})-F(w_*)], \end{aligned}$$

where the last inequality uses the inequality (22) in Assumption 2.

Further, we have

$$\begin{aligned} \mathbb {E}[F(\widetilde{w}_s)-F(w_*)] \le \biggl (\frac{\mu mb+2\theta ^2Ld}{d\mu \theta mb-4\theta ^2Ld}\biggr )\mathbb {E}[F(\widetilde{w}_{s-1})-F(w_*)]. \end{aligned}$$

Setting \(\rho =\frac{\mu mb+2\theta ^2Ld}{d\mu \theta mb-4\theta ^2Ld}\) yields the desired bound. \(\square \)
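
For intuition only, the contraction factor \(\rho \) of Theorem 1 can be evaluated for sample parameter values; the numbers below are arbitrary illustrative choices, not values used in the paper, and linear convergence requires \(\rho <1\).

```python
# rho = (mu*m*b + 2*theta**2*L*d) / (d*mu*theta*m*b - 4*theta**2*L*d), from Theorem 1.
mu, L, d, b, m, theta = 0.1, 1.0, 100, 16, 2000, 1.0   # illustrative values only
rho = (mu * m * b + 2 * theta**2 * L * d) / (d * mu * theta * m * b - 4 * theta**2 * L * d)
print(rho)   # approximately 0.011 here, so the bound contracts
```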

A.3 Proof of Theorem 2

Proof

Combining \(w_{k+1}=w_k-(U_s)^{-1} v_k\) and Lemma 4, we reach the following conclusion:

$$\begin{aligned} F(w_{k+1})&\le F(w_k)+\langle \nabla F(w_k), w_{k+1}-w_{k}\rangle +\frac{L}{2}\Vert w_{k+1}-w_{k}\Vert ^2\\&= F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}v_k\rangle +\frac{L}{2}\Vert (U_s)^{-1}v_k\Vert ^2\\&\le F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}v_k\rangle +\frac{L}{2}\Vert (U_s)^{-1}\Vert ^2\Vert v_k\Vert ^2\\&= F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}v_k\rangle \\&\quad +\frac{L}{2}\Vert (U_s)^{-1}\Vert ^2\Vert \nabla F_S(w_k)-\nabla F(w_k)+\nabla F(w_k)-\nabla F_S(\tilde{w})+\nabla F(\tilde{w})\Vert ^2, \end{aligned}$$

where the second inequality used the Cauchy–Schwarz inequality and the second equality used the definition \(v_k=\nabla F_S(w_k)-\nabla F_S(\tilde{w})+\nabla F(\tilde{w})\).

Additionally, according to the fact \((a_1+a_2+ \cdots +a_n)^2\le n(a_1^2+a_2^2+ \cdots + a_n^2)\), we have

$$\begin{aligned} \mathbb {E}[F(w_{k+1})]&\le \mathbb {E}\biggl [F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}v_k\rangle \\&\quad +\frac{3L}{2}\Vert (U_s)^{-1}\Vert ^2\bigl [\Vert \nabla F_S(w_k)-\nabla F(w_k)\Vert ^2+\Vert \nabla F(w_k)\Vert ^2+\Vert \nabla F_S(\tilde{w})-\nabla F(\tilde{w})\Vert ^2\bigr ]\biggr ]\\&\le F(w_k)-\frac{\theta d}{m\zeta }\Vert \nabla F(w_k)\Vert ^2+\frac{3L}{2}\cdot \frac{d\theta ^2}{m^2\zeta ^2}\cdot \frac{\sigma ^2}{b}\\&\quad +\frac{3L}{2}\cdot \frac{d\theta ^2}{m^2\zeta ^2}\Vert \nabla F(w_k)\Vert ^2+\frac{3L}{2}\cdot \frac{d\theta ^2}{m^2\zeta ^2}\cdot \frac{\sigma ^2}{b}\\&= F(w_k)-\biggl (\frac{\theta d}{m\zeta }-\frac{3Ld\theta ^2}{2m^2\zeta ^2}\biggr )\Vert \nabla F(w_k)\Vert ^2+\frac{3Ld\theta ^2\sigma ^2}{bm^2\zeta ^2}, \end{aligned}$$
(A.2)

where the second inequality holds due to \(\mathbb {E}[v_k]=\nabla F(w_k)\), the bound on \((U_s)^{-1}\), and Assumption 3.

Summing the inequality (A.2), we arrive at

$$\begin{aligned} \mathbb {E}[F(w_m)]\le F(w_0)-\biggl (\frac{\theta d}{m\zeta }-\frac{3Ld\theta ^2}{2m^2\zeta ^2}\biggr )\sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2+\frac{3Ld\theta ^2\sigma ^2}{bm\zeta ^2}. \end{aligned}$$

Further, we obtain

$$\begin{aligned} \sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2\le \frac{2m^2\zeta ^2}{2m\zeta \theta d-3Ld\theta ^2}\mathbb {E}[F(w_0)-F(w_{m})]+\frac{6mL\theta \sigma ^2}{b(2m\zeta -3L\theta )}. \end{aligned}$$

Since \(\mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]=\frac{1}{m}\sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2\), we deduce

$$\begin{aligned} \mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]\le \frac{2m\zeta ^2}{2m\zeta \theta d-3Ld\theta ^2}\mathbb {E}[F(w_0)-F(w_*)]+\frac{6mL\theta \sigma ^2}{b(2m\zeta -3L\theta )}, \end{aligned}$$
(A.3)

where this inequality used the condition \(w_*=\mathop {\mathrm {arg\,min}} F(w)\).

Note that, according to the inequality (4)

$$\begin{aligned} F(w)\le F(v)+ \langle \nabla F(v), w-v \rangle +\frac{L}{2}\Vert w-v\Vert ^2, \end{aligned}$$

we observe that the right-hand side of the above inequality defines a quadratic function of w, which attains its minimum at the point \(\bar{w}=v-\frac{\nabla F(v)}{L}\).

Practically, to satisfy the above inequality, it is sufficient to satisfy the following condition:

$$\begin{aligned} F(w)&\le F(v)+\langle \nabla F(v), \bar{w}-v \rangle +\frac{L}{2}\Vert \bar{w}-v\Vert ^2\\&= F(v)-\frac{1}{2L}\Vert \nabla F(v)\Vert ^2. \end{aligned}$$

Taking \(w=w_*\) and \(v=w_0\) in the above inequality, we obtain the following result:

$$\begin{aligned} \frac{1}{2L}\Vert \nabla F(w_0)\Vert ^2\le F(w_0)-F(w_*). \end{aligned}$$
(A.4)

Therefore, for the inequality (A.3) to hold, it suffices that the following condition is satisfied:

$$\begin{aligned} \mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]\le \frac{m\zeta ^2}{2Lm\zeta \theta d-3L^2d\theta ^2}\mathbb {E}[\Vert \nabla F(w_0)\Vert ^2]+\frac{6mL\theta \sigma ^2}{b(2m\zeta -3L\theta )}. \end{aligned}$$

According to the definitions \(\tilde{w}_{s+1}=w_m\) and \(w_0=\tilde{w}_s\) in Algorithm 1, we obtain

$$\begin{aligned} \mathbb {E}[\Vert \nabla F(\tilde{w}_{s+1})\Vert ^2]\le \frac{m\zeta ^2}{2Lm\zeta \theta d-3L^2d\theta ^2}\mathbb {E}[\Vert \nabla F(\tilde{w}_s)\Vert ^2]+\frac{6mL\theta \sigma ^2}{b(2m\zeta -3L\theta )}. \end{aligned}$$

Summing the above inequality over \(s=0, 1, \ldots , \hat{S}-1\), we obtain the desired results.

\(\square \)
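
Algorithm 1 itself is not reproduced in this excerpt; to make the quantities appearing in the proofs concrete, the sketch below runs one outer epoch of a variance-reduced update \(w_{k+1}=w_k-(U_s)^{-1}v_k\) with \(v_k=\nabla F_S(w_k)-\nabla F_S(\tilde{w})+\nabla F(\tilde{w})\) on a least-squares problem, using a fixed diagonal \((U_s)^{-1}\) as a placeholder for the DBB metric (the paper's actual DBB rule is not shown here).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, b, m = 1000, 10, 32, 100
X = rng.standard_normal((n, d))
y_tgt = X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

def full_grad(w):                 # nabla F(w) for the least-squares objective
    return X.T @ (X @ w - y_tgt) / n

def batch_grad(w, idx):           # nabla F_S(w) on a mini-batch S of size b
    return X[idx].T @ (X[idx] @ w - y_tgt[idx]) / len(idx)

w_tilde = np.zeros(d)             # snapshot point \tilde{w}
U_inv = 0.01 * np.ones(d)         # diagonal (U_s)^{-1}; fixed placeholder, not the DBB rule
g_tilde = full_grad(w_tilde)      # full gradient at the snapshot
w = w_tilde.copy()
for k in range(m):
    idx = rng.choice(n, size=b, replace=False)
    v_k = batch_grad(w, idx) - batch_grad(w_tilde, idx) + g_tilde  # variance-reduced direction
    w = w - U_inv * v_k           # w_{k+1} = w_k - (U_s)^{-1} v_k with a diagonal metric
w_tilde = w                       # \tilde{w}_s = w_m
print(np.linalg.norm(full_grad(w_tilde)))
```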

Appendix B Proofs for SGD-DBB

B.1 Proof of Theorem 3

Proof

According to \(w_{k+1}=w_k-(U_s)^{-1}\nabla F_S(w_k)\), defined in Algorithm 2, we deduce

$$\begin{aligned} \mathbb {E}[\Vert w_{k+1}-w_*\Vert ^2]&= \mathbb {E}[\Vert w_{k}-(U_s)^{-1}\nabla F_S(w_k)-w_{*}\Vert ^2] \\&=\mathbb {E}[\Vert w_k-w_*\Vert ^2-2\langle w_k-w_*, (U_s)^{-1}\nabla F_S(w_k)\rangle +\Vert (U_s)^{-1}\nabla F_S(w_k)\Vert ^2]\\&\le \mathbb {E}[\Vert w_k-w_*\Vert ^2]-2\langle w_k-w_*, (U_s)^{-1}\nabla F(w_k)\rangle +\mathbb {E}[\Vert (U_s)^{-1}\Vert ^2\Vert \nabla F_S(w_k)\Vert ^2]\\&\le \mathbb {E}[\Vert w_k-w_*\Vert ^2]-2\langle w_k-w_*, (U_s)^{-1}\nabla F(w_k)\rangle +2L\Vert (U_s)^{-1}\Vert ^2[F(w_k)-F(w_*)], \end{aligned}$$
(B.1)

where the first inequality holds due to \(\mathbb {E}[\nabla F_S(w_k)]=\nabla F(w_k)\) and the Cauchy–Schwarz inequality. The second inequality uses the conclusion (23) in Lemma 1.

To satisfy the inequality (B.1), we only require

$$\begin{aligned} \mathbb {E}[\Vert w_{k+1}-w_*\Vert ^2]&\le \mathbb {E}[\Vert w_k-w_*\Vert ^2]-\frac{2\theta d}{m\mu }[F(w_k)-F(w_*)]+\frac{2Ld\theta ^2}{m^2\mu ^2}[F(w_k)-F(w_*)]\\&= \mathbb {E}[\Vert w_k-w_*\Vert ^2]-\biggl (\frac{2\theta d}{m\mu }-\frac{2Ld \theta ^2}{m^2\mu ^2} \biggr )[F(w_k)-F(w_*)], \end{aligned}$$

where in the first inequality we used the convexity of the objective function, i.e., \(F(v)\ge F(w)+\langle \nabla F(w), v-w\rangle \), \(w_*=\mathop {\mathrm {arg\,min}}\nolimits _{w\in \mathbb {R}^d} F(w)\), and \(U^{-1}\le \frac{\theta }{m\mu }\textrm{I}\).

By summing the above inequality over \(k=0, 1, \ldots , m-1\), we obtain

$$\begin{aligned} \mathbb {E}[\Vert \tilde{w}_{s+1}-w_*\Vert ^2] \le \mathbb {E}[\Vert w_{0}-w_*\Vert ^2]-\biggl (\frac{2\theta d}{m\mu }-\frac{2Ld \theta ^2}{m^2\mu ^2} \biggr )\sum _{k=0}^{m-1}[F(w_k)-F(w_*)], \end{aligned}$$

where this inequality used \(\tilde{w}_{s+1}=w_m\), denoted in Algorithm 2.

Since \(\mathbb {E}[F(\tilde{w}_s)]=\frac{1}{m}\sum _{k=0}^{m-1}\mathbb {E}[F(w_k)]\), taking expectation on both sides we obtain

$$\begin{aligned} \mathbb {E}[\Vert \tilde{w}_{s+1}-w_*\Vert ^2]+ \biggl (\frac{2\theta d}{\mu }-\frac{2Ld \theta ^2}{m\mu ^2} \biggr )\mathbb {E}[F(\tilde{w}_s)-F(w_*)] \le \mathbb {E}[\Vert w_{0}-w_*\Vert ^2]. \end{aligned}$$

Further, we have

$$\begin{aligned} \mathbb {E}[F(\tilde{w}_{s+1})-F(w_*)]&\le \frac{m\mu ^2}{2m\mu d\theta -2Ld\theta ^2}\mathbb {E}[\Vert w_{0}-w_*\Vert ^2]\\&= \frac{m\mu ^2}{2m\mu d\theta -2Ld\theta ^2}\mathbb {E}[\Vert \tilde{w}_{s}-w_*\Vert ^2]\\&\le \frac{m\mu }{md\mu \theta -Ld\theta ^2}\mathbb {E}[F(\tilde{w}_s)-F(w_*)], \end{aligned}$$

where the equality holds due to the definition \(w_0=\tilde{w}_s\), and the second inequality holds because of Assumption 2 (i.e., (22)).

Finally, applying the above inequality recursively, we obtain

$$\begin{aligned} \mathbb {E}[F(\tilde{w}_{s})-F(w_*)]\le \tau ^{s} \mathbb {E}[F(\tilde{w}_0)-F(w_*)], \end{aligned}$$

where we set \(\tau =\frac{m\mu }{md\mu \theta -Ld\theta ^2}\). \(\square \)

B.2 Proof of Theorem 4

Proof

Combining the fact \(w_{k+1}=w_k-(U_s)^{-1}\nabla F_S(w_k)\) and Lemma 4, we have

$$\begin{aligned} F(w_{k+1})&\le F(w_k)+\langle \nabla F(w_k), w_{k+1}-w_{k}\rangle +\frac{L}{2}\Vert w_{k+1}-w_{k}\Vert ^2 \\&= F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}\nabla F_S(w_k)\rangle +\frac{L}{2}\Vert (U_s)^{-1}\nabla F_S(w_k)\Vert ^2\\&\le F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}\nabla F_S(w_k)\rangle +\frac{L}{2}\Vert (U_s)^{-1}\Vert ^2\Vert \nabla F_S(w_k)\Vert ^2\\&= F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}\nabla F_S(w_k)\rangle +\frac{L}{2}\Vert (U_s)^{-1}\Vert ^2\Vert \nabla F_S(w_k)-\nabla F(w_k)+\nabla F(w_k)\Vert ^2\\&\le F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}\nabla F_S(w_k)\rangle +L\Vert (U_s)^{-1}\Vert ^2[\Vert \nabla F_S(w_k)-\nabla F(w_k)\Vert ^2+\Vert \nabla F(w_k)\Vert ^2], \end{aligned}$$
(B.2)

where the second inequality holds due to the Cauchy–Schwarz inequality and the last inequality holds because of \(\Vert a+b\Vert ^2\le 2(\Vert a\Vert ^2+\Vert b\Vert ^2)\).

By virtue of Assumption 3, taking expectation on both sides of (B.2), we obtain

$$\begin{aligned} \mathbb {E}[F(w_{k+1})]&\le \mathbb {E}\bigl [ F(w_k)-\langle \nabla F(w_k), (U_s)^{-1}\nabla F_S(w_k)\rangle +L\Vert (U_s)^{-1}\Vert ^2[\Vert \nabla F_S(w_k)-\nabla F(w_k)\Vert ^2+\Vert \nabla F(w_k)\Vert ^2]\bigr ] \\&\le F(w_k)-\frac{\theta d}{m\zeta }\Vert \nabla F(w_k)\Vert ^2+\frac{L\theta ^2d}{m^2\zeta ^2}\cdot \frac{\sigma ^2}{b}+\frac{Ld\theta ^2}{m^2\zeta ^2}\Vert \nabla F(w_k)\Vert ^2\\&= F(w_k)-\biggl (\frac{\theta d}{m\zeta }-\frac{L\theta ^2d}{m^2\zeta ^2}\biggr )\Vert \nabla F(w_k)\Vert ^2+\frac{Ld\theta ^2\sigma ^2}{bm^2\zeta ^2}, \end{aligned}$$
(B.3)

where the second inequality used the conditions \(\mathbb {E}[\nabla F_S(w_k)]=\nabla F(w_k)\) and the bound on \((U_s)^{-1}\).

By summing the inequality (B.3) over \(k=0, 1, \ldots , m-1\), we obtain the following result:

$$\begin{aligned} \mathbb {E}[F(w_{m})]\le \mathbb {E}[F(w_0)]-\biggl (\frac{\theta d}{m\zeta }-\frac{L\theta ^2d}{m^2\zeta ^2}\biggr )\sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2+\frac{Ld\theta ^2\sigma ^2}{bm\zeta ^2}. \end{aligned}$$
(B.4)

Further, according to (B.4), we arrive at

$$\begin{aligned} \sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2\le \frac{m^2\zeta ^2}{m\zeta \theta d-Ld\theta ^2}\mathbb {E}[F(w_0)-F(w_m)]+\frac{Lm\theta \sigma ^2}{b(m\zeta -L\theta )}. \end{aligned}$$

Since \(\mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]=\frac{1}{m}\sum _{k=0}^{m-1}\Vert \nabla F(w_k)\Vert ^2\), we further derive

$$\begin{aligned} \mathbb {E}[\Vert \nabla F(w_m)\Vert ^2]&\le \frac{m\zeta ^2}{m\zeta \theta d-Ld\theta ^2}\mathbb {E}[F(w_0)-F(w_m)]+\frac{L\theta \sigma ^2}{b(m\zeta -L\theta )}\\&\le \frac{m\zeta ^2}{m\zeta \theta d-Ld\theta ^2}\mathbb {E}[F(w_0)-F(w_*)]+\frac{L\theta \sigma ^2}{b(m\zeta -L\theta )}, \end{aligned}$$

where the second inequality uses the condition \(w_*=\mathop {\mathrm {arg\,min}} F(w)\).

In addition, according to \(w_0=\tilde{w}_s\) and \(\tilde{w}_{s+1}=w_m\), we deduce

$$\begin{aligned} \mathbb {E}[\Vert \nabla F(\tilde{w}_{s+1})\Vert ^2] \le \frac{m\zeta ^2}{m\zeta \theta d-Ld\theta ^2}\mathbb {E}[F(\tilde{w}_s)-F(w_*)]+\frac{L\theta \sigma ^2}{b(m\zeta -L\theta )}. \end{aligned}$$
(B.5)

In order to satisfy (B.5), we only require that the following condition holds:

$$\begin{aligned} \mathbb {E}[\Vert \nabla F(\tilde{w}_{s+1})\Vert ^2] \le \frac{m\zeta ^2}{2Lm\zeta \theta d-2L^2d\theta ^2}\mathbb {E}[\Vert \nabla F(\tilde{w}_s)\Vert ^2]+\frac{L\theta \sigma ^2}{b(m\zeta -L\theta )}, \end{aligned}$$
(B.6)

where this inequality used the conclusion in (A.4).

Finally, summing the inequality (B.6) over \(s=0, 1, \ldots , \hat{S}-1\), we obtain the desired results. \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, Z., Ma, L. Adaptive step size rules for stochastic optimization in large-scale learning. Stat Comput 33, 45 (2023). https://doi.org/10.1007/s11222-023-10218-2

  • DOI: https://doi.org/10.1007/s11222-023-10218-2
