
Stochastic variance reduced gradient with hyper-gradient for non-convex large-scale learning


Abstract

Non-convex optimization, which can better capture problem structure, has received considerable attention in machine learning, image/signal processing, statistics, and related applications. Owing to their faster convergence rates, stochastic variance reduced algorithms for solving such non-convex optimization problems have been studied extensively. However, how to select an appropriate step size, a crucial hyper-parameter for stochastic variance reduced algorithms, remains under-explored in the non-convex setting. To address this gap, we propose a new class of stochastic variance reduced algorithms based on the hyper-gradient, which automatically obtains the step size online. Specifically, we focus on a representative variance-reduced stochastic optimization algorithm, the stochastic variance reduced gradient (SVRG) algorithm, which computes a full gradient periodically. We analyze the convergence of the proposed algorithm for non-convex optimization problems theoretically. Moreover, we show that the proposed algorithm enjoys the same complexity as state-of-the-art algorithms for non-convex problems in terms of finding an approximate stationary point. Thorough numerical results on empirical risk minimization with non-convex loss functions validate the efficacy of our method.
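To make the idea concrete, the sketch below is a minimal illustration, not the paper's MSVRG-HD implementation; all function names, the toy loss, and the parameter values are assumptions for this example. It combines an SVRG-style inner loop (with a periodic full gradient) with a hyper-gradient update of the step size, in the spirit of hyper-gradient descent.

import numpy as np

def svrg_hypergrad(grad_i, w0, n, epochs=10, m=50, eta=0.05, beta=1e-4, seed=0):
    # Illustrative SVRG loop whose step size eta is adapted online with a
    # hyper-gradient update: eta grows when consecutive variance-reduced
    # gradients are positively correlated and shrinks otherwise.
    rng = np.random.default_rng(seed)
    w_snap = w0.copy()
    for _ in range(epochs):
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)  # periodic full gradient
        w, prev_est = w_snap.copy(), None
        for _ in range(m):
            i = rng.integers(n)                                  # uniform sampling with replacement
            est = grad_i(w, i) - grad_i(w_snap, i) + full_grad   # variance-reduced gradient estimator
            if prev_est is not None:
                eta = max(eta + beta * float(est @ prev_est), 1e-8)  # online (hyper-gradient) step size
            w -= eta * est
            prev_est = est
        w_snap = w
    return w_snap

# Toy usage: empirical risk minimization with the non-convex sigmoid loss 1/(1+exp(y_i x_i^T w)).
rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 5)), np.sign(rng.normal(size=200))
def grad_i(w, i):
    u = np.exp(-y[i] * (X[i] @ w))
    return -y[i] * X[i] * u / (1.0 + u) ** 2
w_out = svrg_hypergrad(grad_i, np.zeros(5), n=200)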


Data Availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. For a function \(y=f(u)\), where \(u=\phi (x)\), the derivative of \(y\) with respect to \(x\) is given by the chain rule: \(\frac{\partial y}{\partial x}=\frac{\partial y}{\partial \phi }\cdot \frac{\partial \phi }{\partial x}\); an illustration of how this is used for the step size appears after these notes.

  2. All data sets are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
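As a brief illustration (following the general hyper-gradient descent idea of Baydin et al., not a verbatim derivation from this paper), applying the chain rule of Note 1 to the objective value at \(w_k=w_{k-1}-\eta \nabla F(w_{k-1})\), viewed as a function of the step size \(\eta \) with \(w_{k-1}\) treated as fixed, gives the hyper-gradient that drives the online step-size update:

$$\begin{aligned} \frac{\partial F(w_k)}{\partial \eta }=\nabla F(w_k)^{\top }\frac{\partial \bigl (w_{k-1}-\eta \nabla F(w_{k-1})\bigr )}{\partial \eta }=-\nabla F(w_k)^{\top }\nabla F(w_{k-1}), \end{aligned}$$

so a gradient step on \(\eta \) increases the step size when consecutive gradients are positively correlated and decreases it otherwise.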


Acknowledgements

This work was supported by the China Postdoctoral Science Foundation under Grant 2019M663238. This work was also partially supported by a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Author information


Contributions

Zhuang Yang: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing - original draft, Funding acquisition.

Corresponding author

Correspondence to Zhuang Yang.

Ethics declarations

Conflicts of interest

Author Zhuang Yang declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Additional informed consent was obtained from all individual participants for whom identifying information is included in this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Proofs for MSVRG-HD

1.1 A.1 Proof of Lemma 1

Proof

According to Algorithm 1 and the fact that \(F\) has a \(\sigma \)-bounded gradient, we have

$$\begin{aligned}&\mathbb {E}[\eta _k] \le |\eta _0|+\mathbb {E}\biggl [\beta \sum _{i=1}^{m-1} |\nabla F_{\hat{S}}(w_{i+1}^{s+1})\cdot \nabla F_{\hat{S}}(w_i^{s+1})|\biggr ] \\&\le |\eta _0|+\beta \sum _{i=1}^{m-1}\Vert \nabla F(w_{i+1}^{s+1})\Vert \Vert \nabla F(w_i^{s+1})\Vert \le \eta _0+m \beta \sigma ^2. \end{aligned}$$

Notice that the above inequality holds because each sample \(i\) is chosen independently and uniformly at random (with replacement) from \([n]\). In other words, the resulting algorithm (MSVRG-HD) uses an unbiased estimator of the gradient at every iteration. Additionally, the evaluation of the step size \(\eta _k\) in Algorithm 1 can use different mini-batches, drawn uniformly at random from \([n]\), for \(\nabla F_{\hat{S}}(w_{i+1}^{s+1})\) and \(\nabla F_{\hat{S}}(w_{i}^{s+1})\). Because of this uniform sampling with replacement, the inequality above holds. \(\square \)
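As a small numerical sanity check (not part of the original proof), the snippet below simulates additive step-size updates driven by gradients whose norms are capped at \(\sigma \) and verifies the bound \(\eta _0+m\beta \sigma ^2\) of Lemma 1; the simulation setup and parameter values are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
d, m, sigma, eta0, beta = 10, 50, 2.0, 0.1, 1e-3

def bounded_grad():
    # A random vector whose norm never exceeds sigma (the sigma-bounded-gradient assumption).
    g = rng.normal(size=d)
    return g * min(1.0, sigma / np.linalg.norm(g))

eta, g_old = eta0, bounded_grad()
for _ in range(m - 1):
    g_new = bounded_grad()
    eta += beta * abs(float(g_new @ g_old))   # worst-case (absolute) hyper-gradient update
    g_old = g_new

assert eta <= eta0 + m * beta * sigma ** 2    # the upper bound established in Lemma 1
print(f"eta = {eta:.4f} <= {eta0 + m * beta * sigma ** 2:.4f}")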

1.2 A.2 Proof of Lemma 2

Proof

From (3) and \(w_{k+1}^{s+1}=w_{k}^{s+1}-\eta _{k+1} \phi _{k}^{s+1}\) in Algorithm 1, we have

$$\begin{aligned} \mathbb {E}[F(w_{k+1}^{s+1})]&\le \mathbb {E}\biggl [F(w_{k}^{s+1})+\langle \nabla F(w_{k}^{s+1}),\, w_{k+1}^{s+1}-w_{k}^{s+1}\rangle +\frac{L}{2} \Vert w_{k+1}^{s+1}-w_{k}^{s+1}\Vert ^2 \biggr ] \\ &=\mathbb {E}\biggl [F(w_k^{s+1})-\eta _{k+1}\langle \nabla F(w_k^{s+1}),\, \phi _{k}^{s+1}\rangle +\frac{L}{2}\Vert \eta _{k+1}\phi _{k}^{s+1}\Vert ^2\biggr ] \\ &=\mathbb {E}\biggl [F(w_k^{s+1})+\frac{L\eta _{k+1}^2}{2}\Vert \phi _{k}^{s+1}\Vert ^2\biggr ]-\mathbb {E}\left[ \eta _{k+1}\right] \Vert \nabla F(w_k^{s+1})\Vert ^2, \end{aligned}$$
(14)

where the last equality holds due to \(\mathbb {E}[\phi _{k}^{s+1}]=\nabla F(w_k^{s+1})\).
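The inequality (3) invoked above is the standard \(L\)-smoothness bound, which supplies the first line of (14):

$$\begin{aligned} F(y)\le F(x)+\langle \nabla F(x), y-x\rangle +\frac{L}{2}\Vert y-x\Vert ^2 \quad \text {for all } x, y. \end{aligned}$$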

We emphasize here that the computation of the step size \(\eta _k\) in (5) can use different samples, drawn uniformly at random from \([n]\), for the two gradient evaluations in (5). Consequently, because the samples are drawn uniformly at random (with replacement) from \([n]\), (14) holds; the argument is similar to that in the proof of Lemma 1.

Now we consider the Lyapunov function, i.e.,

$$\begin{aligned} R_k^{s+1}= \mathbb {E}[F(w_k^{s+1})+c_k\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2]. \end{aligned}$$

To bound it, we first derive the following:

$$\begin{aligned} \mathbb {E}[\Vert w_{k+1}^{s+1}-\widetilde{w}^s\Vert ^2]&=\mathbb {E}[\Vert w_{k+1}^{s+1}-w_k^{s+1}+w_k^{s+1}-\widetilde{w}^s\Vert ^2] \\ &=\mathbb {E}[\Vert w_{k+1}^{s+1}-w_k^{s+1}\Vert ^2+\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2+2\langle w_{k+1}^{s+1}-w_k^{s+1},\, w_k^{s+1}-\widetilde{w}^s\rangle ] \\ &=\mathbb {E}[\Vert w_{k+1}^{s+1}-w_k^{s+1}\Vert ^2+\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2+2\langle -\eta _{k+1}\phi _k^{s+1},\, w_k^{s+1}-\widetilde{w}^s\rangle ] \\ &=\mathbb {E}[\eta _{k+1}^2\Vert \phi _k^{s+1}\Vert ^2+\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2]-2\,\mathbb {E}[\eta _{k+1}]\langle \nabla F(w_k^{s+1}),\, w_{k}^{s+1}-\widetilde{w}^s\rangle \\ &\le \mathbb {E}[\eta _{k+1}^2\Vert \phi _k^{s+1}\Vert ^2+\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2]+2\,\mathbb {E}[\eta _{k+1}]\biggl (\frac{1}{2\alpha _k}\Vert \nabla F(w_k^{s+1})\Vert ^2+\frac{\alpha _k}{2}\mathbb {E}\left[ \Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2\right] \biggr ), \end{aligned}$$
(15)

where in the third equality we used \(w_{k+1}^{s+1}=w_{k}^{s+1}-\eta _{k+1} \phi _{k}^{s+1}\), in the fourth equality we used \(\mathbb {E}[\phi _{k}^{s+1}]=\nabla F(w_k^{s+1})\), and in the last inequality we used the Cauchy-Schwarz and Young inequalities. Here, we also used the hypothesis that the samples are chosen independently from \(\{1, \ldots ,n\}\) with replacement.
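For clarity, the Cauchy-Schwarz and Young inequalities invoked in the last step combine, for \(a=\nabla F(w_k^{s+1})\), \(b=w_k^{s+1}-\widetilde{w}^s\) and any \(\alpha _k>0\), as

$$\begin{aligned} -2\langle a, b\rangle \le 2\Vert a\Vert \Vert b\Vert \le \frac{1}{\alpha _k}\Vert a\Vert ^2+\alpha _k\Vert b\Vert ^2. \end{aligned}$$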

Substituting (14) and (15) into \(R_{k+1}^{s+1}\), we obtain the following bound:

$$\begin{aligned} R_{k+1}^{s+1}&\le \mathbb {E}\biggl [F(w_k^{s+1})-\eta _{k+1}\Vert \nabla F(w_k^{s+1})\Vert ^2+\frac{L\eta _{k+1}^2}{2}\Vert \phi _{k}^{s+1}\Vert ^2\biggr ] \\ &\quad +\mathbb {E}[c_{k+1}\eta _{k+1}^2\Vert \phi _k^{s+1}\Vert ^2+c_{k+1}\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2] \\ &\quad +2c_{k+1}\mathbb {E}[\eta _{k+1}]\biggl (\frac{1}{2\alpha _k}\Vert \nabla F(w_k^{s+1})\Vert ^2+\frac{\alpha _k}{2}\mathbb {E}\left[ \Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2\right] \biggr ) \\ &\le \mathbb {E}\left[ F(w_k^{s+1})\right] -\mathbb {E}\biggl [\eta _{k+1}-\frac{c_{k+1}\eta _{k+1}}{\alpha _k}\biggr ]\Vert \nabla F(w_k^{s+1})\Vert ^2 \\ &\quad +\mathbb {E}\biggl [\biggl (\frac{L\eta _{k+1}^2}{2}+c_{k+1}\eta _{k+1}^2\biggr )\Vert \phi _{k}^{s+1}\Vert ^2\biggr ] \\ &\quad +\mathbb {E}[c_{k+1}+c_{k+1}\eta _{k+1}\alpha _k]\,\mathbb {E}[\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2]. \end{aligned}$$

According to the upper bound on \(\Vert \phi _{k}^{s+1}\Vert ^2\) given in Lemma 3, we obtain

$$\begin{aligned} R_{k+1}^{s+1}&\le \mathbb {E}[F(w_k^{s+1})]-\mathbb {E}\biggl [\eta _{k+1}-\frac{c_{k+1}\eta _{k+1}}{\alpha _k}\biggr ]\Vert \nabla F(w_k^{s+1})\Vert ^2 \\ &\quad +\mathbb {E}\biggl [\frac{L\eta _{k+1}^2}{2}+c_{k+1}\eta _{k+1}^2\biggr ]\biggl ( 2\mathbb {E}[\Vert \nabla F(w_k^{s+1})\Vert ^2]+\frac{2L^2}{b}\mathbb {E}[\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2]\biggr ) \\ &\quad +\mathbb {E}[c_{k+1}+c_{k+1}\eta _{k+1}\alpha _k]\,\mathbb {E}[\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2] \\ &=\mathbb {E}[F(w_k^{s+1})]-\mathbb {E}\biggl [\eta _{k+1}-\frac{c_{k+1}\eta _{k+1}}{\alpha _k}-L\eta _{k+1}^2-2c_{k+1}\eta _{k+1}^2\biggr ]\Vert \nabla F(w_k^{s+1})\Vert ^2 \\ &\quad +\mathbb {E}\biggl [c_{k+1}+c_{k+1}\eta _{k+1}\alpha _k+\frac{L^3\eta _{k+1}^2}{b}+\frac{2L^2c_{k+1}\eta _{k+1}^2}{b}\biggr ]\mathbb {E}[\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2]. \end{aligned}$$

Further, utilizing Lemma 1, we have

$$\begin{aligned} R_{k+1}^{s+1}&\le \mathbb {E}[F(w_k^{s+1})]-\biggl (Q_m-\frac{c_{k+1}Q_m}{\alpha _k}-LQ_m^2-2c_{k+1}Q_m^2\biggr )\Vert \nabla F(w_k^{s+1})\Vert ^2 \\ &\quad +\biggl (c_{k+1}+c_{k+1}Q_m\alpha _k+\frac{L^3Q_m^2}{b}+\frac{2L^2c_{k+1}Q_m^2}{b}\biggr )\mathbb {E}[\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2] \\ &=\mathbb {E}[F(w_k^{s+1})]+c_k\mathbb {E}[\Vert w_k^{s+1}-\widetilde{w}^s\Vert ^2] \\ &\quad -\biggl (Q_m-\frac{c_{k+1}Q_m}{\alpha _k}-LQ_m^2-2c_{k+1}Q_m^2\biggr )\Vert \nabla F(w_k^{s+1})\Vert ^2, \end{aligned}$$

where the equality holds due to Lemma 2.

This completes the proof of Lemma 2.

Notice that to make \({\varGamma }_{k, m}>0\), we only require \(2c_{k+1}Q_m+LQ_m+\frac{c_{k+1}}{\alpha _k}<1\). When \(c_k\), \(Q_m\) and \(\alpha _k\) are chosen from \((0, 1)\), this condition is easy to satisfy. From the definition of \(Q_m\), we can choose the parameters \(\eta _0\) and \(\beta \) small enough to make \(Q_m\) small enough; likewise, from the definition of \(c_k\), we can make \(c_k\) small enough. In particular, if we additionally set \(c_{k+1}\ll \alpha _k\), then \(2c_{k+1}Q_m+LQ_m+\frac{c_{k+1}}{\alpha _k}\ll 1\). Therefore, the condition \({\varGamma }_{k, m}>0\) is satisfied by choosing appropriate parameters \(c_k\), \(Q_m\) and \(\alpha _k\). \(\square \)
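As a purely hypothetical numerical illustration (the values are not taken from the paper), choosing \(L=1\), \(Q_m=0.1\), \(c_{k+1}=0.01\) and \(\alpha _k=0.5\) gives

$$\begin{aligned} 2c_{k+1}Q_m+LQ_m+\frac{c_{k+1}}{\alpha _k}=0.002+0.1+0.02=0.122<1, \end{aligned}$$

so the condition \({\varGamma }_{k, m}>0\) is comfortably satisfied in this regime.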

1.3 A.3 Proof of Theorem 1

Proof

Combining Lemma 2 with \(\gamma _m=\min _k {\varGamma }_{k, m}\) and summing over \(k=0, \ldots , m-1\), we have

$$\begin{aligned} \sum _{k=0}^{m-1} \mathbb {E}\left[ \Vert \nabla F(w_k^{s+1})\Vert ^2\right] \le \frac{R_0^{s+1}-R_m^{s+1}}{\gamma _m} \end{aligned}$$

The above inequality implies that

$$\begin{aligned} \sum _{k=0}^{m-1} \mathbb {E}\left[ \Vert \nabla F(w_k^{s+1})\Vert ^2\right] \le \frac{\mathbb {E}[F(\widetilde{w}^s)-F(\widetilde{w}^{s+1})]}{\gamma _m}, \end{aligned}$$
(16)

where we used the facts that \(R_m^{s+1}=\mathbb {E}[F(w_m^{s+1})]=\mathbb {E}[F(\widetilde{w}^{s+1})]\) (since \(c_m=0\)) and that \(R_0^{s+1}=\mathbb {E}[F(\widetilde{w}^s)]\) (since \(w_0^{s+1}=\widetilde{w}^s\)).

By summing over all epochs, we have

$$\begin{aligned} \frac{1}{T} \sum _{s=0}^{S-1}\sum _{k=0}^{m-1} \mathbb {E}\left[ \Vert \nabla F(w_k^{s+1})\Vert ^2\right] \le \frac{F(w^0)-F(w^*)}{T\gamma _m}. \end{aligned}$$
(17)

The above inequality uses the fact that \(\widetilde{w}^0=w^0\) and that \(F(\widetilde{w}^{S})\ge F(w^*)\), where \(w^*\) denotes a minimizer of \(F\); the telescoping step is spelled out below. Thus, we complete the proof of Theorem 1. Note that, although the output of our algorithm is the last iterate \(w_{m}^{s+1}\) of the inner loop, rather than an iterate sampled uniformly at random from the set \(\{w_k^s\}\) for \(s=0, \ldots ,S-1\) and \(k=0, \ldots ,m-1\), the relation in (17) can still be used in our proof. Indeed, many studies have pointed out that these two ways of choosing the output iterate achieve almost the same numerical performance on many problems.
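For completeness, the telescoping step behind (17) reads (with \(w^*\) a minimizer of \(F\), so that \(F(\widetilde{w}^{S})\ge F(w^*)\))

$$\begin{aligned} \sum _{s=0}^{S-1}\sum _{k=0}^{m-1} \mathbb {E}\left[ \Vert \nabla F(w_k^{s+1})\Vert ^2\right] \le \frac{1}{\gamma _m}\sum _{s=0}^{S-1}\mathbb {E}\left[ F(\widetilde{w}^s)-F(\widetilde{w}^{s+1})\right] =\frac{\mathbb {E}\left[ F(\widetilde{w}^0)-F(\widetilde{w}^{S})\right] }{\gamma _m}\le \frac{F(w^0)-F(w^*)}{\gamma _m}, \end{aligned}$$

and dividing both sides by \(T\) yields (17).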

Further, considering Assumption 1, i.e., \(\Vert \nabla f_i(w)-\nabla f_i(v)\Vert \le L\Vert w-v\Vert \), we have

$$\begin{aligned} \Vert \nabla F(\tilde{w}^S)-\nabla F(w^*)\Vert ^2\le L^2\Vert \tilde{w}^S-w^*\Vert ^2. \end{aligned}$$

To satisfy the result of Theorem 1, \(\mathbb {E}[\Vert \nabla F(\widetilde{w}^{S})\Vert ^2]\le \frac{F(w^{0})-F(w^*)}{T\gamma _m}\), it suffices that the following condition holds:

$$\begin{aligned} \mathbb {E}[L^2\Vert \tilde{w}^S-w^*\Vert ^2]\le \frac{F(w^{0})-F(w^*)}{T\gamma _m}. \end{aligned}$$

Therefore, the conclusion in Theorem 1 can be rewritten as

$$\begin{aligned} \mathbb {E}[\Vert \tilde{w}^S-w^*\Vert ^2]\le \frac{F(w^{0})-F(w^*)}{L^2T\gamma _m}. \end{aligned}$$

\(\square \)


About this article


Cite this article

Yang, Z. Stochastic variance reduced gradient with hyper-gradient for non-convex large-scale learning. Appl Intell 53, 28627–28641 (2023). https://doi.org/10.1007/s10489-023-05025-1
