Abstract
Non-convex optimization, which can better capture problem structure, has received considerable attention in machine learning, image/signal processing, statistics, and related applications. Owing to their faster convergence rates, stochastic variance reduced algorithms for solving such non-convex optimization problems have been studied extensively. However, how to select an appropriate step size, a crucial hyper-parameter for stochastic variance reduced algorithms, remains under-explored for non-convex optimization problems. To address this gap, we propose a new class of stochastic variance reduced algorithms based on the hyper-gradient, which automatically obtain the step size online. Specifically, we focus on the stochastic variance reduced gradient (SVRG) algorithm, which computes a full gradient periodically. We theoretically analyze the convergence of the proposed algorithm for non-convex optimization problems. Moreover, we show that the proposed algorithm enjoys the same complexity as state-of-the-art algorithms for non-convex problems in terms of finding an approximate stationary point. Thorough numerical results on empirical risk minimization with non-convex loss functions validate the efficacy of our method.
Data Availability
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.
Notes
For a function \(y=f(u)\), where \(u=\phi (x)\), the derivative of \(y\) with respect to \(x\) is given by the chain rule: \(\frac{\partial y}{\partial x}=\frac{\partial y}{\partial \phi }\cdot \frac{\partial \phi }{\partial x}\).
All data sets are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Acknowledgements
This work was supported by a grant from the China Postdoctoral Science Foundation under Grant 2019M663238, and was partially supported by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Author information
Authors and Affiliations
Contributions
Zhuang Yang: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing - original draft, Funding acquisition.
Corresponding author
Ethics declarations
Conflicts of interest
Author Zhuang Yang declares that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Additional informed consent was obtained from all individual participants for whom identifying information is included in this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Proofs for MSVRG-HD
1.1 A.1 Proof of Lemma 1
Proof
According to Algorithm 1 and the assumption that F has a \(\sigma \)-bounded gradient, we have
Notice that the above inequality holds because we choose the sample i by drawing uniformly at random (with replacement) from [n]. In other words, the resulting algorithm (MSVRG-HD) uses an unbiased estimator of the gradient at each iteration. Additionally, the evaluation of the step size \(\eta _k\) in Algorithm 1 can use different batches of samples, drawn uniformly at random from [n], for \(\nabla F_{\hat{S}}(w_{i+1}^{s+1})\) and \(\nabla F_{\hat{S}}(w_{i}^{s+1})\). Because the sampling is uniform with replacement, the above inequality holds. \(\square \)
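The unbiasedness claimed above can be checked numerically. The sketch below (illustrative only; it assumes a least-squares finite sum \(f_i(w)=\frac{1}{2}(a_i^\top w - b_i)^2\), and the names `grad_i`, `full_grad` are ours, not the paper's) averages the SVRG estimator over all components drawn uniformly and recovers the full gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(w, i):
    # Gradient of the i-th least-squares component f_i(w) = 0.5*(a_i^T w - b_i)^2.
    return A[i] * (A[i] @ w - b[i])

def full_grad(w):
    # Full gradient of F(w) = (1/n) * sum_i f_i(w).
    return A.T @ (A @ w - b) / n

w = rng.standard_normal(d)       # current iterate
w_snap = rng.standard_normal(d)  # snapshot point of the outer loop

# SVRG estimator: phi_i = grad_i(w) - grad_i(w_snap) + full_grad(w_snap).
# Averaging phi_i over i (uniform, with replacement) gives full_grad(w),
# i.e. the estimator is unbiased.
phi_mean = np.mean(
    [grad_i(w, i) - grad_i(w_snap, i) for i in range(n)], axis=0
) + full_grad(w_snap)

assert np.allclose(phi_mean, full_grad(w))
```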
1.2 A.2 Proof of Lemma 2
Proof
From (3) and \(w_{k+1}^{s+1}=w_{k}^{s+1}-\eta _{k+1} \phi _{k}^{s+1}\) in Algorithm 1, we have
where the last equality holds due to \(\mathbb {E}[\phi _{k}^{s+1}]=\nabla F(w_k^{s+1})\).
We emphasize here that the computation of the step size \(\eta _k\) in (5) can use different samples, drawn uniformly at random from [n], for \(\nabla f_i(w_{k-1})\) and \(\nabla f_i(w_{k-2})\). As a consequence, since the samples are drawn uniformly at random (with replacement) from [n], (14) holds, by an argument similar to the proof of Lemma 1.
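For intuition on the online step-size rule discussed here, the following minimal sketch implements a hypergradient update in the style of Baydin et al. (2018), \(\eta _k = \eta _{k-1} + \beta \langle g_{k-1}, g_{k-2}\rangle \), where \(g\) denotes a stochastic gradient estimate. The function name and the default values of `eta0` and `beta` are illustrative, not the exact rule (5) of Algorithm 1:

```python
import numpy as np

def hypergradient_steps(grads, eta0=0.1, beta=1e-3):
    """Hypergradient-style step sizes: eta grows when consecutive
    stochastic gradients align, shrinks when they oppose."""
    etas = [eta0]
    for k in range(1, len(grads)):
        etas.append(etas[-1] + beta * float(np.dot(grads[k], grads[k - 1])))
    return etas

# Aligned gradients increase the step size; opposing ones decrease it.
g = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
etas = hypergradient_steps(g)
# etas is approximately [0.1, 0.101, 0.1]
```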
Now we consider the Lyapunov function, i.e.,
In order to bound it, we provide the following
where in the third equality we used \(w_{k+1}^{s+1}=w_{k}^{s+1}-\eta _{k+1} \phi _{k}^{s+1}\), in the fourth equality we used \(\mathbb {E}[\phi _{k}^{s+1}]=\nabla F(w_k^{s+1})\), and in the last inequality we used the Cauchy-Schwarz and Young inequalities. Here, we also used the hypothesis that the samples are chosen from \(\{1, \ldots ,n\}\) independently with replacement.
Substituting (14) and (15) into \(R_{k+1}^{s+1}\), we obtain the following bound:
According to the upper bound on \(\phi _{k}^{s+1}\), i.e., Lemma 3, we ascertain that
Further, utilizing Lemma 1, we have
where the first equality holds due to Lemma 2.
This completes the proof of Lemma 2.
Notice that to ensure \({\varGamma }_{k, m}>0\), we only require \(2c_{k+1}Q_m+LQ_m+\frac{c_{k+1}}{\alpha _k}<1\). When \(c_k\), \(Q_m\) and \(\alpha _k\) are chosen from (0, 1), this condition is easy to satisfy. From the definition of \(Q_m\), we can choose the parameters \(\eta _0\) and \(\beta \) sufficiently small so that \(Q_m\) is small enough; likewise, from the definition of \(c_k\), we can make \(c_k\) small enough. As a concrete example, setting \(c_{k+1}\ll \alpha _k\) at the same time yields \(2c_{k+1}Q_m+LQ_m+\frac{c_{k+1}}{\alpha _k}\ll 1\). Therefore, the condition \({\varGamma }_{k, m}>0\) is satisfied for appropriate choices of the parameters \(c_k\), \(Q_m\) and \(\alpha _k\). \(\square \)
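A quick numerical illustration of the condition above (the parameter values here are hypothetical, chosen only to show how easily \(2cQ+LQ+c/\alpha <1\) is met for small \(c\) and \(Q\)):

```python
# Hypothetical parameter values for the condition 2*c*Q + L*Q + c/alpha < 1
# from the proof of Lemma 2; any sufficiently small c and Q work.
c, Q, L, alpha = 0.01, 0.1, 1.0, 0.5
lhs = 2 * c * Q + L * Q + c / alpha  # 0.002 + 0.1 + 0.02 = 0.122
assert lhs < 1  # condition Gamma_{k,m} > 0 is satisfied
```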
1.3 A.3 Proof of Theorem 1
Proof
Combining Lemma 2 and \(\gamma _m=\min _k {\varGamma }_{k, m}\), by summing over \(k=0, \ldots , m-1\), we have
The above inequality implies that
where we used the facts that \(R_m^{s+1}=\mathbb {E}[F(w_m^{s+1})]=\mathbb {E}[F(\widetilde{w}^{s+1})]\) (since \(c_m=0\)) and that \(R_0^{s+1}=\mathbb {E}[F(\widetilde{w}^s)]\) (since \(w_0^{s+1}=\widetilde{w}^s\)).
By summing over all epochs, we have
The above inequality uses the fact that \(\widetilde{w}^0=w^0\). Thus, we complete the proof of Theorem 1. Note that, although the output of our algorithm is the last iterate of the inner loop, \(w_{m}^{s+1}\), rather than an iterate sampled uniformly at random from the set \(\{w_k^s\}\) for \(s=0, \ldots ,S-1\) and \(k=0, \ldots ,m-1\), the bound in (17) can still be used in our proof. This is indeed the case: many studies have pointed out that these two ways of choosing the output achieve similar numerical performance on many problems.
Further, considering Assumption 1, i.e., \(\Vert \nabla f_i(w)-\nabla f_i(v)\Vert \le L\Vert w-v\Vert \), we have
To obtain the result in Theorem 1, \(\mathbb {E}[\Vert \nabla F(\widetilde{w}^{S})\Vert ^2]\le \frac{F(w^{0})-F(w^*)}{T\gamma _m}\), it suffices that the following condition holds
Therefore, the conclusion in Theorem 1 can be rewritten as
\(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, Z. Stochastic variance reduced gradient with hyper-gradient for non-convex large-scale learning. Appl Intell 53, 28627–28641 (2023). https://doi.org/10.1007/s10489-023-05025-1