Abstract
Non-convex optimization, which can better capture problem structure, has received considerable attention in machine learning, image/signal processing, statistics, and related applications. Owing to their faster convergence rates, stochastic variance reduced algorithms for solving such non-convex optimization problems have been studied extensively. However, how to select an appropriate step size, a crucial hyper-parameter for stochastic variance reduced algorithms, remains under-explored for non-convex optimization problems. To address this gap, we propose a new class of stochastic variance reduced algorithms based on the hyper-gradient, which automatically obtain the step size online. Specifically, we focus on the stochastic variance reduced gradient (SVRG) algorithm, which computes a full gradient periodically. We theoretically analyze the convergence of the proposed algorithm for non-convex optimization problems. Moreover, we show that the proposed algorithm enjoys the same complexity as state-of-the-art algorithms for non-convex problems in terms of finding an approximate stationary point. Thorough numerical results on empirical risk minimization with non-convex loss functions validate the efficacy of our method.
Data Availability
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.
Notes
For a function \(y=f(u)\), where \(u=\phi (x)\), the derivative of \(y\) with respect to \(x\) is given by the chain rule: \(\frac{\partial y}{\partial x}=\frac{\partial y}{\partial \phi }\cdot \frac{\partial \phi }{\partial x}\).
All data sets are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Acknowledgements
This work was supported by a grant from the China Postdoctoral Science Foundation under Grant 2019M663238, and was partially supported by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Author information
Authors and Affiliations
Contributions
Zhuang Yang: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing - original draft, Funding acquisition.
Corresponding author
Ethics declarations
Conflicts of interest
Author Zhuang Yang declares that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Additional informed consent was obtained from all individual participants for whom identifying information is included in this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Proofs for MSVRG-HD
1.1 A.1 Proof of Lemma 1
Proof
According to Algorithm 1 and the assumption that F has a \(\sigma \)-bounded gradient, we have
Notice that the above inequality holds because we choose the sample i by drawing uniformly at random (with replacement) from [n]. In other words, the resulting algorithm (MSVRG-HD) uses an unbiased estimator of the gradient at each iteration. Additionally, the evaluation of the step size \(\eta _k\) in Algorithm 1 can use different batches of samples, drawn uniformly at random from [n], for \(\nabla F_{\hat{S}}(w_{i+1}^{s+1})\) and \(\nabla F_{\hat{S}}(w_{i}^{s+1})\). Because the sampling is uniform with replacement, the above inequality holds. \(\square \)
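The unbiasedness claimed above can be checked numerically. The sketch below (illustrative only; it assumes a least-squares finite sum \(f_i(w)=\frac{1}{2}(a_i^\top w - b_i)^2\), and the names `grad_i`, `full_grad` are ours, not the paper's) averages the SVRG estimator over all components drawn uniformly and recovers the full gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(w, i):
    # Gradient of the i-th least-squares component f_i(w) = 0.5*(a_i^T w - b_i)^2.
    return A[i] * (A[i] @ w - b[i])

def full_grad(w):
    # Full gradient of F(w) = (1/n) * sum_i f_i(w).
    return A.T @ (A @ w - b) / n

w = rng.standard_normal(d)       # current iterate
w_snap = rng.standard_normal(d)  # snapshot point of the outer loop

# SVRG estimator: phi_i = grad_i(w) - grad_i(w_snap) + full_grad(w_snap).
# Averaging phi_i over i (uniform, with replacement) gives full_grad(w),
# i.e. the estimator is unbiased.
phi_mean = np.mean(
    [grad_i(w, i) - grad_i(w_snap, i) for i in range(n)], axis=0
) + full_grad(w_snap)

assert np.allclose(phi_mean, full_grad(w))
```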
1.2 A.2 Proof of Lemma 2
Proof
From (3) and \(w_{k+1}^{s+1}=w_{k}^{s+1}-\eta _{k+1} \phi _{k}^{s+1}\) in Algorithm 1, we have
where the last equality holds due to \(\mathbb {E}[\phi _{k}^{s+1}]=\nabla F(w_k^{s+1})\).
We emphasize here that the computation of the step size \(\eta _k\) in (5) can use different samples, drawn uniformly at random from [n], for \(\nabla f_i(w_{k-1})\) and \(\nabla f_i(w_{k-2})\). As a consequence, since the samples are drawn uniformly at random (with replacement) from [n], (14) holds, by an argument similar to the proof of Lemma 1.
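For intuition on the online step-size rule discussed here, the following minimal sketch implements a hypergradient update in the style of Baydin et al. (2018), \(\eta _k = \eta _{k-1} + \beta \langle g_{k-1}, g_{k-2}\rangle \), where \(g\) denotes a stochastic gradient estimate. The function name and the default values of `eta0` and `beta` are illustrative, not the exact rule (5) of Algorithm 1:

```python
import numpy as np

def hypergradient_steps(grads, eta0=0.1, beta=1e-3):
    """Hypergradient-style step sizes: eta grows when consecutive
    stochastic gradients align, shrinks when they oppose."""
    etas = [eta0]
    for k in range(1, len(grads)):
        etas.append(etas[-1] + beta * float(np.dot(grads[k], grads[k - 1])))
    return etas

# Aligned gradients increase the step size; opposing ones decrease it.
g = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
etas = hypergradient_steps(g)
# etas is approximately [0.1, 0.101, 0.1]
```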
Now we consider the Lyapunov function, i.e.,
In order to bound it, we provide the following
where in the third equality we used \(w_{k+1}^{s+1}=w_{k}^{s+1}-\eta _{k+1} \phi _{k}^{s+1}\), in the fourth equality we used \(\mathbb {E}[\phi _{k}^{s+1}]=\nabla F(w_k^{s+1})\), and in the last inequality we used the Cauchy-Schwarz and Young inequalities. Here, we also used the hypothesis that the samples are chosen from \(\{1, \ldots ,n\}\) independently with replacement.
Substituting (14) and (15) into \(R_{k+1}^{s+1}\), we obtain the following bound:
According to the upper bound on \(\phi _{k}^{s+1}\), i.e., Lemma 3, we ascertain that
Further, utilizing Lemma 1, we have
where the first equality holds due to Lemma 2.
This completes the proof of Lemma 2.
Notice that to ensure \({\varGamma }_{k, m}>0\), we only require \(2c_{k+1}Q_m+LQ_m+\frac{c_{k+1}}{\alpha _k}<1\). When \(c_k\), \(Q_m\) and \(\alpha _k\) are chosen from (0, 1), this condition is easy to satisfy. From the definition of \(Q_m\), we can choose the parameters \(\eta _0\) and \(\beta \) sufficiently small so that \(Q_m\) is small enough; likewise, from the definition of \(c_k\), we can make \(c_k\) small enough. As a concrete example, setting \(c_{k+1}\ll \alpha _k\) at the same time yields \(2c_{k+1}Q_m+LQ_m+\frac{c_{k+1}}{\alpha _k}\ll 1\). Therefore, the condition \({\varGamma }_{k, m}>0\) is satisfied for appropriate choices of the parameters \(c_k\), \(Q_m\) and \(\alpha _k\). \(\square \)
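A quick numerical illustration of the condition above (the parameter values here are hypothetical, chosen only to show how easily \(2cQ+LQ+c/\alpha <1\) is met for small \(c\) and \(Q\)):

```python
# Hypothetical parameter values for the condition 2*c*Q + L*Q + c/alpha < 1
# from the proof of Lemma 2; any sufficiently small c and Q work.
c, Q, L, alpha = 0.01, 0.1, 1.0, 0.5
lhs = 2 * c * Q + L * Q + c / alpha  # 0.002 + 0.1 + 0.02 = 0.122
assert lhs < 1  # condition Gamma_{k,m} > 0 is satisfied
```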
1.3 A.3 Proof of Theorem 1
Proof
Combining Lemma 2 and \(\gamma _m=\min _k {\varGamma }_{k, m}\), by summing over \(k=0, \ldots , m-1\), we have
The above inequality implies that
where we used the facts that \(R_m^{s+1}=\mathbb {E}[F(w_m^{s+1})]=\mathbb {E}[F(\widetilde{w}^{s+1})]\) (since \(c_m=0\)) and that \(R_0^{s+1}=\mathbb {E}[F(\widetilde{w}^s)]\) (since \(w_0^{s+1}=\widetilde{w}^s\)).
By summing over all epochs, we have
The above inequality uses the fact that \(\widetilde{w}^0=w^0\). Thus, we complete the proof of Theorem 1. Note that, although the output of our algorithm is the last iterate of the inner loop, \(w_{m}^{s+1}\), rather than an iterate sampled uniformly at random from the set \(\{w_k^s\}\) for \(s=0, \ldots ,S-1\) and \(k=0, \ldots ,m-1\), the bound in (17) can still be used in our proof. This is indeed the case: many studies have pointed out that these two ways of choosing the output achieve similar numerical performance on many problems.
Further, considering Assumption 1, i.e., \(\Vert \nabla f_i(w)-\nabla f_i(v)\Vert \le L\Vert w-v\Vert \), we have
To obtain the result in Theorem 1, \(\mathbb {E}[\Vert \nabla F(\widetilde{w}^{S})\Vert ^2]\le \frac{F(w^{0})-F(w^*)}{T\gamma _m}\), it suffices that the following condition holds
Therefore, the conclusion in Theorem 1 can be rewritten as
\(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, Z. Stochastic variance reduced gradient with hyper-gradient for non-convex large-scale learning. Appl Intell 53, 28627–28641 (2023). https://doi.org/10.1007/s10489-023-05025-1