Accelerating SGD using flexible variance reduction on large-scale datasets

Original Article · Neural Computing and Applications

Abstract

Stochastic gradient descent (SGD) is a popular optimization method widely used in machine learning, but the variance of its gradient estimates leads to slow convergence. Many variance reduction methods have been proposed to accelerate convergence. However, most of these methods require additional memory or extra computation of the full gradient, which makes them inefficient or even impractical for real-world applications with large-scale datasets. To handle this issue, we propose a new flexible variance reduction method for SGD, named FVR-SGD, which reduces memory overhead and speeds up convergence by using flexible subsets without extra computation. We analyze the convergence properties of the method in detail and show that a linear convergence rate is guaranteed under flexible variance reduction. Efficient variants for distributed environments and deep neural networks are also proposed. Several numerical experiments are conducted on a range of real-world large-scale datasets. The experimental results demonstrate that FVR-SGD outperforms currently popular algorithms; specifically, the proposed method achieves up to a 40% reduction in training time when solving the optimization problems of logistic regression, SVM and neural networks.
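The complete FVR-SGD procedure and its distributed and deep-learning variants are specified in the body of the paper. As a rough orientation only, the sketch below illustrates the kind of flexible variance-reduced update analysed in the Appendix: a memory of snapshot points and their component gradients is kept, each step uses a stochastic gradient corrected by the averaged memory, and each epoch only a subset of the memory is refreshed, so no full-gradient pass over the dataset is needed. The function names, subset size and refresh rule are assumptions made for illustration and are not the authors' pseudocode.

```python
import numpy as np

def fvr_sgd_sketch(grad_fi, n, dim, epochs, K, eta, seed=0):
    """Illustrative sketch of a flexible variance-reduced SGD loop.

    grad_fi(w, i) returns the gradient of the i-th component f_i at w.
    A memory of snapshot points phi[l] and their gradients g[l] is kept;
    each epoch only K randomly chosen slots are refreshed, so the full
    gradient over all n components is never recomputed.  Names and the
    refresh rule are assumptions for this sketch (the Appendix analyses
    2n stored points and 2K inner steps per epoch).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    phi = np.zeros((n, dim))                                  # stored snapshot points
    g = np.stack([grad_fi(phi[l], l) for l in range(n)])      # stored component gradients
    w_tilde = phi.mean(axis=0)                                # averaged snapshot
    g_tilde = g.mean(axis=0)                                  # averaged stored gradient

    for _ in range(epochs):
        iterates = []
        for _ in range(K):
            i = int(rng.integers(n))
            # variance-reduced direction: stochastic gradient plus correction term
            v = grad_fi(w, i) - grad_fi(w_tilde, i) + g_tilde
            w = w - eta * v
            iterates.append(w.copy())
        # flexible refresh: overwrite only K memory slots with this epoch's iterates
        subset = rng.choice(n, size=K, replace=False)
        for slot, w_new in zip(subset, iterates):
            phi[slot] = w_new
            g[slot] = grad_fi(w_new, slot)
        w_tilde = phi.mean(axis=0)
        g_tilde = g.mean(axis=0)
    return w
```

In this sketch an epoch costs a number of component-gradient evaluations proportional to \(K\) rather than \(n\); the paper's own bookkeeping, which the abstract reports reduces memory overhead, differs from the explicit per-component memory kept here for readability.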




Acknowledgements

This work was supported by the National Key Research and Development Program of China (2016YFB1000100).

Author information

Corresponding author

Correspondence to Zhen Huang.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Accelerating SGD using Flexible Variance Reduction on Large-Scale Datasets.”

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this work was accepted by the 25th International Conference on Neural Information Processing (ICONIP 2018).

Appendix

Theorem 1

Suppose that \(w_*\) is the optimum with \(\nabla F(w_*)=0\), \(\phi ^l_m\) denotes the accumulated \(w\) in the \(m\)-th epoch and \(\tilde{G}_m = \sum _{l=1}^{2n} F(\phi _m^l)/2n\); define

$$\begin{aligned} \alpha&> 0,\\ \beta&= 2n\left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) ,\\ \gamma&= (2n-2K)\left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) \\&\quad + 2K\left( 8 \eta ^2 L + \frac{16 \eta ^2 L^2}{\mu } + \frac{2 \eta L^2}{\mu }\right) ,\\ \delta&= \max \left( 1-\mu \eta +\eta , \frac{\gamma }{\beta }\right) , \end{aligned}$$

then there exist \(\beta > 0\) and \(0<\delta <1\) such that

$$\begin{aligned}&{\mathbb {E}} \parallel w_{m+2}^{1} - w_* \parallel ^2 + \beta {\mathbb {E}}(\tilde{G}_{m+1}-F(w_*)) \\&\quad \le \delta [{\mathbb {E}} \parallel w_{m+1}^{1} - w_* \parallel ^2 + \beta {\mathbb {E}}(\tilde{G}_{m}-F(w_*))], \end{aligned}$$

which means our method converges linearly.
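To make the roles of these constants concrete, the short check below evaluates \(\beta\), \(\gamma\) and the resulting contraction factor \(\delta\) for one hypothetical parameter setting; the values \(\mu =2\), \(L=1\), \(\alpha =1\), \(n=1000\), \(K=100\) and \(\eta =0.01\) are chosen purely for illustration and do not come from the paper.

```python
# Hypothetical constants chosen for illustration only (not taken from the paper).
mu, L, alpha = 2.0, 1.0, 1.0
n, K, eta = 1000, 100, 0.01

A = 2 * eta + 2 * eta * alpha * L**2 / mu - 4 * eta**2 * L          # common factor of beta and gamma
B = 8 * eta**2 * L + 16 * eta**2 * L**2 / mu + 2 * eta * L**2 / mu

beta = 2 * n * A                                  # about 59.2
gamma = (2 * n - 2 * K) * A + 2 * K * B           # about 55.6
delta = max(1 - mu * eta + eta, gamma / beta)     # about 0.99 < 1, so the bound contracts

print(beta, gamma, delta)
```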

Proof

Given any \(i\), define

$$\begin{aligned} g_i(w)=f_i(w)-f_i(w_*)-\nabla f_i(w_*)^\top (w-w_*), \end{aligned}$$

we know that \(g_i(w)\) is convex and \(\nabla g_i(w_*)=0\), so \(g_i(w_*)=\min \limits _w g_i(w)\). Based on the continuity of \(g_i(w)\), we can obtain

$$\begin{aligned} 0&=g_i(w_*) \le \min \limits _\eta [g_i(w-\eta \nabla g_i(w))] \\&\le \min \limits _\eta [g_i(w)-\eta \Vert \nabla g_i(w)\Vert _2^2 + 0.5L\eta ^2 \Vert \nabla g_i(w)\Vert _2^2 ]\\&=g_i(w)- \frac{1}{2L}\Vert \nabla g_i(w)\Vert _2^2. \end{aligned}$$

That is,

$$\begin{aligned}&\Vert \nabla f_i(w)-\nabla f_i(w_*)\Vert _2^2 \le 2L[f_i(w)-f_i(w_*)\\&\quad -\nabla f_i(w_*)^\top (w-w_*)]. \end{aligned}$$

Taking the expectation with respect to \(i\) and using the fact that \({\mathbb {E}} [\nabla f_i(w_*)]=\nabla F(w_*)=0\), we obtain

$$\begin{aligned} {\mathbb {E}} \parallel \nabla f_i(w) - \nabla f_i(w_*) \parallel ^2 \le 2L (F(w)-F(w_*)). \end{aligned}$$
(17)

Inequality (18) can be obtained from the continuity (10) and convexity (11) of F.

$$\begin{aligned} \parallel \nabla f_i(w) - \nabla f_i(w_*) \parallel ^2&\le L^2\parallel w - w_* \parallel ^2 \nonumber \\&\le \frac{2L^2}{\mu } (F(w) - F(w_*)). \end{aligned}$$
(18)

Now we can take expectation of \(v^k_{m+1}\) with respect to \(i_k\) and obtain

$$\begin{aligned}&{\mathbb {E}} \parallel v^k_{m+1} \parallel ^2 = {\mathbb {E}} \parallel \nabla f_{i_k}(w^{k}_{m+1}) - \nabla f_{i_k}(\tilde{w}_m) + \tilde{g}_m \parallel ^2\nonumber \\&\quad \le 2{\mathbb {E}} \parallel \nabla f_{i_k}(w^{k}_{m+1}) - \nabla f_{i_k}(w_*) \parallel ^2 \nonumber \\&\qquad + 2{\mathbb {E}} \parallel \nabla f_{i_k}(\tilde{w}_m) - \nabla f_{i_k}(w_*) - \tilde{g}_m \parallel ^2 \nonumber \\&\quad \le 2{\mathbb {E}} \parallel \nabla f_{i_k}(w^{k}_{m+1}) - \nabla f_{i_k}(w_*) \parallel ^2 \nonumber \\&\qquad + 4{\mathbb {E}} \parallel \nabla f_{i_k}(\tilde{w}_m) - \nabla f_{i_k}(w_*) \parallel ^2 \nonumber \\&\qquad + 8 \parallel \frac{1}{2n} \sum _{l=1}^{2n} [\nabla f_{i_l}(\phi _m^l) - \nabla f_{i_l}(w_*)] \parallel ^2 \nonumber \\&\qquad + 8 \parallel \frac{1}{2n} \sum _{l=1}^{2n} \nabla f_{i_l}(w_*) \parallel ^2 \nonumber \\&\quad \le 4L(F(w^{k}_{m+1})-F(w_*)) + 8L(F(\tilde{w}_m)-F(w_*)) \nonumber \\&\qquad + \frac{4}{n} \sum _{l=1}^{2n} \parallel \nabla f_{i_l}(\phi _m^l) - \nabla f_{i_l}(w_*) \parallel ^2\nonumber \\&\quad \le 4L(F(w^{k}_{m+1})-F(w_*)) + \left( 8L + \frac{16L^2}{\mu }\right) (\tilde{G}_m - F(w_*)). \end{aligned}$$
(19)

The first and second inequalities use \(\parallel a + b \parallel ^2 \le 2\parallel a \parallel ^2 + 2\parallel b \parallel ^2\). The third inequality uses inequality (17) and \(\parallel \sum _{i=1}^n a_i \parallel ^2 \le n\sum _{i=1}^n \parallel a_i \parallel ^2\). The fourth inequality uses inequality (18) and \(F(\tilde{w}_m)=F(\frac{1}{2n} \sum _{l=1}^{2n}\phi _m^l) \le \frac{1}{2n} \sum _{l=1}^{2n} F(\phi _m^l)\). Note that \({\mathbb {E}} v^k_{m+1} = \nabla F(w^k_{m+1}) - \nabla F(\tilde{w}_m) + \tilde{g}_m\). Conditioned on \(w_{m+1}^{k-1}\), we can obtain the bound:

$$\begin{aligned}&{\mathbb {E}} \parallel w_{m+1}^{k} - w_* \parallel ^2 = {\mathbb {E}} \parallel w_{m+1}^{k-1} - \eta v^{k-1}_{m+1} - w_* \parallel ^2\nonumber \\&\quad = \parallel w_{m+1}^{k-1} - w_* \parallel ^2 - 2 \eta (w_{m+1}^{k-1}-w_*)^\top {\mathbb {E}} v^{k-1}_{m+1} \nonumber \\&\qquad + \eta ^2 {\mathbb {E}} \parallel v^{k-1}_{m+1} \parallel ^2 \nonumber \\&\quad = \parallel w_{m+1}^{k-1} - w_* \parallel ^2 - 2 \eta (w_{m+1}^{k-1}-w_*)^\top \nabla F(w^{k-1}_{m+1}) \nonumber \\&\qquad - 2 \eta (w_{m+1}^{k-1}-w_*)^\top (\tilde{g}_m - \nabla F(\tilde{w}_m)) + \eta ^2 {\mathbb {E}} \parallel v^{k-1}_{m+1} \parallel ^2\nonumber \\&\quad \le \parallel w_{m+1}^{k-1} - w_* \parallel ^2 - 2 \eta (F(w_{m+1}^{k-1})-F(w_*)) \nonumber \\&\qquad - \mu \eta \parallel w_* - w_{m+1}^{k-1} \parallel ^2 \nonumber \\&\qquad - 2 \eta (w_{m+1}^{k-1}-w_*)^\top (\tilde{g}_m - \nabla F(\tilde{w}_m)) + \eta ^2 {\mathbb {E}} \parallel v^{k-1}_{m+1} \parallel ^2 \nonumber \\&\quad \le \parallel w_{m+1}^{k-1} - w_* \parallel ^2 - 2 \eta (F(w_{m+1}^{k-1})-F(w_*)) \nonumber \\&\qquad - \mu \eta \parallel w_{m+1}^{k-1} - w_* \parallel ^2 \nonumber \\&\qquad - 2 \eta (w_{m+1}^{k-1}-w_*)^\top (\tilde{g}_m - \nabla F(\tilde{w}_m)) \nonumber \\&\qquad + 4 \eta ^2 L (F(w^{k-1}_{m+1})-F(w_*)) \nonumber \\&\qquad + \eta ^2\left( 8L + \frac{16L^2}{\mu }\right) (\tilde{G}_m - F(w_*)) \nonumber \\&\quad = \parallel w_{m+1}^{k-1} - w_* \parallel ^2 - \mu \eta \parallel w_{m+1}^{k-1} - w_* \parallel ^2 \nonumber \\&\qquad - (2 \eta - 4 \eta ^2 L)(F(w^{k-1}_{m+1})-F(w_*))\nonumber \\&\qquad - 2 \eta (w_{m+1}^{k-1}-w_*)^\top (\tilde{g}_m - \nabla F(\tilde{w}_m)) \nonumber \\&\qquad + \eta ^2\left( 8L + \frac{16L^2}{\mu }\right) (\tilde{G}_m - F(w_*)). \end{aligned}$$
(20)

The first inequality uses inequality (11). The second inequality uses inequality (19). The inner product term can be bounded as follows:

$$\begin{aligned}&-2(w_{m+1}^{k-1}-w_*)^\top (\tilde{g}_m - \nabla F(\tilde{w}_m)) \nonumber \\&\quad \le 2\parallel (w_{m+1}^{k-1}-w_*)^\top \parallel \parallel \tilde{g}_m - \nabla F(\tilde{w}_m) \parallel \nonumber \\&\quad \le 2\parallel (w_{m+1}^{k-1}-w_*)^\top \parallel \left[ \frac{1}{2n} \sum _{l=1}^{2n} \parallel \nabla f_{i_l}(\phi _m^l)-\nabla F(\tilde{w}_m) \parallel \right] \nonumber \\&\quad \le 2\parallel (w_{m+1}^{k-1}-w_*)^\top \parallel \left[ \frac{L}{2n} \sum _{l=1}^{2n} \parallel \phi _m^l-\tilde{w}_m \parallel \right] \nonumber \\&\quad = \frac{1}{2n} \sum _{l=1}^{2n} [2\parallel (w_{m+1}^{k-1}-w_*)^\top \parallel L \parallel \phi _m^l-\tilde{w}_m \parallel ]\nonumber \\&\quad \le \parallel w_{m+1}^{k-1}-w_* \parallel ^2 + \frac{L^2}{2n} \sum _{l=1}^{2n} \parallel \phi _m^l-\tilde{w}_m \parallel ^2\nonumber \\&\quad \le \parallel w_{m+1}^{k-1}-w_* \parallel ^2 \nonumber \\&\qquad + \frac{L^2}{2n} \frac{2}{\mu } \sum _{l=1}^{2n}(F(\phi _m^l)-F(\tilde{w}_m)- \nabla F(\tilde{w}_m)^\top (\phi _m^l-\tilde{w}_m))\nonumber \\&\quad = \parallel w_{m+1}^{k-1}-w_* \parallel ^2 \nonumber \\&\qquad + \frac{2L^2}{\mu } [(\tilde{G}_m-F(w_*))- (F(\tilde{w}_m)-F(w_*))]. \end{aligned}$$
(21)

The third inequality uses inequality (10), the fourth inequality uses the Cauchy–Schwarz inequality, and the fifth inequality uses inequality (11). We set \(F(\tilde{w}_m)-F(w_*)=\alpha (F(w_{m+1}^{k-1})-F(w_*))\), where \(\alpha > 0\). Then we can rewrite (20) as:

$$\begin{aligned}&{\mathbb {E}} \parallel w_{m+1}^{k} - w_*\parallel ^2 \nonumber \\&\quad \le \parallel w_{m+1}^{k-1} - w_* \parallel ^2 - (\mu -1 )\eta \parallel w_{m+1}^{k-1} - w_* \parallel ^2 \nonumber \\&\qquad - \left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) (F(w^{k-1}_{m+1})-F(w_*)) \nonumber \\&\qquad + \left( 8 \eta ^2 L + \frac{16 \eta ^2 L^2}{\mu } + \frac{2 \eta L^2}{\mu }\right) (\tilde{G}_m - F(w_*)). \end{aligned}$$
(22)

Summing the previous inequality over \(k=1,\ldots ,2K\) and taking expectation with respect to all the history, we obtain

$$\begin{aligned}&{\mathbb {E}} \parallel w_{m+2}^{1} - w_* \parallel ^2 \nonumber \\&\quad \le {\mathbb {E}} \parallel w_{m+1}^{1} - w_* \parallel ^2 - (\mu -1 )\eta \sum _{k=1}^{2K} {\mathbb {E}} \parallel w_{m+1}^{k} - w_* \parallel ^2\nonumber \\&\qquad - \left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) \sum _{i=1}^{2K} {\mathbb {E}}( F(w^{i}_{m+1})-F(w_*))\nonumber \\&\qquad + 2K\left( 8 \eta ^2 L + \frac{16 \eta ^2 L^2}{\mu } + \frac{2 \eta L^2}{\mu }\right) {\mathbb {E}} (\tilde{G}_m - F(w_*))\nonumber \\&\quad \le {\mathbb {E}} \parallel w_{m+1}^{1} - w_* \parallel ^2 - (\mu -1 )\eta \sum _{k=1}^{2K} {\mathbb {E}} \parallel w_{m+1}^{k} - w_* \parallel ^2\nonumber \\&\qquad - 2n\left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) {\mathbb {E}}(\tilde{G}_{m+1}-F(w_*))\nonumber \\&\qquad + (2n-2K) \left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) {\mathbb {E}} (\tilde{G}_{m}-F(w_*))\nonumber \\&\qquad + 2K\left( 8 \eta ^2 L + \frac{16 \eta ^2 L^2}{\mu } + \frac{2 \eta L^2}{\mu }\right) {\mathbb {E}} (\tilde{G}_m - F(w_*))\nonumber \\&\quad \le (1-\mu \eta +\eta ){\mathbb {E}} \parallel w_{m+1}^{1} - w_* \parallel ^2\nonumber \\&\qquad - 2n\left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) {\mathbb {E}}(\tilde{G}_{m+1}-F(w_*))\nonumber \\&\qquad + (2n-2K) \left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) {\mathbb {E}} (\tilde{G}_{m}-F(w_*))\nonumber \\&\qquad + 2K\left( 8 \eta ^2 L + \frac{16 \eta ^2 L^2}{\mu } + \frac{2 \eta L^2}{\mu }\right) {\mathbb {E}} (\tilde{G}_m - F(w_*)). \end{aligned}$$
(23)

The first inequality follows from summing inequality (22) over the \(2K\) inner iterations of epoch \(m+1\). Let \(\tilde{G}^{-j}_{m+1}\) denote \(\tilde{G}_{m+1}\) with the selected subset \(j\) of epoch \(m+1\) removed, that is, \(\tilde{G}^{-j}_{m+1} = \frac{1}{2n-2K} \sum _{l=1}^{2n-2K} F(\phi _m^l)\). The second inequality uses

$$\begin{aligned}&-\sum _{i=1}^{2K} {\mathbb {E}} (F(w^{i}_{m+1})-F(w_*)) \\&\quad = - {\mathbb {E}}\left[ \sum _{i=1}^{2K} F(w^{i}_{m+1}) + \tilde{G}^{-j} - 2nF(w_*) \right. \\&\qquad \left. - (\tilde{G}^{-j}-(2n-2K)F(w_*))\right] \\&\quad = -2n{\mathbb {E}}(\tilde{G}_{m+1}-F(w_*)) + (2n-2K){\mathbb {E}}(\tilde{G}_{m}-F(w_*)). \end{aligned}$$

The third inequality uses

$$\begin{aligned} -\sum _{k=1}^{2K} {\mathbb {E}} \parallel w_{m+1}^{k} - w_* \parallel ^2 \le -{\mathbb {E}} \parallel w_{m+1}^{1} - w_* \parallel ^2. \end{aligned}$$

Rearranging (23), we obtain:

$$\begin{aligned}&{\mathbb {E}} \parallel w_{m+2}^{1} - w_* \parallel ^2 + 2n\left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) {\mathbb {E}}(\tilde{G}_{m+1}-F(w_*))\nonumber \\&\quad \le (1-\mu \eta +\eta ){\mathbb {E}} \parallel w_{m+1}^{1} - w_* \parallel ^2\nonumber \\&\qquad + (2n-2K)\left( 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L\right) {\mathbb {E}}(\tilde{G}_{m}-F(w_*))\nonumber \\&\qquad + 2K\left( 8 \eta ^2 L + \frac{16 \eta ^2 L^2}{\mu } + \frac{2 \eta L^2}{\mu }\right) {\mathbb {E}} (\tilde{G}_m - F(w_*)). \end{aligned}$$
(24)

Hence we obtain the bound:

$$\begin{aligned}&{\mathbb {E}} \parallel w_{m+2}^{1} - w_* \parallel ^2 + \beta {\mathbb {E}}(\tilde{G}_{m+1}-F(w_*)) \nonumber \\&\quad \le \delta [{\mathbb {E}} \parallel w_{m+1}^{1} - w_* \parallel ^2 + \beta {\mathbb {E}}(\tilde{G}_{m}-F(w_*))], \end{aligned}$$
(25)

When \(1-\mu \eta +\eta < 1\), \(\beta > 0\), \(\gamma > 0\) and \(\frac{\gamma }{\beta } < 1\), \(\delta\) lies in \((0,1)\). Thus Theorem 1 requires the following conditions:

$$\begin{aligned}&0<1-\mu \eta +\eta < 1, \\&\quad 2 \eta + \frac{2\eta \alpha L^2}{\mu } - 4 \eta ^2 L> 0,\\&\quad \beta - \gamma = 2\eta K\left( 2 + \frac{2 \alpha L^2}{\mu } - 12 \eta L - \frac{16 \eta L^2}{\mu } - \frac{2 L^2}{\mu }\right) >0, \end{aligned}$$

that is

$$\begin{aligned}&\mu > 1,\\&\quad \eta < \min \left( \frac{1}{\mu -1},\frac{\alpha L^2 + \mu }{2 \mu L},\frac{\mu + \alpha L^2 - L^2}{12\mu L + 16L^2}\right) . \end{aligned}$$

Then the result follows immediately. \(\square\)
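As a quick numerical sanity check, with the same hypothetical values used after Theorem 1 (\(\mu =2\), \(L=1\), \(\alpha =1\), chosen for illustration only), the step-size condition evaluates to

$$\begin{aligned} \eta < \min \left( \frac{1}{\mu -1},\frac{\alpha L^2 + \mu }{2 \mu L},\frac{\mu + \alpha L^2 - L^2}{12\mu L + 16L^2}\right) = \min \left( 1, \frac{3}{4}, \frac{2}{40}\right) = 0.05, \end{aligned}$$

so any step size below 0.05, for instance \(\eta =0.01\), satisfies all three conditions simultaneously.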


Cite this article

Tang, M., Qiao, L., Huang, Z. et al. Accelerating SGD using flexible variance reduction on large-scale datasets. Neural Comput & Applic 32, 8089–8100 (2020). https://doi.org/10.1007/s00521-019-04315-5
