
A mini-batch stochastic conjugate gradient algorithm with variance reduction

Journal of Global Optimization

Abstract

The stochastic gradient descent (SGD) method is popular for large-scale optimization, but its asymptotic convergence is slow because of the inherent variance of the stochastic gradients. To remedy this problem, many explicit variance reduction methods for stochastic gradient descent have been proposed, such as SVRG (Johnson and Zhang, in: Advances in Neural Information Processing Systems, 2013, pp. 315–323), SAG (Roux et al., in: Advances in Neural Information Processing Systems, 2012, pp. 2663–2671) and SAGA (Defazio et al., in: Advances in Neural Information Processing Systems, 2014, pp. 1646–1654). We consider the conjugate gradient method, which has the same per-iteration computational cost as the gradient descent method. In this paper, in the spirit of SAGA, we propose a stochastic conjugate gradient algorithm, which we call SCGA. With Fletcher–Reeves type choices of the conjugate parameter, we prove a linear convergence rate for smooth and strongly convex functions. We experimentally demonstrate that SCGA converges faster than popular SGD-type algorithms on four machine learning models, which may be convex, nonconvex or nonsmooth. On regression problems, SCGA is competitive with CGVR, which, to our knowledge, is the only other stochastic conjugate gradient algorithm with variance reduction.
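
As a rough sketch of the two ingredients named in the abstract, the following Python snippet combines a SAGA-style variance-reduced gradient estimate with a Fletcher-Reeves conjugate direction, d_k = -g_k + beta_k d_{k-1} with beta_k = ||g_k||^2 / ||g_{k-1}||^2, on a least-squares problem. The function name saga_fr_cg, the fixed step size and the periodic restart are illustrative assumptions only; this is not the SCGA algorithm or the step-size rule analysed in the paper.

```python
import numpy as np


def saga_fr_cg(A, b, w0, step=0.02, n_iters=2000, batch=10, restart=20, seed=0):
    """Sketch of a SAGA-style variance-reduced stochastic conjugate gradient
    update for the least-squares objective f(w) = 1/(2n) * ||A w - b||^2.

    Illustration only: the fixed step size and the periodic restart are
    simplifications, not the scheme analysed for SCGA in the paper.
    """
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    w = w0.astype(float)

    # SAGA memory: the most recently evaluated gradient of every sample.
    grad_table = A * (A @ w - b)[:, None]        # shape (n, dim)
    grad_avg = grad_table.mean(axis=0)           # running mean of the table

    g_prev = grad_avg.copy()
    d = -g_prev                                  # initial search direction

    for t in range(n_iters):
        idx = rng.choice(n, size=batch, replace=False)
        g_batch = A[idx] * (A[idx] @ w - b[idx])[:, None]   # fresh per-sample gradients

        # SAGA variance-reduced gradient estimate for this mini-batch.
        g = g_batch.mean(axis=0) - grad_table[idx].mean(axis=0) + grad_avg

        # Update the memory table and its running average.
        grad_avg += (g_batch - grad_table[idx]).sum(axis=0) / n
        grad_table[idx] = g_batch

        if (t + 1) % restart == 0:
            d = -g                               # periodic restart (steepest descent step)
        else:
            # Fletcher-Reeves coefficient: ||g_k||^2 / ||g_{k-1}||^2.
            beta = (g @ g) / max(g_prev @ g_prev, 1e-12)
            d = -g + beta * d

        w += step * d
        g_prev = g
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 5))
    w_true = rng.standard_normal(5)
    b = A @ w_true
    obj = lambda w: 0.5 * np.mean((A @ w - b) ** 2)

    w0 = np.zeros(5)
    w_hat = saga_fr_cg(A, b, w0)
    print(f"objective: {obj(w0):.4f} -> {obj(w_hat):.6f}")
```

The periodic restart is a common practical safeguard for nonlinear conjugate gradient methods with noisy gradients; SCGA's linear convergence guarantee instead relies on the Fletcher-Reeves type choice together with the conditions established in the paper.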

Notes

  1. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

  2. http://osmot.cs.cornell.edu/kddcup

  3. http://archive.ics.uci.edu/ml/datasets/Average+Localization+Error+%28ALE%29+in+sensor+node+localization+process+in+WSNs

References

  1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  2. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 20(1), 30–42 (2011)

  3. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.-R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6), 82–97 (2012)

  4. Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167 (2008)

  5. Dahl, G.E., Stokes, J.W., Deng, L., Yu, D.: Large-scale malware classification using random projections and neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3422–3426 (2013)

  6. Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)

  7. Robbins, H., Monro, S.: A stochastic approximation method. The Annals of Mathematical Statistics 22(3), 400–407 (1951)

  8. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010, pp. 177–186 (2010)

  9. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)

  10. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964)

  11. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k^2). In: Soviet Mathematics Doklady (1983)

  12. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Networks 12(1), 145–151 (1999)

  13. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(7) (2011)

  14. Zeiler, M.D.: ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)

  15. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4(2), 26–31 (2012)

  16. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  17. Hager, W.W., Zhang, H.: A survey of nonlinear conjugate gradient methods. Pacific Journal of Optimization 2(1), 35–58 (2006)

  18. Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems, pp. 2663–2671 (2012)

  19. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  20. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

  21. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: A novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning (2017)

  22. Gilbert, J.C., Nocedal, J.: Global convergence properties of conjugate gradient methods for optimization. SIAM Journal on Optimization 2, 21–42 (1992)

  23. Nocedal, J., Wright, S.: Numerical Optimization. Springer Science & Business Media (2006)

  24. Fletcher, R., Reeves, C.M.: Function minimization by conjugate gradients. The Computer Journal 7(2), 149–154 (1964)

  25. Polak, E., Ribière, G.: Note sur la convergence de méthodes de directions conjuguées. ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique 3(R1), 35–43 (1969)

  26. Polyak, B.T.: The conjugate gradient method in extremal problems. USSR Computational Mathematics and Mathematical Physics 9(4), 94–112 (1969)

  27. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49(6), 409–436 (1952)

  28. Dai, Y.H., Yuan, Y.: A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on Optimization 10(1), 177–182 (1999)

  29. Hager, W.W., Zhang, H.: A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM Journal on Optimization 16(1), 170–192 (2005)

  30. Dai, Y.H., Kou, C.X.: A nonlinear conjugate gradient algorithm with an optimal property and an improved Wolfe line search. SIAM Journal on Optimization 23(1), 296–320 (2013)

  31. Dai, Y.H., Yuan, Y.: Nonlinear Conjugate Gradient Methods. Shanghai Science and Technology Publisher (2000)

  32. Møller, M.F.: A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525–533 (1993)

  33. Le, Q.V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.Y.: On optimization methods for deep learning. In: ICML (2011)

  34. Moritz, P., Nishihara, R., Jordan, M.I.: A linearly convergent stochastic L-BFGS algorithm (2015)

  35. Jin, X.B., Zhang, X.Y., Huang, K., Geng, G.G.: Stochastic conjugate gradient algorithm with variance reduction. IEEE Transactions on Neural Networks and Learning Systems 30(5), 1360–1369 (2018)


Acknowledgements

We would like to thank the anonymous referees for their helpful comments. We would also like to thank Professor Y. H. Dai for his valuable suggestions. This work was supported by the Chinese NSF grants (Nos. 11971073, 12171052 and 11871115).

Author information

Corresponding author

Correspondence to Caixia Kou.



About this article


Cite this article

Kou, C., Yang, H. A mini-batch stochastic conjugate gradient algorithm with variance reduction. J Glob Optim 87, 1009–1025 (2023). https://doi.org/10.1007/s10898-022-01205-4
