
Faster doubly stochastic functional gradient by gradient preconditioning for scalable kernel methods


Abstract

The doubly stochastic functional gradient descent algorithm (DSG) is memory-friendly and computationally efficient, and can effectively scale up kernel methods. However, on highly ill-conditioned large-scale nonlinear machine learning problems, DSG converges quite slowly: the condition number of the Hessian matrix of such problems is large, which slows down stochastic gradient methods in general. Fortunately, gradient preconditioning is a well-established optimization technique for reducing the condition number. We therefore propose a preconditioned doubly stochastic functional gradient descent algorithm (P-DSG) that combines DSG with gradient preconditioning. P-DSG first uses gradient preconditioning to adaptively scale the individual components of the functional gradient estimated by DSG, and then takes the preconditioned functional gradient as the descent direction in each iteration. Theoretically, the ideal preconditioner is the inverse of the Hessian matrix at the optimum, which is expensive to compute. We therefore approximate the Hessian matrix by the empirical covariance matrix of random Fourier features, and then form a low-rank approximation of this covariance matrix. P-DSG attains a fast convergence rate of \(\mathcal{O}(1/t)\) with a smaller constant factor in the bound than DSG, while remaining \(\mathcal{O}(t)\) in memory and \(\mathcal{O}(td)\) in computation. Finally, we evaluate P-DSG on kernel ridge regression, kernel support vector machines, and kernel logistic regression. The experimental results show that P-DSG speeds up convergence and achieves better performance.
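
To make the idea above concrete, the following minimal Python sketch shows one way to take preconditioned stochastic gradient steps for kernel ridge regression in a random Fourier feature space, with the preconditioner built from a low-rank eigendecomposition of the empirical feature covariance plus the ridge term. This is an illustration under stated assumptions, not the authors' P-DSG code: the function names, the default values of D, rank, lam and gamma, the 1/t step size, and the restriction to the squared loss are all assumptions made for the sketch.

import numpy as np

# Illustrative sketch only: names, defaults, and the kernel ridge regression
# objective are assumptions for this example, not the paper's P-DSG code.

def rff_features(X, W, b):
    # Random Fourier features approximating the Gaussian (RBF) kernel.
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def make_preconditioner(Z, rank, lam):
    # Approximate inverse of (C + lam*I), where C = Z^T Z / n is the empirical
    # covariance of the random features, using only its top eigenpairs.
    n = Z.shape[0]
    C = Z.T @ Z / n
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    U, s = eigvecs[:, -rank:], eigvals[-rank:]  # top-`rank` eigenpairs
    def apply_P(g):
        # P g = U diag(1/(s+lam)) U^T g + (g - U U^T g) / lam
        coef = U.T @ g
        return U @ (coef / (s + lam)) + (g - U @ coef) / lam
    return apply_P

def pdsg_krr_sketch(X, y, D=512, rank=64, lam=1e-3, gamma=1.0,
                    epochs=5, batch=64, seed=0):
    # Preconditioned SGD for kernel ridge regression in the RFF feature space.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z = rff_features(X, W, b)
    apply_P = make_preconditioner(Z, rank, lam)
    theta, t = np.zeros(D), 0
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
            t += 1
            Zb, yb = Z[idx], y[idx]
            grad = Zb.T @ (Zb @ theta - yb) / len(idx) + lam * theta
            theta -= (1.0 / t) * apply_P(grad)  # preconditioned gradient step
    return theta, W, b

Because the preconditioner is applied through its top-`rank` eigenpairs, each step adds only about O(D·rank) work on top of the plain stochastic gradient, which mirrors the role the low-rank approximation of the covariance matrix plays in the abstract.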




Acknowledgements

This work was supported by the National Natural Science Foundation of China [Grant number 61772020].

Author information

Corresponding author

Correspondence to Shuisheng Zhou.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, Z., Zhou, S., Yang, T. et al. Faster doubly stochastic functional gradient by gradient preconditioning for scalable kernel methods. Appl Intell 52, 7091–7112 (2022). https://doi.org/10.1007/s10489-021-02618-6

