
Robust large-scale online kernel learning

  • Original Article
  • Published in Neural Computing and Applications

Abstract

The control-based approach has proved effective for developing robust online learning methods. However, existing control-based kernel methods are infeasible for large-scale modeling because of their high computational complexity. This paper proposes a computationally efficient control-based framework for robust large-scale kernel learning. Using random feature approximation and a robust loss function, the learning problems are first transformed into a group of linear feedback control problems with sparse, large-scale discrete algebraic Riccati equations (DAREs). Then, with the solutions of the DAREs, two algorithms are developed to address large-scale binary classification and regression problems, respectively. Thanks to the sparseness, explicit rather than numerical solutions of the DAREs are derived using matrix computation techniques developed in this study. This substantially reduces the complexity and makes the proposed algorithms computationally efficient on large-scale complex datasets. Compared with existing benchmarks, the proposed algorithms achieve faster convergence and more robust and accurate modeling results. Theoretical analysis and encouraging numerical results on synthetic and real-world datasets are provided to illustrate the effectiveness and efficiency of our algorithms.


References

  1. Hoi SC, Sahoo D, Lu J, Zhao P (2021) Online learning: a comprehensive survey. Neurocomputing 459:249–289


  2. Cauwenberghs G, Poggio T (2001) Incremental and decremental support vector machine learning. In: Advances in neural information processing systems, pp. 409–415

  3. Shalev-Shwartz S, Singer Y, Srebro N, Cotter A (2011) Pegasos: primal estimated sub-gradient solver for SVM. Math Program 127(1):3–30


  4. Jun Z, Shen F, Fan H, Zhao J (2013) An online incremental learning support vector machine for large-scale data. Neural Comput Appl 22(5):1023–1035


  5. Ming L, Zhang L, Jin R, Weng S, Zhang C (2016) Online kernel learning with nearly constant support vectors. Neurocomputing 179:26–36


  6. Liu W, Principe JC, Haykin SH (2010) Kernel adaptive filtering: a comprehensive introduction, vol 1. Wiley, Hoboken


  7. Genlin J (2004) Survey on genetic algorithm. Comput Appl Softw 2(1):69–73


  8. Clerc M (2010) Particle swarm optimization, vol 93. John Wiley & Sons, New York


  9. Haug AJ (2012) Bayesian estimation and tracking: a practical guide. John Wiley & Sons, New York


  10. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

  11. Ruder S (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747

  12. Bucolo M, Cadarso VJ, Esteve J, Fortuna L, Llobera A, Sapuppo F, Schembri F (2008) A disposable micro-electro-optical interface for flow monitoring in bio-microfluidics, in: Proceedings of the 12th conference on miniaturized systems of chemistry and life science microTAS08, pp. 1579–1581

  13. Sapuppo F, Llobera A, Schembri F, Intaglietta M, Cadarso VJ, Bucolo M (2010) A polymeric micro-optical interface for flow monitoring in biomicrofluidics. Biomicrofluidics 4(2):024108


  14. Sapuppo F, Schembri F, Fortuna L, Llobera A, Bucolo M (2012) A polymeric micro-optical system for the spatial monitoring in two-phase microfluidics. Microfluid Nanofluid 12(1):165–174


  15. Tang HS, Xue ST, Chen R, Sato T (2006) Online weighted LS-SVM for hysteretic structural system identification. Eng Struct 28(12):1728–1735


  16. Ning H, Jing X, Cheng L (2011) Online identification of nonlinear spatiotemporal systems using kernel learning approach. IEEE Trans Neural Netw 22(9):1381–1394


  17. Jin SS, Jung HJ (2018) Vibration-based damage detection using online learning algorithm for output-only structural health monitoring. Struct Health Monit 17(4):727–746


  18. Taouali O, Elaissi I, Messaoud H (2012) Online identification of nonlinear system using reduced kernel principal component analysis. Neural Comput Appl 21(1):161–169


  19. Bhadriraju B, Narasingam A, Kwon JSI (2019) Machine learning-based adaptive model identification of systems: application to a chemical process. Chem Eng Res Des 152:372–383


  20. Motai Y, Siddique NA, Yoshida H (2017) Heterogeneous data analysis: online learning for medical-image-based diagnosis. Pattern Recogn 63:612–624


  21. Nguyen-Tuong D, Peters J (2012) Online kernel-based learning for task-space tracking robot control. IEEE Trans Neural Netw Learn Syst 23(9):1417–1425


  22. Laxhammar R, Falkman G (2013) Online learning and sequential anomaly detection in trajectories. IEEE Trans Pattern Anal Mach Intell 36(6):1158–1173


  23. Fan H, Song Q, Shrestha SB (2016) Kernel online learning with adaptive kernel width. Neurocomputing 175:233–242


  24. Chen B, Liang J, Zheng N, Principe JC (2016) Kernel least mean square with adaptive kernel size. Neurocomputing 191:95–106


  25. Sahoo D, Hoi SCH, Li B (2014) Online multiple kernel regression. In: Proc 20th ACM SIGKDD Int Conf Knowl Discovery Data Mining, pp. 293–302

  26. Hoi SCH, Jin R, Zhao P, Yang T (2013) Online multiple kernel classification. Mach Learn 90(2):289–316


  27. Fiat A, Woeginger GJ (1998) Online algorithms: the state of the art, vol 1442. Springer, Cham


  28. Ma J, Saul LK, Savage S, Voelker GM (2009) Identifying suspicious urls: an application of large-scale online learning. In: Proceedings of the 26th annual international conference on machine learning, pp. 681–688

  29. Li B, Hoi SC, Sahoo D, Liu Z (2015) Moving average reversion strategy for on-line portfolio selection. Artif Intell 222:104–123


  30. Kurt MN, Yilmaz Y, Wang X. Real-time nonparametric anomaly detection in high-dimensional settings. IEEE Trans Pattern Anal Mach Intell

  31. Kivinen J, Smola AJ, Williamson RC (2004) Online learning with kernels. IEEE Trans Signal Process 52(8):2165–2176


  32. Liu W, Pokharel PP, Principe JC (2008) The kernel least-mean-square algorithm. IEEE Trans Signal Process 56(2):543–554


  33. Lu J, Sahoo D, Zhao P, Hoi SC (2018) Sparse passive-aggressive learning for bounded online kernel methods. ACM Trans Intell Syst Technol (TIST) 9(4):1–27


  34. Wang Z, Crammer K, Vucetic S (2012) Breaking the curse of Kernelization: budgeted stochastic gradient descent for large-scale SVM training. J Mach Learn Res 13(1):3103–3131


  35. Engel Y, Mannor S, Meir R (2004) The kernel recursive least-squares algorithm. IEEE Trans Signal Process 52(8):2275–2285


  36. Le T, Nguyen TD, Nguyen V, Phung D (2017) Approximation vector machines for large-scale online learning. J Mach Learn Res 18(1):3962–4016


  37. Fan H, Song Q, Shrestha SB (2016) Kernel online learning with adaptive Kernel width. Neurocomputing 175:233–242


  38. Lu J, Hoi SC, Wang J, Zhao P, Liu Z-Y (2016) Large-scale online kernel learning. J Mach Learn Res 17(1):1613–1655


  39. De Brabanter K, De Brabanter J, Suykens JA, De Moor B (2011) Kernel regression in the presence of correlated errors. J Mach Learn Res 12(6):1955–1976


  40. Espinoza M, Suykens JA, De Moor B (2006) LS-SVM regression with autocorrelated errors. IFAC Proc Vol 39(1):582–587


  41. Jing X (2012) Robust adaptive learning of feedforward neural networks via LMI optimizations. Neural Netw 31:33–45


  42. Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper Res 68(1):276–294


  43. Ning H, Zhang J, Feng T-T, Chu EK-W, Tian T (2020) Control-based algorithms for high dimensional online learning. J Franklin Inst 357(3):1909–1942


  44. Zhang J, Ning H, Jing X, Tian T (2021) Online kernel learning with adaptive bandwidth by optimal control approach. IEEE Trans Neural Netw Learn Syst 32(5):1920–1934


  45. Ning H, Zhang J, Jing X, Tian T (2019) Robust online learning method based on dynamical linear quadratic regulator. IEEE Access 7:117780–117795


  46. Jing X, Cheng L (2012) An optimal PID control algorithm for training feedforward neural networks. IEEE Trans Ind Electron 60(6):2273–2283


  47. An W, Wang H, Sun Q, Xu J, Dai Q, Zhang L (2018) A PID controller approach for stochastic optimization of deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8522–8531

  48. Jing X (2011) An \(H_{\infty }\) control approach to robust learning of feedforward neural networks. Neural Netw 24(7):759–766


  49. Ning H, Qing G, Tian T, Jing X (2019) Online identification of nonlinear stochastic spatiotemporal system with multiplicative noise by robust optimal control-based kernel learning method. IEEE Tran Neural Netw Learn Syst 30(2):389–404


  50. Zhang J, Ning H (2020) Online kernel classification with adjustable bandwidth using control-based learning approach. Pattern Recogn 108:107566


  51. Ning H, Li Z (2018) An adaptive online machine learning method based on a robust optimal control approach. SCIENTIA SINICA Math 48(9):1181–1202


  52. Li T, Chu EK-W, Kuo Y-C, Lin W-W (2013) Solving large-scale nonsymmetric algebraic Riccati equations by doubling. SIAM J Matrix Anal Appl 34(3):1129–1147


  53. Li T, Chu EK-W, Lin W-W, Weng PC-Y (2013) Solving large-scale continuous-time algebraic Riccati equations by doubling. J Comput Appl Math 237(1):373–383


  54. Hoi SC, Wang J, Zhao P, Zhuang J, Liu Z (2013) Large-scale online kernel classification. In: IJCAI

  55. Nguyen TD, Le T, Bui H, Phung DQ (2017) Large-scale online Kernel learning with random feature reparameterization. In: IJCAI, pp. 2543–2549

  56. Shen Y, Chen T, Giannakis GB (2019) Random feature-based online multi-kernel learning in environments with unknown dynamics. J Mach Learn Res 20(1):773–808


  57. Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature maps. IEEE Trans Pattern Anal Mach Intell 34(3):480–492


  58. Rahimi A, Recht B (2008) Random features for large-scale kernel machines. In: Advances in neural information processing systems, pp. 1177–1184

  59. Kwon WH, Han SH (2006) Receding horizon control: model predictive control for state models. Springer Science & Business Media, Cham


  60. Camacho EF, Alba CB (2013) Model predictive control. Springer Science & Business Media, Cham


  61. Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296


  62. Cavallanti G, Cesa-Bianchi N, Gentile C (2007) Tracking the best hyperplane with a simple budget perceptron. Mach Learn 69(2):143–167


  63. Dekel O, Shalev-Shwartz S, Singer Y. The forgetron: a kernel-based perceptron on a fixed budget

  64. Orabona F, Keshet J, Caputo B (2009) Bounded Kernel-based online learning. J Mach Learn Res 10(11):2643–2666


  65. Zhao P, Wang J, Wu P, Jin R, Hoi SC (2012) Fast bounded online gradient descent algorithms for scalable kernel-based online learning. arXiv preprint arXiv:1206.4633

  66. Tüfekci P (2014) Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int J Electr Power Energy Syst 60:126–140


  67. Tso GK, Yau KK (2007) Predicting electricity energy consumption: a comparison of regression analysis, decision tree and neural networks. Energy 32(9):1761–1768


  68. Che J, Wang J, Wang G (2012) An adaptive fuzzy combination model based on self-organizing map and support vector regression for electric load forecasting. Energy 37(1):657–664


  69. Liu Y, Wang H, Jiang Y, Li P (2010) Selective recursive kernel learning for online identification of nonlinear systems with NARX form. J Process Control 20:181–194


  70. Philip R (2015) Essential statistics for the pharmaceutical sciences. John Wiley & Sons, New York


  71. Leopold S (2012) Introduction to mathematical statistics, vol 202. Springer Science & Business Media, Cham


  72. Zhang J, Li Z, Song X, Ning H (2021) Deep tobit networks: a novel machine learning approach to microeconometrics. Neural Netw 144:279–296



Acknowledgements

This work was supported in part by the National Social Science Foundation of China under Project 19BTJ025, the Fundamental Research Funds for the Central Universities under Project 2722022BY020, and the Innovation and Talent Base for Digital Technology and Finance (B21038).

Author information


Corresponding author

Correspondence to Hanwen Ning.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Proof

From standard optimal control theory [60], for any given n, V(E(n)) is quadratic; that is, there exists a symmetric positive definite matrix \(\mathcal {P}_n\) such that \(V(E(n))=E(n)^T\mathcal {P}_nE(n)=E_n(1)^T\mathcal {P}_nE_n(1)\). The associated Hamilton–Jacobi equation is

$$\begin{aligned} V(E(n))= & \min _{U_n(1),\ldots ,U_n(N)}\sum \limits _{t=1}^{\infty }E_n(t)^TE_n(t)+\gamma U_n(t)^TU_n(t)\\= & \min \limits _{U_n(1)}(E_n(1)^TE_n(1)+\gamma U_n(1)^TU_n(1)+V(E_n(1)+\mathcal {B}_nU_n(1))). \end{aligned}$$

It follows that

$$\begin{aligned} V(E(n))= & \min \limits _{U_n(1)}(E_n(1)^TE_n(1)+\gamma U_n(1)^TU_n(1) \\&+(E_n(1)+\mathcal {B}_nU_n(1))^T\mathcal {P}_n(E_n(1)+\mathcal {B}_nU_n(1))). \end{aligned}$$

To minimize V(E(n)), we set the partial derivative with respect to \(U_n(1)\) to zero:

$$\begin{aligned} 2U_n(1)^T\gamma I+2(E_n(1)+\mathcal {B}_nU_n(1))^T\mathcal {P}_n\mathcal {B}_n=0. \end{aligned}$$

This leads to

$$\begin{aligned} U_n^{\star }(1)=-(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1). \end{aligned}$$
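For completeness, this stationary point is indeed a minimizer: the cost being minimized is quadratic in \(U_n(1)\), and its Hessian

$$\begin{aligned} \frac{\partial ^2}{\partial U_n(1)\,\partial U_n(1)^T}\Big (E_n(1)^TE_n(1)+\gamma U_n(1)^TU_n(1)+(E_n(1)+\mathcal {B}_nU_n(1))^T\mathcal {P}_n(E_n(1)+\mathcal {B}_nU_n(1))\Big )=2\big (\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n\big ) \end{aligned}$$

is positive definite, since \(\gamma >0\) and \(\mathcal {P}_n\) is positive definite.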

V(E(n)) can be rewritten as

$$\begin{aligned} V(E(n))= & E_n(1)^T\mathcal {P}_nE_n(1)\\= & (E_n(1)+\mathcal {B}_nU_n^{\star }(1))^T\mathcal {P}_n(E_n(1)+\mathcal {B}_nU_n^{\star }(1))\\&+E_n(1)^TE_n(1)+\gamma U_n^{\star }(1)^TU_n^{\star }(1). \end{aligned}$$

Substituting \(U_n^{\star }(1)\), it follows that

$$\begin{aligned}&E_n(1)^T\mathcal {P}_nE_n(1)\\= & E_n(1)^TE_n(1)+\gamma U_n^{\star }(1)^TU_n^{\star }(1)+(E_n(1)+\mathcal {B}_nU_n^{\star }(1))^T\mathcal {P}_n(E_n(1)+\mathcal {B}_nU_n^{\star }(1))\\= & E_n(1)^TE_n(1)+E_n(1)^T\mathcal {P}_nE_n(1)\\&+\gamma E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\&+E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\&-2E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\= & E_n(1)^TE_n(1)+E_n(1)^T\mathcal {P}_nE_n(1)\\&+E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\&-2E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\= & E_n(1)^TE_n(1)+E_n(1)^T\mathcal {P}_nE_n(1)-E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1). \end{aligned}$$

Since this must hold for all \(E_n(1)\), we have the following discrete-time algebraic Riccati equation:

$$\begin{aligned} \mathcal {P}_n=I+\mathcal {P}_n-\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_n. \end{aligned}$$

The solution \(\mathcal {P}_n\) is symmetric positive definite and stabilizing, and the update law of the online learning algorithm is given by the optimal input \(\Delta \theta (n)=U_n^{\star }(1)=\mathcal {F}_nE_n(1)=\mathcal {F}_nE(n)\), where \(\mathcal {F}_n=-(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_n\). The proof is completed. \(\square\)
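In the scalar-output setting used in Appendix B, where \(\mathcal {G}_n=\mathcal {B}_n\mathcal {B}_n^T\) is a positive scalar, this DARE collapses to the quadratic \(\mathcal {P}_n^2-\mathcal {P}_n-\gamma \mathcal {G}_n^{-1}=0\), whose positive root \(\mathcal {P}_n=\frac{1}{2}\big (1+\sqrt{1+4\gamma \mathcal {G}_n^{-1}}\big )\) is the explicit solution invoked below. The following minimal numpy sketch illustrates this; the dimension, \(\gamma\), and error value are illustrative assumptions rather than the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
D, gamma = 8, 0.5                        # illustrative feature dimension and control weight
B = rng.standard_normal((1, D))          # stands in for B_n (a 1 x D row vector)
G = float(B @ B.T)                       # G_n = B_n B_n^T, a positive scalar

# explicit scalar DARE solution: positive root of P^2 - P - gamma/G = 0
P = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * gamma / G))

# right-hand side of the DARE: I + P - P B (gamma I + B^T P B)^{-1} B^T P
inv_term = np.linalg.inv(gamma * np.eye(D) + P * (B.T @ B))
rhs = 1.0 + P - P * float(B @ inv_term @ B.T) * P
print(abs(P - rhs))                      # ~1e-15: the explicit solution satisfies the DARE

# feedback gain F_n = -(gamma I + B^T P B)^{-1} B^T P and the update Delta theta(n) = F_n E(n)
F = -inv_term @ B.T * P                  # D x 1 gain vector
E = 0.3                                  # placeholder scalar error E(n)
delta_theta = F * E
```

The explicit matrix inverse above is used only to check the DARE; by the push-through identity, \(\mathcal {F}_n=-\mathcal {B}_n^T\mathcal {P}_n/(\gamma +\mathcal {P}_n\mathcal {G}_n)\), so no matrix inversion is needed in the scalar case, consistent with the complexity reduction emphasized in the abstract.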

Appendix B

Proof

In the following, for simplicity, \(\mathcal {L}(f_n,x(n),y(n))\), \(\mathcal {L}(f^{\star },x(n),y(n))\), \(\mathcal {L}(\theta (n),x(n),y(n))\) and \(\mathcal {L}(\theta ^{\star },x(n),y(n))\) are denoted by \(\mathcal {L}_n(f_n)\), \(\mathcal {L}_n(f^{\star })\), \(\mathcal {L}_n(\theta (n))\) and \(\mathcal {L}_n(\theta ^{\star })\), respectively. Since the loss functions \(\mathcal {L}_n\) are convex, convexity implies

$$\begin{aligned} \mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\le \nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star }). \end{aligned}$$

Meanwhile, for any fixed \(\theta ^{\star }\), we have

$$\begin{aligned}&\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\\&=\quad \Vert \theta (n)-\mathcal {B}_n^T\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)-\theta ^{\star }\Vert ^2\\&=\quad \Vert \theta (n)-\theta ^{\star }\Vert ^2+\Vert \mathcal {B}_n^T\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\Vert ^2-2(\mathcal {B}_n^T\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n))^T(\theta (n)-\theta ^{\star })\\&=\quad \Vert \theta (n)-\theta ^{\star }\Vert ^2+(\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n))^T\mathcal {B}_n\mathcal {B}_n^T\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\\&\quad -2(\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n))^T\mathcal {B}_n(\theta (n)-\theta ^{\star }). \end{aligned}$$

In the setting of our proposed algorithms, \(\mathcal {G}_n=\mathcal {B}_n\mathcal {B}_n^T\), and \(\mathcal {G}_n\), \(\mathcal {P}_n\) and E(n) are scalars. Noting that

$$\begin{aligned} \mathcal {B}_n^T=\frac{\partial \mathcal {L}(\theta (n),x(n),y(n))}{\partial \theta (n)}=\nabla \mathcal {L}(\theta (n),x(n),y(n))=\nabla \mathcal {L}_n(\theta (n)), \end{aligned}$$

we have

$$\begin{aligned}&\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\\&\quad =\Vert \theta (n)-\theta ^{\star }\Vert ^2+(\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n))^T\mathcal {P}_n^{-1}E(n)-2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star })\\&\quad =\Vert \theta (n)-\theta ^{\star }\Vert ^2+\mathcal {G}_n^{-1}(\mathcal {P}_n^{-1})^2E(n)^2-2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star }). \end{aligned}$$

Rearranging, we obtain

$$\begin{aligned}&2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star })\\= & \Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2+\mathcal {G}_n^{-1}(\mathcal {P}_n^{-1})^2E(n)^2. \end{aligned}$$

Since \(\mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\le \nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star })\), it follows that

$$\begin{aligned}&\mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\\&\quad \le \frac{1}{2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)}\left( \Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\right) +\frac{1}{2}\mathcal {P}_n^{-1}E(n), \end{aligned}$$

which yields

$$\begin{aligned}&\sum \limits _{n=1}^{N}\left( \mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\right) \\&\quad \le \sum \limits _{n=1}^{N}\left( \frac{1}{2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)}\left( \Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\right) +\frac{1}{2}\mathcal {P}_n^{-1}E(n)\right) . \end{aligned}$$

Since \(\mathcal {P}_n=\frac{1}{2} \left( 1+\sqrt{1+4 \gamma \mathcal {G}_n^{-1}} \right)\) and \(\theta (n)\) and \(\theta ^{\star }\) are assumed to lie in a compact set, it is straightforward to verify that \(\mathcal {G}_n\) and E(n) are positive and bounded for all n. Thus, there exist positive constants \(c_1\) and \(c_2\) such that \(\frac{1}{2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)}\le c_1\sqrt{N}\) and \(\frac{1}{2}\mathcal {P}_n^{-1}E(n)\le c_2/\sqrt{N}\) for all n. On the other hand, since under the proposed learning law \(\theta (n)\) converges to the optimal vector \(\theta ^{\star }\), we have \(\Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2>0\). It follows that

$$\begin{aligned}&\sum \limits _{n=1}^{N}\left( \mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\right) \\\le & \sum \limits _{n=1}^{N}c_1\sqrt{N}\left( \Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\right) +\sum \limits _{n=1}^{N}\frac{1}{2}\mathcal {P}_n^{-1}E(n)\\= & c_1\sqrt{N}\left( \Vert \theta (1)-\theta ^{\star }\Vert ^2-\Vert \theta (N+1)-\theta ^{\star }\Vert ^2\right) +\sum \limits _{n=1}^{N}\frac{1}{2}\mathcal {P}_n^{-1}E(n)\\\le & c_1\sqrt{N}\left( \Vert \theta (1)-\theta ^{\star }\Vert ^2-\Vert \theta (N+1)-\theta ^{\star }\Vert ^2\right) +c_2\sqrt{N}\\\le & c_1\sqrt{N}\Vert \theta (1)-\theta ^{\star }\Vert ^2+c_2\sqrt{N}. \end{aligned}$$

Letting \(\theta (1)=0\), we have

$$\begin{aligned} \sum \limits _{n=1}^{N}\mathcal {L}_n(\theta (n))-\sum \limits _{n=1}^{N}\mathcal {L}_n(\theta ^{\star })\le & (c_1||\theta ^{\star }||^2+c_2)\sqrt{N}. \end{aligned}$$
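Dividing both sides by N makes the sublinear-regret reading of this bound explicit:

$$\begin{aligned} \frac{1}{N}\sum \limits _{n=1}^{N}\mathcal {L}_n(\theta (n))-\frac{1}{N}\sum \limits _{n=1}^{N}\mathcal {L}_n(\theta ^{\star })\le \frac{c_1\Vert \theta ^{\star }\Vert ^2+c_2}{\sqrt{N}}\rightarrow 0\quad \text {as } N\rightarrow \infty . \end{aligned}$$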

Based on Claim 1 in [58], there exists a constant \(\delta _0\) (which can be made arbitrarily small as D increases) such that, for all \(x_1,x_2\),

$$\begin{aligned} \left| \mathcal {Z}_{\varvec{u}}(x_1)^T\mathcal {Z}_{\varvec{u}}(x_2)-K_{\sigma }(x_1,x_2)\right| < \delta _0. \end{aligned}$$
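To make this bound concrete, the following minimal sketch generates random Fourier features in the spirit of [58] for a Gaussian kernel \(K_{\sigma }(x_1,x_2)=\exp (-\Vert x_1-x_2\Vert ^2/(2\sigma ^2))\); the sampling scheme is the standard one from [58], while the dimensions and \(\sigma\) are illustrative assumptions, not necessarily the exact feature map \(\mathcal {Z}_{\varvec{u}}\) used in the proposed algorithms:

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, sigma = 5, 2000, 1.0               # input dimension, number of random features, bandwidth

# random Fourier features for the Gaussian kernel (Rahimi and Recht [58])
W = rng.standard_normal((D, d)) / sigma  # rows w_i ~ N(0, sigma^{-2} I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def Z(x):
    """Feature map with Z(x1) . Z(x2) ~= exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
approx = float(Z(x1) @ Z(x2))
exact = float(np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2)))
print(abs(approx - exact))               # the gap plays the role of delta_0 and shrinks as D grows
```

The gap decays at roughly the \(O(1/\sqrt{D})\) rate, which is why \(\delta _0\) (and \(\delta\) below) can be driven to any desired level by enlarging D.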

Following the method given in [54, 56], when D is sufficiently large, it is straightforward to verify that there exist a constant \(\delta\) (also arbitrarily small as D increases) and a positive constant \(c_3\) such that

$$\begin{aligned} \left| \sum \limits _{n=1}^{N}\mathcal {L}_n(\theta ^{\star })-\sum \limits _{n=1}^{N}\mathcal {L}_n(f^{\star })\right| \le \sum \limits _{n=1}^{N}\left| \mathcal {L}_n(\theta ^{\star })-\mathcal {L}_n(f^{\star })\right| \le c_3\delta N. \end{aligned}$$

Letting \(\delta =1/\sqrt{N}\), we obtain

$$\begin{aligned} \sum \limits _{n=1}^{N}\mathcal {L}_n(\theta (n))-\sum \limits _{n=1}^{N}\mathcal {L}_n(f^{\star })\le & (c_1||\theta ^{\star }||^2+c_2+c_3)\sqrt{N}. \end{aligned}$$

The proof is completed. \(\square\)


About this article


Cite this article

Chen, L., Zhang, J. & Ning, H. Robust large-scale online kernel learning. Neural Comput & Applic 34, 15053–15073 (2022). https://doi.org/10.1007/s00521-022-07283-5

