
Robust large-scale online kernel learning

  • Original Article
  • Published in Neural Computing and Applications

Abstract

The control-based approach has proved effective for developing robust online learning methods. However, existing control-based kernel methods are infeasible for large-scale modeling because of their high computational complexity. This paper proposes a computationally efficient control-based framework for robust large-scale kernel learning. Using random feature approximation and a robust loss function, the learning problems are first transformed into a group of linear feedback control problems with sparse, large-scale discrete algebraic Riccati equations (DAREs). Then, with the solutions of the DAREs, two algorithms are developed to address large-scale binary classification and regression problems, respectively. Thanks to the sparseness, explicit rather than numerical solutions of the DAREs are derived using matrix computation techniques developed in this study. This substantially reduces the complexity and makes the proposed algorithms computationally efficient on large-scale complex datasets. Compared with existing benchmarks, the proposed algorithms achieve faster convergence and more robust and accurate modeling results. Theoretical analysis and encouraging numerical results on synthetic and real-world datasets are provided to illustrate the effectiveness and efficiency of our algorithms.


References

  1. Hoi SC, Sahoo D, Lu J, Zhao P (2021) Online learning: a comprehensive survey. Neurocomputing 459:249–289


  2. Cauwenberghs G, Poggio T (2001) Incremental and decremental support vector machine learning. In: Advances in neural information processing systems, pp. 409–415

  3. Shalev-Shwartz S, Singer Y, Srebro N, Cotter A (2011) Pegasos: primal estimated sub-gradient solver for SVM. Math Program 127(1):3–30


  4. Jun Z, Shen F, Fan H, Zhao J (2013) An online incremental learning support vector machine for large-scale data. Neural Comput Appl 22(5):1023–1035


  5. Ming L, Zhang L, Jin R, Weng S, Zhang C (2016) Online kernel learning with nearly constant support vectors. Neurocomputing 179:26–36


  6. Liu W, Principe JC, Haykin SH (2010) Kernel adaptive filtering: a comprehensive introduction, vol 1. Wiley, Hoboken


  7. Genlin J (2004) Survey on genetic algorithm. Comput Appl Softw 2(1):69–73


  8. Clerc M (2010) Particle swarm optimization, vol 93. John Wiley & Sons, New York


  9. Haug AJ (2012) Bayesian estimation and tracking: a practical guide. John Wiley & Sons, New York


  10. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

  11. Ruder S (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747

  12. Bucolo M, Cadarso VJ, Esteve J, Fortuna L, Llobera A, Sapuppo F, Schembri F (2008) A disposable micro-electro-optical interface for flow monitoring in bio-microfluidics, in: Proceedings of the 12th conference on miniaturized systems of chemistry and life science microTAS08, pp. 1579–1581

  13. Sapuppo F, Llobera A, Schembri F, Intaglietta M, Cadarso VJ, Bucolo M (2010) A polymeric micro-optical interface for flow monitoring in biomicrofluidics. Biomicrofluidics 4(2):024108


  14. Sapuppo F, Schembri F, Fortuna L, Llobera A, Bucolo M (2012) A polymeric micro-optical system for the spatial monitoring in two-phase microfluidics. Microfluid Nanofluid 12(1):165–174


  15. Tang HS, Xue ST, Chen R, Sato T (2006) Online weighted LS-SVM for hysteretic structural system identification. Eng Struct 28(12):1728–1735


  16. Ning H, Jing X, Cheng L (2011) Online identification of nonlinear spatiotemporal systems using kernel learning approach. IEEE Trans Neural Netw 22(9):1381–1394


  17. Jin SS, Jung HJ (2018) Vibration-based damage detection using online learning algorithm for output-only structural health monitoring. Struct Health Monit 17(4):727–746


  18. Taouali O, Elaissi I, Messaoud H (2012) Online identification of nonlinear system using reduced kernel principal component analysis. Neural Comput Appl 21(1):161–169


  19. Bhadriraju B, Narasingam A, Kwon JSI (2019) Machine learning-based adaptive model identification of systems: application to a chemical process. Chem Eng Res Des 152:372–383


  20. Motai Y, Siddique NA, Yoshida H (2017) Heterogeneous data analysis: online learning for medical-image-based diagnosis. Pattern Recogn 63:612–624


  21. Nguyen-Tuong D, Peters J (2012) Online kernel-based learning for task-space tracking robot control. IEEE Trans Neural Netw Learn Syst 23(9):1417–1425


  22. Laxhammar R, Falkman G (2013) Online learning and sequential anomaly detection in trajectories. IEEE Trans Pattern Anal Mach Intell 36(6):1158–1173


  23. Fan H, Song Q, Shrestha SB (2016) Kernel online learning with adaptive kernel width. Neurocomputing 175:233–242


  24. Chen B, Liang J, Zheng N, Principe JC (2016) Kernel least mean square with adaptive kernel size. Neurocomputing 191:95–106


  25. Sahoo D, Hoi SCH, Li B (2014) Online multiple kernel regression. In: Proc 20th ACM SIGKDD Int Conf Knowl Discovery Data Mining, pp. 293–302

  26. Hoi SCH, Jin R, Zhao P, Yang T (2013) Online multiple kernel classification. Mach Learn 90(2):289–316


  27. Fiat A, Woeginger GJ (1998) Online algorithms: the state of the art, vol 1442. Springer, Cham


  28. Ma J, Saul LK, Savage S, Voelker GM (2009) Identifying suspicious urls: an application of large-scale online learning. In: Proceedings of the 26th annual international conference on machine learning, pp. 681–688

  29. Li B, Hoi SC, Sahoo D, Liu Z (2015) Moving average reversion strategy for on-line portfolio selection. Artif Intell 222:104–123


  30. Kurt MN, Yilmaz Y, Wang X. Real-time nonparametric anomaly detection in high-dimensional settings. IEEE Trans Pattern Anal Mach Intell

  31. Kivinen J, Smola AJ, Williamson RC (2004) Online learning with kernels. IEEE Trans Signal Process 52(8):2165–2176


  32. Liu W, Pokharel PP, Principe JC (2008) The kernel least-mean-square algorithm. IEEE Trans Signal Process 56(2):543–554


  33. Lu J, Sahoo D, Zhao P, Hoi SC (2018) Sparse passive-aggressive learning for bounded online kernel methods. ACM Trans Intell Syst Technol (TIST) 9(4):1–27


  34. Wang Z, Crammer K, Vucetic S (2012) Breaking the curse of Kernelization: budgeted stochastic gradient descent for large-scale SVM training. J Mach Learn Res 13(1):3103–3131


  35. Engel Y, Mannor S, Meir R (2004) The kernel recursive least-squares algorithm. IEEE Trans Signal Process 52(8):2275–2285


  36. Le T, Nguyen TD, Nguyen V, Phung D (2017) Approximation vector machines for large-scale online learning. J Mach Learn Res 18(1):3962–4016


  37. Fan H, Song Q, Shrestha SB (2016) Kernel online learning with adaptive Kernel width. Neurocomputing 175:233–242


  38. Lu J, Hoi SC, Wang J, Zhao P, Liu Z-Y (2016) Large-scale online kernel learning. J Mach Learn Res 17(1):1613–1655


  39. De Brabanter K, De Brabanter J, Suykens JA, De Moor B (2011) Kernel regression in the presence of correlated errors. J Mach Learn Res 12(6):1955–1976


  40. Espinoza M, Suykens JA, De Moor B (2006) LS-SVM regression with autocorrelated errors. IFAC Proc Vol 39(1):582–587


  41. Jing X (2012) Robust adaptive learning of feedforward neural networks via LMI optimizations. Neural Netw 31:33–45


  42. Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper Res 68(1):276–294


  43. Ning H, Zhang J, Feng T-T, Chu EK-W, Tian T (2020) Control-based algorithms for high dimensional online learning. J Franklin Inst 357(3):1909–1942


  44. Zhang J, Ning H, Jing X, Tian T (2021) Online kernel learning with adaptive bandwidth by optimal control approach. IEEE Trans Neural Netw Learn Syst 32(5):1920–1934


  45. Ning H, Zhang J, Jing X, Tian T (2019) Robust online learning method based on dynamical linear quadratic regulator. IEEE Access 7:117780–117795


  46. Jing X, Cheng L (2012) An optimal PID control algorithm for training feedforward neural networks. IEEE Trans Ind Electron 60(6):2273–2283


  47. An W, Wang H, Sun Q, Xu J, Dai Q, Zhang L (2018) A PID controller approach for stochastic optimization of deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8522–8531

  48. Jing X (2011) An \(H_{\infty }\) control approach to robust learning of feedforward neural networks. Neural Netw 24(7):759–766


  49. Ning H, Qing G, Tian T, Jing X (2019) Online identification of nonlinear stochastic spatiotemporal system with multiplicative noise by robust optimal control-based kernel learning method. IEEE Tran Neural Netw Learn Syst 30(2):389–404


  50. Zhang J, Ning H (2020) Online kernel classification with adjustable bandwidth using control-based learning approach. Pattern Recogn 108:107566


  51. Ning H, Li Z (2018) An adaptive online machine learning method based on a robust optimal control approach. SCIENTIA SINICA Math 48(9):1181–1202


  52. Li T, Chu EK-W, Kuo Y-C, Lin W-W (2013) Solving large-scale nonsymmetric algebraic Riccati equations by doubling. SIAM J Matrix Anal Appl 34(3):1129–1147


  53. Li T, Chu EK-W, Lin W-W, Weng PC-Y (2013) Solving large-scale continuous-time algebraic Riccati equations by doubling. J Comput Appl Math 237(1):373–383


  54. Hoi SC, Wang J, Zhao P, Zhuang J, Liu Z (2013) Large-scale online kernel classification. In: IJCAI

  55. Nguyen TD, Le T, Bui H, Phung DQ (2017) Large-scale online Kernel learning with random feature reparameterization. In: IJCAI, pp. 2543–2549

  56. Shen Y, Chen T, Giannakis GB (2019) Random feature-based online multi-kernel learning in environments with unknown dynamics. J Mach Learn Res 20(1):773–808


  57. Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature maps. IEEE Trans Pattern Anal Mach Intell 34(3):480–492


  58. Rahimi A, Recht B (2008) Random features for large-scale kernel machines. In: Advances in neural information processing systems, pp. 1177–1184

  59. Kwon WH, Han SH (2006) Receding horizon control: model predictive control for state models. Springer Science & Business Media, Cham


  60. Camacho EF, Alba CB (2013) Model predictive control. Springer Science & Business Media, Cham


  61. Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296


  62. Cavallanti G, Cesa-Bianchi N, Gentile C (2007) Tracking the best hyperplane with a simple budget perceptron. Mach Learn 69(2):143–167


  63. Dekel O, Shalev-Shwartz S, Singer Y. The forgetron: a kernel-based perceptron on a fixed budget

  64. Orabona F, Keshet J, Caputo B (2009) Bounded Kernel-based online learning. J Mach Learn Res 10(11):2643–2666


  65. Zhao P, Wang J, Wu P, Jin R, Hoi SC (2012) Fast bounded online gradient descent algorithms for scalable kernel-based online learning. arXiv preprint arXiv:1206.4633

  66. Tüfekci P (2014) Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int J Electr Power Energy Syst 60:126–140


  67. Tso GK, Yau KK (2007) Predicting electricity energy consumption: a comparison of regression analysis, decision tree and neural networks. Energy 32(9):1761–1768


  68. Che J, Wang J, Wang G (2012) An adaptive fuzzy combination model based on self-organizing map and support vector regression for electric load forecasting. Energy 37(1):657–664


  69. Liu Y, Wang H, Jiang Y, Li P (2010) Selective recursive kernel learning for online identification of nonlinear systems with NARX form. J Process Control 20:181–194


  70. Philip R (2015) Essential statistics for the pharmaceutical sciences. John Wiley & Sons, New York


  71. Leopold S (2012) Introduction to mathematical statistics, vol 202. Springer Science & Business Media, Cham


  72. Zhang J, Li Z, Song X, Ning H (2021) Deep tobit networks: a novel machine learning approach to microeconometrics. Neural Netw 144:279–296



Acknowledgements

This work was supported in part by the National Social Science Foundation of China under Project 19BTJ025, the Fundamental Research Funds for the Central Universities under Project 2722022BY020, and the Innovation and Talent Base for Digital Technology and Finance (B21038).

Author information


Corresponding author

Correspondence to Hanwen Ning.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Proof

From standard optimal control theory [60], for any given n, V(E(n)) is quadratic; that is, there exists a symmetric positive definite matrix \(\mathcal {P}_n\) such that \(V(E(n))=E(n)^T\mathcal {P}_nE(n)=E_n(1)^T\mathcal {P}_nE_n(1)\). The associated Hamilton–Jacobi equation is

$$\begin{aligned} V(E(n))= & \min _{U_n(1),\ldots ,U_n(N)}\sum \limits _{t=1}^{\infty }E_n(t)^TE_n(t)+\gamma U_n(t)^TU_n(t)\\= & \min \limits _{U_n(1)}(E_n(1)^TE_n(1)+\gamma U_n(1)^TU_n(1)+V(E_n(1)+\mathcal {B}_nU_n(1))). \end{aligned}$$

It follows that

$$\begin{aligned} V(E(n))= & \min \limits _{U_n(1)}(E_n(1)^TE_n(1)+\gamma U_n(1)^TU_n(1) \\&+(E_n(1)+\mathcal {B}_nU_n(1))^T\mathcal {P}_n(E_n(1)+\mathcal {B}_nU_n(1))). \end{aligned}$$

To minimize V(E(n)), we set the partial derivative with respect to \(U_n(1)\) to zero:

$$\begin{aligned} 2U_n(1)^T\gamma I+2(E_n(1)+\mathcal {B}_nU_n(1))^T\mathcal {P}_n\mathcal {B}_n=0. \end{aligned}$$

This leads to

$$\begin{aligned} U_n^{\star }(1)=-(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1). \end{aligned}$$
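For completeness, this stationary point is indeed a minimizer: the cost being minimized is quadratic in \(U_n(1)\), and its Hessian

$$\begin{aligned} \frac{\partial ^2}{\partial U_n(1)\,\partial U_n(1)^T}\Big (E_n(1)^TE_n(1)+\gamma U_n(1)^TU_n(1)+(E_n(1)+\mathcal {B}_nU_n(1))^T\mathcal {P}_n(E_n(1)+\mathcal {B}_nU_n(1))\Big )=2\big (\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n\big ) \end{aligned}$$

is positive definite, since \(\gamma >0\) and \(\mathcal {P}_n\) is positive definite.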

V(E(n)) can be rewritten as

$$\begin{aligned} V(E(n))= & E_n(1)^T\mathcal {P}_nE_n(1)\\= & (E_n(1)+\mathcal {B}_nU_n^{\star }(1))^T\mathcal {P}_n(E_n(1)+\mathcal {B}_nU_n^{\star }(1))\\&+E_n(1)^TE_n(1)+\gamma U_n^{\star }(1)^TU_n^{\star }(1). \end{aligned}$$

Substituting \(U_n^{\star }(1)\), it follows that

$$\begin{aligned}&E_n(1)^T\mathcal {P}_nE_n(1)\\= & E_n(1)^TE_n(1)+\gamma U_n^{\star }(1)^TU_n^{\star }(1)+(E_n(1)+\mathcal {B}_nU_n^{\star }(1))^T\mathcal {P}_n(E_n(1)+\mathcal {B}_nU_n^{\star }(1))\\= & E_n(1)^TE_n(1)+E_n(1)^T\mathcal {P}_nE_n(1)\\&+\gamma E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\&+E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\&-2E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\= & E_n(1)^TE_n(1)+E_n(1)^T\mathcal {P}_nE_n(1)\\&+E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\&-2E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1)\\= & E_n(1)^TE_n(1)+E_n(1)^T\mathcal {P}_nE_n(1)-E_n(1)^T\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_nE_n(1). \end{aligned}$$

Since this must hold for all \(E_n(1)\), we have the following discrete-time algebraic Riccati equation:

$$\begin{aligned} \mathcal {P}_n=I+\mathcal {P}_n-\mathcal {P}_n\mathcal {B}_n(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_n. \end{aligned}$$

The solution \(\mathcal {P}_n\) is symmetric positive definite and stabilizing, and the update law of the online learning algorithm is given by the optimal input \(\Delta \theta (n)=U_n^{\star }(1)=\mathcal {F}_nE_n(1)=\mathcal {F}_nE(n)\), where \(\mathcal {F}_n=-(\gamma I+\mathcal {B}_n^T\mathcal {P}_n\mathcal {B}_n)^{-1}\mathcal {B}_n^T\mathcal {P}_n\). The proof is completed. \(\square\)
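In the scalar-output setting used in Appendix B, where \(\mathcal {G}_n=\mathcal {B}_n\mathcal {B}_n^T\) is a positive scalar, this DARE collapses to the quadratic \(\mathcal {P}_n^2-\mathcal {P}_n-\gamma \mathcal {G}_n^{-1}=0\), whose positive root \(\mathcal {P}_n=\frac{1}{2}\big (1+\sqrt{1+4\gamma \mathcal {G}_n^{-1}}\big )\) is the explicit solution invoked below. The following minimal numpy sketch illustrates this; the dimension, \(\gamma\), and error value are illustrative assumptions rather than the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
D, gamma = 8, 0.5                        # illustrative feature dimension and control weight
B = rng.standard_normal((1, D))          # stands in for B_n (a 1 x D row vector)
G = float(B @ B.T)                       # G_n = B_n B_n^T, a positive scalar

# explicit scalar DARE solution: positive root of P^2 - P - gamma/G = 0
P = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * gamma / G))

# right-hand side of the DARE: I + P - P B (gamma I + B^T P B)^{-1} B^T P
inv_term = np.linalg.inv(gamma * np.eye(D) + P * (B.T @ B))
rhs = 1.0 + P - P * float(B @ inv_term @ B.T) * P
print(abs(P - rhs))                      # ~1e-15: the explicit solution satisfies the DARE

# feedback gain F_n = -(gamma I + B^T P B)^{-1} B^T P and the update Delta theta(n) = F_n E(n)
F = -inv_term @ B.T * P                  # D x 1 gain vector
E = 0.3                                  # placeholder scalar error E(n)
delta_theta = F * E
```

The explicit matrix inverse above is used only to check the DARE; by the push-through identity, \(\mathcal {F}_n=-\mathcal {B}_n^T\mathcal {P}_n/(\gamma +\mathcal {P}_n\mathcal {G}_n)\), so no matrix inversion is needed in the scalar case, consistent with the complexity reduction emphasized in the abstract.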

Appendix B

Proof

In the following, for simplicity, \(\mathcal {L}(f_n,x(n),y(n))\), \(\mathcal {L}(f^{\star },x(n),y(n))\), \(\mathcal {L}(\theta (n),x(n),y(n))\) and \(\mathcal {L}(\theta ^{\star },x(n),y(n))\) are denoted by \(\mathcal {L}_n(f_n)\), \(\mathcal {L}_n(f^{\star })\), \(\mathcal {L}_n(\theta (n))\) and \(\mathcal {L}_n(\theta ^{\star })\), respectively. Since the loss functions \(\mathcal {L}_n\) are convex, convexity implies

$$\begin{aligned} \mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\le \nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star }). \end{aligned}$$

Meanwhile, for any fixed \(\theta ^{\star }\), we have

$$\begin{aligned}&\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\\&=\quad \Vert \theta (n)-\mathcal {B}_n^T\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)-\theta ^{\star }\Vert ^2\\&=\quad \Vert \theta (n)-\theta ^{\star }\Vert ^2+\Vert \mathcal {B}_n^T\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\Vert ^2-2(\mathcal {B}_n^T\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n))^T(\theta (n)-\theta ^{\star })\\&=\quad \Vert \theta (n)-\theta ^{\star }\Vert ^2+(\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n))^T\mathcal {B}_n\mathcal {B}_n^T\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\\&\quad -2(\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n))^T\mathcal {B}_n(\theta (n)-\theta ^{\star }). \end{aligned}$$

In the setting of our proposed algorithms, \(\mathcal {G}_n=\mathcal {B}_n\mathcal {B}_n^T\), and \(\mathcal {G}_n\), \(\mathcal {P}_n\) and E(n) are scalars. Noting that

$$\begin{aligned} \mathcal {B}_n^T=\frac{\partial \mathcal {L}(\theta (n),x(n),y(n))}{\partial \theta (n)}=\nabla \mathcal {L}(\theta (n),x(n),y(n))=\nabla \mathcal {L}_n(\theta (n)), \end{aligned}$$

we have

$$\begin{aligned}&\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\\&\quad =\Vert \theta (n)-\theta ^{\star }\Vert ^2+(\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n))^T\mathcal {P}_n^{-1}E(n)-2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star })\\&\quad =\Vert \theta (n)-\theta ^{\star }\Vert ^2+\mathcal {G}_n^{-1}(\mathcal {P}_n^{-1})^2E(n)^2-2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star }). \end{aligned}$$

Rearranging, we obtain

$$\begin{aligned}&2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)\nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star })\\= & \Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2+\mathcal {G}_n^{-1}(\mathcal {P}_n^{-1})^2E(n)^2. \end{aligned}$$

Since \(\mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\le \nabla \mathcal {L}_n(\theta (n))^T(\theta (n)-\theta ^{\star })\), it follows that

$$\begin{aligned}&\mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\\&\quad \le \frac{1}{2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)}\left( \Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\right) +\frac{1}{2}\mathcal {P}_n^{-1}E(n), \end{aligned}$$

which yields

$$\begin{aligned}&\sum \limits _{n=1}^{N}\left( \mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\right) \\&\quad \le \sum \limits _{n=1}^{N}\left( \frac{1}{2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)}\left( \Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\right) +\frac{1}{2}\mathcal {P}_n^{-1}E(n)\right) . \end{aligned}$$

Since \(\mathcal {P}_n=\frac{1}{2} \left( 1+\sqrt{1+4 \gamma \mathcal {G}_n^{-1}} \right)\) and \(\theta (n)\) and \(\theta ^{\star }\) are assumed to lie in a compact set, it is straightforward to verify that \(\mathcal {G}_n\) and E(n) are positive and bounded for all n. Thus, there exist positive constants \(c_1\) and \(c_2\) such that \(\frac{1}{2\mathcal {G}_n^{-1}\mathcal {P}_n^{-1}E(n)}\le c_1\sqrt{N}\) and \(\frac{1}{2}\mathcal {P}_n^{-1}E(n)\le c_2/\sqrt{N}\) for all n. On the other hand, since under the proposed learning law \(\theta (n)\) converges to the optimal vector \(\theta ^{\star }\), we have \(\Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2>0\). It follows that

$$\begin{aligned}&\sum \limits _{n=1}^{N}\left( \mathcal {L}_n(\theta (n))-\mathcal {L}_n(\theta ^{\star })\right) \\\le & \sum \limits _{n=1}^{N}c_1\sqrt{N}\left( \Vert \theta (n)-\theta ^{\star }\Vert ^2-\Vert \theta (n+1)-\theta ^{\star }\Vert ^2\right) +\sum \limits _{n=1}^{N}\frac{1}{2}\mathcal {P}_n^{-1}E(n)\\= & c_1\sqrt{N}\left( \Vert \theta (1)-\theta ^{\star }\Vert ^2-\Vert \theta (N+1)-\theta ^{\star }\Vert ^2\right) +\sum \limits _{n=1}^{N}\frac{1}{2}\mathcal {P}_n^{-1}E(n)\\\le & c_1\sqrt{N}\left( \Vert \theta (1)-\theta ^{\star }\Vert ^2-\Vert \theta (N+1)-\theta ^{\star }\Vert ^2\right) +c_2\sqrt{N}\\\le & c_1\sqrt{N}\Vert \theta (1)-\theta ^{\star }\Vert ^2+c_2\sqrt{N}. \end{aligned}$$

Letting \(\theta (1)=0\), we have

$$\begin{aligned} \sum \limits _{n=1}^{N}\mathcal {L}_n(\theta (n))-\sum \limits _{n=1}^{N}\mathcal {L}_n(\theta ^{\star })\le & (c_1||\theta ^{\star }||^2+c_2)\sqrt{N}. \end{aligned}$$
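Dividing both sides by N makes the sublinear-regret reading of this bound explicit:

$$\begin{aligned} \frac{1}{N}\sum \limits _{n=1}^{N}\mathcal {L}_n(\theta (n))-\frac{1}{N}\sum \limits _{n=1}^{N}\mathcal {L}_n(\theta ^{\star })\le \frac{c_1\Vert \theta ^{\star }\Vert ^2+c_2}{\sqrt{N}}\rightarrow 0\quad \text {as } N\rightarrow \infty . \end{aligned}$$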

Based on Claim 1 in [58], there exists a constant \(\delta _0\) (which can be made arbitrarily small as D increases) such that, for all \(x_1,x_2\),

$$\begin{aligned} \left| \mathcal {Z}_{\varvec{u}}(x_1)^T\mathcal {Z}_{\varvec{u}}(x_2)-K_{\sigma }(x_1,x_2)\right| < \delta _0. \end{aligned}$$
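To make this bound concrete, the following minimal sketch generates random Fourier features in the spirit of [58] for a Gaussian kernel \(K_{\sigma }(x_1,x_2)=\exp (-\Vert x_1-x_2\Vert ^2/(2\sigma ^2))\); the sampling scheme is the standard one from [58], while the dimensions and \(\sigma\) are illustrative assumptions, not necessarily the exact feature map \(\mathcal {Z}_{\varvec{u}}\) used in the proposed algorithms:

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, sigma = 5, 2000, 1.0               # input dimension, number of random features, bandwidth

# random Fourier features for the Gaussian kernel (Rahimi and Recht [58])
W = rng.standard_normal((D, d)) / sigma  # rows w_i ~ N(0, sigma^{-2} I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def Z(x):
    """Feature map with Z(x1) . Z(x2) ~= exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
approx = float(Z(x1) @ Z(x2))
exact = float(np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2)))
print(abs(approx - exact))               # the gap plays the role of delta_0 and shrinks as D grows
```

The gap decays at roughly the \(O(1/\sqrt{D})\) rate, which is why \(\delta _0\) (and \(\delta\) below) can be driven to any desired level by enlarging D.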

Following the method given in [54, 56], when D is sufficiently large, it is straightforward to verify that there exist a constant \(\delta\) (also arbitrarily small as D increases) and a positive constant \(c_3\) such that

$$\begin{aligned} \left| \sum \limits _{n=1}^{N}\mathcal {L}_n(\theta ^{\star })-\sum \limits _{n=1}^{N}\mathcal {L}_n(f^{\star })\right| \le \sum \limits _{n=1}^{N}\left| \mathcal {L}_n(\theta ^{\star })-\mathcal {L}_n(f^{\star })\right| \le c_3\delta N. \end{aligned}$$

Letting \(\delta =1/\sqrt{N}\), we obtain

$$\begin{aligned} \sum \limits _{n=1}^{N}\mathcal {L}_n(\theta (n))-\sum \limits _{n=1}^{N}\mathcal {L}_n(f^{\star })\le & (c_1||\theta ^{\star }||^2+c_2+c_3)\sqrt{N}. \end{aligned}$$

The proof is completed. \(\square\)


About this article


Cite this article

Chen, L., Zhang, J. & Ning, H. Robust large-scale online kernel learning. Neural Comput & Applic 34, 15053–15073 (2022). https://doi.org/10.1007/s00521-022-07283-5

