
A Sparse Online Approach for Streaming Data Classification via Prototype-Based Kernel Models


Abstract

Processing big data streams with machine learning algorithms poses several challenges, such as limited time to train the models, hardware memory constraints, and concept drift. In this paper, we show that prototype-based kernel classifiers designed by sparsification procedures, such as the approximate linear dependence (ALD) method, provide an adequate tradeoff between the accuracy and the size complexity of kernelized nearest neighbor classifiers. The proposed approach automatically selects relevant samples from the training data stream to form a sparse dictionary of prototypes, which are then used in kernelized distance metrics to classify arriving samples on the fly. Additionally, the proposed method is fully adaptive, in the sense that it updates and removes prototypes from the dictionary, enabling it to learn continuously in nonstationary environments. The results obtained from a comprehensive set of computer simulations involving artificial and real streaming data sets indicate that the proposed algorithm builds models with low complexity and competitive classification error rates compared to the state of the art.
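To make the general idea concrete, the sketch below (Python/NumPy) implements a minimal prototype dictionary grown by an ALD-style novelty test and used for kernelized nearest-prototype classification. It is an illustrative approximation, not the paper's exact algorithm: the class name, the fixed sparsity threshold `nu`, the Gaussian kernel width `gamma`, and the naive recomputation of the kernel matrix inverse are assumptions made for brevity; the reference implementation is available in the authors' repository (footnote 1).

```python
import numpy as np


def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel k(a, b) = exp(-||a - b||^2 / (2 * gamma^2))."""
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2.0 * gamma**2))


class SparseKernelPrototypeClassifier:
    """Illustrative online prototype classifier with an ALD-style novelty test."""

    def __init__(self, nu=0.1, gamma=1.0):
        self.nu = nu        # ALD sparsity threshold (larger -> sparser dictionary)
        self.gamma = gamma  # kernel width
        self.D = []         # dictionary of prototype vectors
        self.labels = []    # class label of each prototype
        self.Kinv = None    # inverse of the dictionary kernel matrix

    def predict(self, x):
        """Label of the prototype with the smallest kernelized distance to x."""
        kxx = gaussian_kernel(x, x, self.gamma)
        dists = [kxx - 2.0 * gaussian_kernel(d, x, self.gamma)
                 + gaussian_kernel(d, d, self.gamma) for d in self.D]
        return self.labels[int(np.argmin(dists))]

    def partial_fit(self, x, label):
        """Add x to the dictionary only if it fails the ALD reconstruction test."""
        kxx = gaussian_kernel(x, x, self.gamma)
        if not self.D:
            self.D, self.labels = [np.asarray(x)], [label]
            self.Kinv = np.array([[1.0 / kxx]])
            return
        k = np.array([gaussian_kernel(d, x, self.gamma) for d in self.D])
        a = self.Kinv @ k    # best reconstruction coefficients in feature space
        delta = kxx - k @ a  # squared residual of the reconstruction
        if delta > self.nu:  # sample is "novel enough": grow the dictionary
            self.D.append(np.asarray(x))
            self.labels.append(label)
            K = np.array([[gaussian_kernel(p, q, self.gamma) for q in self.D]
                          for p in self.D])
            self.Kinv = np.linalg.inv(K + 1e-8 * np.eye(len(self.D)))
        # otherwise the sample is discarded (the paper also updates/prunes prototypes)
```

In a streaming setting the model would typically be evaluated prequentially (test-then-train): predict the label of each arriving sample first and then call partial_fit on it.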



Notes

  1. https://github.com/davidcoelho89/spok-nn

  2. Sometimes normalized by the squared norm of the difference vector, i.e., \(\Vert \mathbf {w}_{i^*}(t)-\mathbf {x}(t)\Vert ^2\).

  3. A binary vector of length \(m_{t-1}\) with the k-th element set to 1 and all other elements set to zero.

  4. https://github.com/vlosing/SAMkNN


Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES, Finance Code 001) and by the Brazilian National Research Council (CNPq) via grant no. 309379/2019-9.

Author information


Corresponding author

Correspondence to Guilherme A. Barreto.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix - Kernelized Distance Calculation

This section demonstrates that kernel functions based on the squared Euclidean distance \(\left\| \mathbf{x} - \mathbf{w}_i \right\| _{2}^{2}\), such as the widely used Gaussian radial basis function (RBF), lead to kernelized distances that reduce to monotonically increasing functions of the squared Euclidean distance. Moreover, the gradients of these cost functions are proportional to either the difference between the argument vectors \((\mathbf{w}_i - \mathbf{x})\) or a normalized version of this difference.
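For reference, the quantity expanded below is the kernelized squared distance between \(\mathbf{x}\) and the prototype \(\mathbf{w}_i\) in the feature space induced by the kernel \(k(\cdot ,\cdot )\); this restates the form instantiated in Eqs. (37) and (41) (the definitions of Eqs. (9) and (11) appear in the main text):

$$\begin{aligned} J_i(\mathbf{x}) = \left\| \varphi (\mathbf{x}) - \varphi (\mathbf{w}_i) \right\| ^2 = k(\mathbf{x},\mathbf{x}) - 2k(\mathbf{w}_i,\mathbf{x}) + k(\mathbf{w}_i,\mathbf{w}_i), \end{aligned}$$

where \(\varphi (\cdot )\) denotes the (implicit) feature map associated with \(k\).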

First, considering the linear kernel \(k(\mathbf {u},\mathbf {v})=\mathbf {u}^T\mathbf {v}\) and Eqs. (9) and (11), the cost function is

$$\begin{aligned} J_i(\mathbf{x})&= \mathbf{x}^{T}\mathbf{x} - 2\mathbf{w}_{i}^{T}\mathbf{x} + \mathbf{w}_i^{T}\mathbf{w}_i, \end{aligned}$$
(33)
$$\begin{aligned} &= (\mathbf{x} - \mathbf{w}_i)^T(\mathbf{x} - \mathbf{w}_i), \end{aligned}$$
(34)
$$\begin{aligned} &= \left\| \mathbf{x} - \mathbf{w}_i \right\| _{2}^{2}, \end{aligned}$$
(35)

whose gradient vector is given by

$$\begin{aligned} \nabla J_i(\mathbf{x}) = 2(\mathbf{w}_{i} - \mathbf{x}). \end{aligned}$$
(36)

Now, considering the Gaussian kernel, the cost function and its gradient vector are given by

$$\begin{aligned} J_i\left( \mathbf{x}\right)&= \exp \left( - \frac{\left\| \mathbf{x}-\mathbf{x}\right\| _{2}^{2}}{2\gamma ^2}\right) - 2\exp \left( - \frac{\left\| \mathbf{w}_i-\mathbf{x}\right\| _{2}^{2}}{2\gamma ^2}\right) + \exp \left( - \frac{\left\| \mathbf{w}_i-\mathbf{w}_i\right\| _{2}^{2}}{2\gamma ^2}\right) \end{aligned}$$
(37)
$$\begin{aligned} J_i\left( \mathbf{x}\right)&= 2 - 2\exp \left( - \frac{\left\| \mathbf{w}_i-\mathbf{x}\right\| _{2}^{2}}{2\gamma ^2}\right). \end{aligned}$$
(38)
$$\begin{aligned} \nabla J_{i}(\mathbf{x})&= - 2\exp \left( - \frac{\left\| \mathbf{w}_i-\mathbf{x}\right\| _{2}^{2}}{2\gamma ^2}\right) \left( - \frac{1}{2\gamma ^2}\right) 2(\mathbf{w}_{i} - \mathbf{x})\nonumber \\ \nabla J_{i}(\mathbf{x})&= \left( \frac{2}{\gamma ^2}\right) \exp \left( - \frac{\left\| \mathbf{w}_i-\mathbf{x}\right\| _{2}^{2}}{2\gamma ^2}\right) (\mathbf{w}_{i} - \mathbf{x}). \end{aligned}$$
(39)

Without loss of generality, we can set \(\gamma =\frac{1}{\sqrt{2}}\). Thus, Eq. (39) can be rewritten as

$$\begin{aligned} \nabla J_{i}(\mathbf{x}) = \left[ \frac{4}{\exp \left( \left\| \mathbf{w}_i-\mathbf{x}\right\| _{2}^{2}\right) }\right] (\mathbf{w}_{i} - \mathbf{x}). \end{aligned}$$
(40)

Finally, the Cauchy kernel function leads to the following cost function and gradient vector, respectively:

$$\begin{aligned} J_i(\mathbf{x})&= \left( 1 + \frac{\left\| \mathbf{x} - \mathbf{x}\right\| ^2}{\gamma ^2}\right) ^{-1} - 2\left( 1 + \frac{\left\| \mathbf{w}_i - \mathbf{x}\right\| ^2}{\gamma ^2}\right) ^{-1} + \left( 1 + \frac{\left\| \mathbf{w}_i - \mathbf{w}_i\right\| ^2}{\gamma ^2}\right) ^{-1} \end{aligned}$$
(41)
$$\begin{aligned} &= 2 - 2\left( 1 + \frac{\left\| \mathbf{w}_i - \mathbf{x}\right\| ^2}{\gamma ^2}\right) ^{-1} = 2 - 2\left( \frac{\gamma ^2}{\gamma ^2 + \left\| \mathbf{w}_i - \mathbf{x}\right\| ^2}\right), \end{aligned}$$
(42)

and

$$\begin{aligned} \nabla J_{i}(\mathbf{x})&= - 2\gamma ^2\left[ -\frac{1}{\left( \gamma ^2 + \left\| \mathbf{w}_i - \mathbf{x}\right\| ^2\right) ^2}\right] 2\left( \mathbf{w}_i - \mathbf{x}\right) \nonumber \\ &= \left[ \frac{4\gamma ^2}{\left( \gamma ^2 + \left\| \mathbf{w}_i - \mathbf{x}\right\| ^2\right) ^2}\right] \left( \mathbf{w}_i - \mathbf{x}\right). \end{aligned}$$
(43)

As one can infer from Eqs. (35), (38) and (42), the minimum of these cost functions \(J_i\left( \mathbf{x}\right)\) occurs when the squared Euclidean distance \(\left\| \mathbf{w}_i-\mathbf{x}\right\| _{2}^{2}\) is minimal. Furthermore, for fixed hyperparameters, the gradient vectors resulting from these cost functions are proportional to \(\left( \mathbf{w}_i - \mathbf{x}\right)\), differing only by factors that depend on \(\left\| \mathbf{w}_i-\mathbf{x}\right\| _{2}^{2}\). For a given iteration, these factors are constant. This feature motivated us to use a common learning rule in Eq. (20) to update the prototypes whenever one of these kernel functions is used.
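As a quick numerical sanity check of the expressions above, the short Python/NumPy sketch below evaluates the three cost functions of Eqs. (35), (38) and (42) and the corresponding gradient vectors of Eqs. (36), (40) and (43) for a random pair \((\mathbf{x}, \mathbf{w}_i)\), and verifies that the gradients are positive scalar multiples of each other (the variable names and the choice \(\gamma = 1.5\) for the Cauchy kernel are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
x, w = rng.normal(size=3), rng.normal(size=3)
d2 = float(np.sum((w - x) ** 2))   # squared Euclidean distance ||w - x||^2

# Cost functions: Eq. (35) linear, Eq. (38) Gaussian with gamma = 1/sqrt(2),
# Eq. (42) Cauchy with an arbitrary gamma.
gamma_c = 1.5
J_lin = d2
J_gauss = 2.0 - 2.0 * np.exp(-d2)
J_cauchy = 2.0 - 2.0 * gamma_c**2 / (gamma_c**2 + d2)

# Gradient vectors: Eqs. (36), (40) and (43) -- all proportional to (w - x).
g_lin = 2.0 * (w - x)
g_gauss = (4.0 / np.exp(d2)) * (w - x)
g_cauchy = (4.0 * gamma_c**2 / (gamma_c**2 + d2) ** 2) * (w - x)

# Same direction: each gradient is a positive multiple of the others.
unit = lambda v: v / np.linalg.norm(v)
assert np.allclose(unit(g_lin), unit(g_gauss))
assert np.allclose(unit(g_lin), unit(g_cauchy))
print(J_lin, J_gauss, J_cauchy)
```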


About this article


Cite this article

Coelho, D.N., Barreto, G.A. A Sparse Online Approach for Streaming Data Classification via Prototype-Based Kernel Models. Neural Process Lett 54, 1679–1706 (2022). https://doi.org/10.1007/s11063-021-10701-9

