
Paired k-NN learners with dynamically adjusted number of neighbors for classification of drifting data streams

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Over the years, several classification algorithms have been proposed in the machine learning area to address the challenges posed by the continuous arrival of data over time, formally known as data streams. The implementations of these approaches are of vital importance to the applications where they are used, and many have been modified specifically to handle concept drift, a phenomenon present in classification problems over data streams. The k-nearest neighbors (k-NN) algorithm is one of the lazy learning methods used to address this problem in online learning, but it still has room for improvement, notably in the efficient choice of the number of neighbors k used in the learning process. This article proposes paired k-NN learners with a dynamically adjusted number of neighbors (PL-kNN), a method that dynamically and incrementally adjusts the number of neighbors used by its pair of k-NN learners while learning online from data streams with concept drifts. To validate it, experiments were carried out with both artificial and real-world datasets, and the results were evaluated using accuracy, run-time, memory usage, and the Friedman statistical test with the Nemenyi post hoc test. The experimental results show that PL-kNN improves on the accuracy of k-NN with fixed values of k in most tested scenarios.
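The full specification of PL-kNN appears in the body of the article, which this page does not reproduce. As a rough illustration of the idea summarized above, the Python sketch below pairs two sliding-window k-NN learners holding different values of k, scores both on each labelled instance in test-then-train fashion, and periodically nudges the weaker learner's k toward the stronger one's. All names and the adjustment rule itself (PairedKNN, window_size, eval_every, the one-step update) are illustrative assumptions, not the authors' algorithm.

```python
from collections import Counter, deque

import numpy as np


class PairedKNN:
    """Illustrative pair of sliding-window k-NN learners
    (an assumed sketch, not the published PL-kNN algorithm)."""

    def __init__(self, window_size=1000, k_a=3, k_b=7, eval_every=200):
        self.window = deque(maxlen=window_size)  # shared instance buffer
        self.k = [k_a, k_b]                      # one k per learner
        self.hits = [0, 0]                       # correct predictions since last adjustment
        self.seen = 0
        self.eval_every = eval_every

    def _predict_with_k(self, x, k):
        """Majority vote among the k nearest stored instances."""
        if not self.window:
            return None
        x = np.asarray(x, dtype=float)
        X = np.array([xi for xi, _ in self.window])
        y = [yi for _, yi in self.window]
        nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
        return Counter(y[i] for i in nearest).most_common(1)[0][0]

    def predict(self, x):
        # The learner with the better recent record answers queries.
        best = int(self.hits[1] > self.hits[0])
        return self._predict_with_k(x, self.k[best])

    def learn_one(self, x, y):
        # Test-then-train: score both learners before storing (x, y).
        for i in (0, 1):
            if self._predict_with_k(x, self.k[i]) == y:
                self.hits[i] += 1
        self.window.append((np.asarray(x, dtype=float), y))
        self.seen += 1
        if self.seen % self.eval_every == 0:
            self._adjust_k()

    def _adjust_k(self):
        # Assumed rule: move the weaker learner's k one step toward the
        # stronger learner's, keeping the pair separated so that two
        # candidate values of k are always being explored.
        winner = int(self.hits[1] > self.hits[0])
        loser = 1 - winner
        step = 1 if self.k[winner] > self.k[loser] else -1
        self.k[loser] = max(1, self.k[loser] + step)
        if self.k[0] == self.k[1]:
            self.k[winner] += 2  # re-separate the pair
        self.hits = [0, 0]
```

Under drift, the bounded shared window ages out instances from the old concept, while the paired comparison lets k track whatever neighborhood size currently predicts best.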


Notes

  1. MOAManager is freely available at https://github.com/brunom4ciel/moamanager/.

  2. Available at http://mlkd.csd.auth.gr/datasets.html.

References

  1. Agrawal R, Imielinski T, Swami A (1993) Database mining: a performance perspective. IEEE Trans Knowl Data Eng 5(6):914–925

  2. Alberghini G, Barbon Junior S, Cano A (2022) Adaptive ensemble of self-adjusting nearest neighbor subspaces for multi-label drifting data streams. Neurocomputing 481:228–248

  3. Almeida PR, Oliveira LS, Britto AS Jr et al (2018) Adapting dynamic classifier selection for concept drift. Expert Syst Appl 104:67–85

  4. Atkeson CG, Moore AW, Schaal S (1997) Locally weighted learning. Artif Intell Rev 11(1–5):11–73

  5. Barddal JP, Gomes HM, Granatyr J et al (2016) Overcoming feature drifts via dynamic feature weighted k-nearest neighbor learning. In: Proceedings of 23rd IEEE international conference on pattern recognition (ICPR), pp 2186–2191

  6. Barros RSM, Santos SGTC (2018) A large-scale comparison of concept drift detectors. Inf Sci 451:348–370

  7. Barros RSM, Santos SGTC (2019) An overview and comprehensive comparison of ensembles for concept drift. Inf Fusion 52:213–244

  8. Barros RSM, Cabral DRL, Gonçalves PM Jr et al (2017) RDDM: reactive drift detection method. Expert Syst Appl 90:344–355

  9. Barros RSM, Hidalgo JIG, Cabral DRL (2018) Wilcoxon rank sum test drift detector. Neurocomputing 275:1954–1963

  10. Barros RSM, Santos SGTC, Barddal JP (2022) Evaluating k-NN in the classification of data streams with concept drift. arXiv preprint arXiv:2210.03119

  11. Bifet A, Holmes G, Kirkby R et al (2010) MOA: massive online analysis. J Mach Learn Res 11:1601–1604

  12. Bifet A, Gavaldà R, Holmes G et al (2018) Machine learning for data streams with practical examples in MOA. MIT Press, Cambridge

  13. Bottou L, Vapnik V (1992) Local learning algorithms. Neural Comput 4(6):888–900

  14. Brzezinski D, Stefanowski J (2013) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans Neural Netw Learn Syst 25(1):81–94

  15. Cabral DRL, Barros RSM (2018) Concept drift detection based on Fisher’s exact test. Inf Sci 442:220–234

  16. Cai YL, Ji D, Cai D (2010) A KNN research paper classification method based on shared nearest neighbor. In: Proceedings of NTCIR-8 workshop meeting, Tokyo, Japan, pp 336–340

  17. Candillier L, Lemaire V (2012) Design and analysis of the nomao challenge active learning in the real-world. In: Proceedings of the ALRA: active learning in real-world applications, workshop ECML-PKDD, pp 1–15

  18. Cortez P, Cerdeira A, Almeida F et al (2009) Modeling wine preferences by data mining from physicochemical properties. Decis Support Syst 47(4):547–553

  19. Dawid AP (1984) Present position and potential developments: some personal views: statistical theory: the prequential approach. J R Stat Soc Ser A (General) 147(2):278–292

  20. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  21. Fern X, Brodley C (2004) Cluster ensembles for high dimensional clustering: an empirical study. Tech. rep., Oregon State University. Department of Computer Science. http://hdl.handle.net/1957/35655

  22. Frías-Blanco I, Verdecia-Cabrera A, Ortiz-Díaz A et al (2016) Fast adaptive stacking of ensembles. In: Proceedings of the 31st ACM symposium on applied computing (SAC’16), Pisa, Italy, pp 929–934

  23. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  24. Gaber MM, Zaslavsky A, Krishnaswamy S (2007) A survey of classification methods in data streams. In: Aggarwal CC (ed) Data streams: advances in database systems. Springer, Boston, pp 39–59

  25. Gao J, Ding B, Fan W et al (2008) Classifying data streams with skewed class distributions and concept drifts. IEEE Internet Comput 12(6):37–49

  26. Gomes HM, Barddal JP, Enembreck F et al (2017) A survey on ensemble learning for data stream classification. ACM Comput Surv 50(2):1–36

  27. Gonçalves PM Jr, Barros RSM (2013) RCD: a recurring concept drift framework. Pattern Recogn Lett 34(9):1018–1025

  28. Hidalgo JIG, Maciel BIF, Barros RSM (2019) Experimenting with prequential variations for data stream learning evaluation. Comput Intell 35:670–692

  29. Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, New York, USA, KDD ’01, pp 97–106

  30. Ienco D, Žliobaitė I, Pfahringer B (2014) High density-focused uncertainty sampling for active learning over evolving stream data. In: Proceedings of the 3rd international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications, pp 133–148

  31. Katakis I, Tsoumakas G, Vlahavas I (2006) Dynamic feature space and incremental feature selection for the classification of textual data streams. In: Proceedings of ECML/PKDD international workshop on knowledge discovery from data streams (IWKDDS), pp 107–116

  32. Koychev I (2007) Experiments with two approaches for tracking drifting concepts. Serdica J Comput 1(1):27–44

  33. Liao Y, Vemuri V (2002) Use of k-nearest neighbor classifier for intrusion detection. Comput Secur 21(5):439–448

  34. Liu A, Lu J, Liu F et al (2018) Accumulating regional density dissimilarity for concept drift detection in data streams. Pattern Recogn 76:256–272

  35. Losing V, Hammer B, Wersing H (2016) KNN classifier with self adjusting memory for heterogeneous concept drift. In: 2016 IEEE 16th international conference on data mining (ICDM), Barcelona, Spain, pp 291–300

  36. Losing V, Hammer B, Wersing H (2018) Tackling heterogeneous concept drift with the self-adjusting memory (SAM). Knowl Inf Syst 54(1):171–201

  37. Lu N, Zhang G, Lu J (2014) Concept drift detection via competence models. Artif Intell 209:11–28

  38. Lu N, Lu J, Zhang G et al (2016) A concept drift-tolerant case-base editing technique. Artif Intell 230:108–133

  39. Maciel BIF, Santos SGTC, Barros RSM (2020) MOAManager: a tool to support data stream experiments. Softw Pract Exp 50(4):325–334

  40. Nemenyi P (1963) Distribution-free multiple comparisons. Ph.D. Thesis, Princeton University, Princeton, NJ, USA. https://books.google.com.br/books?id=nhDMtgAACAAJ

  41. Nguyen T, Czerwinski M, Lee D (1993) Compaq QuickSource: providing the consumer with the power of artificial intelligence. In: Proceedings of the fifth conference on innovative applications of artificial intelligence. AAAI Press, IAAI ’93, pp 142–151

  42. Roseberry M, Krawczyk B, Cano A (2019) Multi-label punitive kNN with self-adjusting memory for drifting data streams. ACM Trans Knowl Discov Data 13(6):1–31

  43. Salganicoff M (1997) Tolerating concept and sampling shift in lazy learning using prediction error context switching. Artif Intell Rev 11(1–5):133–155

  44. Simoudis E, Aha DW (1997) Special issue on lazy learning. Artif Intell Rev 11(1–5):7–10

  45. Srivas S, Khot PG (2019) Performance evaluation of MOA v/s KNN classification schemes: case study of major cities in the world. Int J Comput Sci Eng 7:489–495

  46. Sun Y, Dai H (2021) Constructing accuracy and diversity ensemble using pareto-based multi-objective learning for evolving data streams. Neural Comput Appl 33(11):6119–6132

  47. Sun Y, Sun Y, Dai H (2020) Two-stage cost-sensitive learning for data streams with concept drift and class imbalance. IEEE Access 8:191942–191955

  48. Sun Y, Li M, Li L et al (2021) Cost-sensitive classification for evolving data streams with concept drift and class imbalance. Comput Intell Neurosci. https://doi.org/10.1155/2021/8813806

  49. Wang X, Kuntz P, Meyer F et al (2021) Multi-label kNN classifier with online dual memory on data stream. In: 2021 international conference on data mining workshops (ICDMW), pp 405–413

  50. Wu X, Li P, Hu X (2012) Learning from concept drifting data streams with unlabeled data. Neurocomputing 92:145–155

  51. Xioufis ES, Spiliopoulou M, Tsoumakas G et al (2011) Dealing with concept drift and class imbalance in multi-label stream classification. In: Proceedings of 22nd international joint conference on artificial intelligence, Barcelona, Spain, IJCAI’11, pp 1583–1588

  52. Zhang J, Wang T, Ng WWY et al (2022) KNNENS: a k-nearest neighbor ensemble-based method for incremental learning under data stream with emerging new classes. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3149991

  53. Zhang ML, Zhou ZH (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn 40(7):2038–2048

Acknowledgements

Juan Hidalgo is a PhD student previously supported by a postgraduate grant from Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); Silas Santos is a researcher supported by postdoctoral Grant Number 88887.374884/2019-00 from CAPES; and Prof. Roberto S. M. Barros is supported by research Grant Number 310092/2019-1 from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

Author information

Authors and Affiliations

Authors

Contributions

JH and RB were responsible for conceptualization, validation, writing, reviewing, and editing; SS was involved in validation, writing, reviewing, and editing.

Corresponding author

Correspondence to Silas Garrido T. C. Santos.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

This appendix provides the run-time and memory usage tables omitted from the main body of the article; it can also be read as complementary material. In summary, PL-kNN tends to demand slightly more run-time and memory than the other methods, except for SAMkNN, which presents very high memory consumption.
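The measurements themselves were taken in MOA (see Note 1 on MOAManager). The Python sketch below only illustrates how run-time and memory of an online learner can be sampled inside a test-then-train loop; the function name evaluate_stream and the sampling interval are assumptions for illustration, not how MOA computes its figures.

```python
import time
import tracemalloc


def evaluate_stream(learner, stream, sample_every=1000):
    """Test-then-train loop that samples accuracy, run-time, and memory.

    `learner` must expose predict(x) and learn_one(x, y); `stream`
    yields (x, y) pairs. Illustrative only: the article's tables were
    produced by MOA, not by this loop.
    """
    tracemalloc.start()
    start = time.perf_counter()
    correct = total = 0
    for total, (x, y) in enumerate(stream, start=1):
        if learner.predict(x) == y:   # test on the new instance first...
            correct += 1
        learner.learn_one(x, y)       # ...then train on it
        if total % sample_every == 0:
            current, peak = tracemalloc.get_traced_memory()
            print(f"{total}: acc={correct / total:.3f}, "
                  f"mem={current} B (peak {peak} B)")
    elapsed = time.perf_counter() - start
    tracemalloc.stop()
    return (correct / total if total else 0.0), elapsed
```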

See Tables 6, 7, 8, 9, 10 and 11.

Table 6 Run-time of the methods (in percentage), using RDDM as the auxiliary detector, on artificial datasets with abrupt concept drifts, with 95% confidence intervals
Table 7 Memory usage of the methods (bytes per second, in percentage), using RDDM as the auxiliary detector, on artificial datasets with abrupt concept drifts, with 95% confidence intervals
Table 8 Run-time of the methods (in percentage), using RDDM as the auxiliary detector, on artificial datasets with gradual concept drifts, with 95% confidence intervals
Table 9 Memory usage of the methods (bytes per second, in percentage), using RDDM as the auxiliary detector, on artificial datasets with gradual concept drifts, with 95% confidence intervals
Table 10 Run-time of the methods (in percentage), using RDDM as the auxiliary detector, on real-world datasets
Table 11 Memory usage of the methods (in percentage), using RDDM as the auxiliary detector, on real-world datasets

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hidalgo, J.I.G., Santos, S.G.T.C. & de Barros, R.S.M. Paired k-NN learners with dynamically adjusted number of neighbors for classification of drifting data streams. Knowl Inf Syst 65, 1787–1816 (2023). https://doi.org/10.1007/s10115-022-01817-y
