Abstract
Over the years, several classification algorithms have been proposed in the machine learning field to address the challenges posed by data that arrive continuously over time, formally known as data streams. The implementations of these approaches are of vital importance for the applications in which they are used, and many have been modified specifically to address concept drift, a phenomenon inherent to classification problems over data streams. The k-nearest neighbors (k-NN) algorithm is one of the family of lazy learning methods applied to this problem in online learning, but it still faces open challenges, such as efficiently choosing the number of neighbors k used in the learning process. This article proposes paired k-NN learners with dynamically adjusted number of neighbors (PL-kNN), a method that dynamically and incrementally adjusts the number of neighbors used by its pair of k-NN learners during online learning on data streams with concept drift. To validate it, experiments were carried out on both artificial and real-world datasets, and the results were evaluated using accuracy, run-time, memory usage, and the Friedman statistical test with the Nemenyi post hoc test. The experimental results show that PL-kNN improves the accuracy of k-NN with fixed values of k in most of the tested scenarios.
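The core idea described above can be illustrated with a minimal sketch. The code below is not the authors' exact PL-kNN algorithm (whose details are not given here): it assumes a pair of k-NN learners sharing one sliding window, evaluated prequentially (test then train), with the k of the worse-performing learner periodically nudged toward the better one. All class and parameter names are hypothetical.

```python
import math
from collections import deque

def knn_predict(window, x, k):
    """Majority vote among the k nearest stored examples (Euclidean distance)."""
    if not window:
        return None
    neighbors = sorted(window, key=lambda xy: math.dist(xy[0], x))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

class PairedKNN:
    """Illustrative sketch of paired k-NN learners with dynamically
    adjusted k (hypothetical adjustment rule, not the published one)."""

    def __init__(self, k1=3, k2=7, window_size=200, adjust_every=50):
        self.window = deque(maxlen=window_size)  # shared sliding window
        self.ks = [k1, k2]                       # one k per learner
        self.hits = [0, 0]                       # prequential hit counts
        self.seen = 0
        self.adjust_every = adjust_every

    def predict(self, x):
        # The pair answers with the currently better-performing learner.
        best = 0 if self.hits[0] >= self.hits[1] else 1
        return knn_predict(self.window, x, self.ks[best])

    def partial_fit(self, x, y):
        # Prequential update: score each learner on (x, y), then store it.
        for i in range(2):
            if knn_predict(self.window, x, self.ks[i]) == y:
                self.hits[i] += 1
        self.window.append((x, y))
        self.seen += 1
        if self.seen % self.adjust_every == 0:
            # Move the worse learner's k one step toward the better one.
            worse = 0 if self.hits[0] < self.hits[1] else 1
            better = 1 - worse
            step = 1 if self.ks[better] > self.ks[worse] else -1
            self.ks[worse] = max(1, self.ks[worse] + step)
            self.hits = [0, 0]  # restart the comparison window
```

After a concept drift, the fixed-k learner that handles the new concept better pulls its partner's k toward its own value, which is one plausible way a paired scheme can outperform any single fixed k.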
Notes
MOAManager is freely available at https://github.com/brunom4ciel/moamanager/.
Available at http://mlkd.csd.auth.gr/datasets.html.
Acknowledgements
Juan Hidalgo is a PhD student previously supported by a postgraduate grant from Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); Silas Santos is a researcher supported by postdoctorate Grant Number 88887.374884/2019-00 from CAPES; and Prof. Roberto S. M. Barros is supported by research Grant Number 310092/2019-1 from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).
Author information
Authors and Affiliations
Contributions
JH and RB were responsible for conceptualization, validation, writing, reviewing, and editing; SS was involved in validation, writing, reviewing, and editing.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
All run-time and memory usage tables of results omitted from the main body of the article are provided in this appendix, which can also be regarded as supplementary material. In summary, PL-kNN tends to demand slightly more run-time and memory than the other methods, except for SAMkNN, which presents very high memory consumption.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hidalgo, J.I.G., Santos, S.G.T.C. & de Barros, R.S.M. Paired k-NN learners with dynamically adjusted number of neighbors for classification of drifting data streams. Knowl Inf Syst 65, 1787–1816 (2023). https://doi.org/10.1007/s10115-022-01817-y