Abstract
Data reduction processes are designed not only to reduce the amount of data but also to suppress noise. In this study, we focus on sample reduction algorithms for classification and regression data. We propose a sample quality evaluation measure, denoted NN-kNN, that is inspired by human social behavior. NN-kNN is a local evaluation method that can accurately assess sample quality under uneven and irregular data distributions; it is also easy to interpret and applies to both supervised and unsupervised data. Based on this measure, we then develop sample reduction algorithms for classification and regression data, respectively. Experiments are carried out to verify the proposed quality evaluation measure and the data reduction algorithms. The results show that NN-kNN evaluates data quality effectively, and that the high-quality samples selected by the reduction algorithms yield strong classification and prediction performance. Furthermore, the robustness of the sample reduction algorithms is validated.
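The abstract does not reproduce the exact formulation of NN-kNN, so the following is only a minimal sketch of one plausible reading of the name and of the social-behavior analogy (a person's standing measured by how many people count them among their close friends): each sample is scored by its reverse-k-nearest-neighbor count, i.e., how many other samples include it among their own k nearest neighbors. The function `nn_knn_quality`, the normalization by k, and the median-based selection threshold are illustrative assumptions, not the paper's definitions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_knn_quality(X, k=5):
    """Score each sample by how many other samples count it
    among their own k nearest neighbors (a reverse-kNN count).

    This is an assumed reading of the NN-kNN measure, not the
    paper's exact formula.
    """
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)       # idx[i, 0] is sample i itself
    counts = np.zeros(len(X))
    for neighbors in idx[:, 1:]:      # drop the self-match in column 0
        counts[neighbors] += 1        # each sample casts k "friendship" votes
    # Total votes equal n * k, so the mean normalized score is 1;
    # isolated or noisy samples score well below 1.
    return counts / k

# Illustrative usage: keep the higher-quality half of a toy dataset.
X = np.random.rand(200, 2)
scores = nn_knn_quality(X, k=7)
X_reduced = X[scores >= np.median(scores)]
```

Under this reading, the measure is local (scores depend only on neighborhood structure, so it tolerates uneven and irregular densities) and label-free, which is consistent with the abstract's claim that it applies to both supervised and unsupervised data.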
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (61976027, U1808205) and the Natural Science Foundation of Hebei Province of China (A2018501040).
Cite this article
An, S., Hu, Q., Wang, C. et al. Data reduction based on NN-kNN measure for NN classification and regression. Int. J. Mach. Learn. & Cyber. 13, 765–781 (2022). https://doi.org/10.1007/s13042-021-01327-3