
RHC: a non-parametric cluster-based data reduction for efficient \(k\)-NN classification

Theoretical Advances · Pattern Analysis and Applications

Abstract

Although the \(k\)-NN classifier is a popular classification method, it suffers from high computational cost and storage requirements. This paper proposes two effective cluster-based data reduction algorithms for efficient \(k\)-NN classification. Both have low preprocessing cost and achieve high data reduction rates while keeping \(k\)-NN classification accuracy at high levels. The first proposed algorithm, reduction through homogeneous clusters (RHC), is based on a fast preprocessing clustering procedure that creates homogeneous clusters; the centroids of these clusters constitute the reduced training set. The second proposed algorithm is a dynamic version of RHC that retains all of its properties and, in addition, can manage datasets that do not fit in main memory and is appropriate for dynamic environments where new training data become available gradually. Experimental results on fourteen datasets show that both algorithms are faster and achieve higher reduction rates than four known methods, while maintaining high classification accuracy.
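As a concrete illustration of the RHC construction described in the abstract, the following is a minimal NumPy sketch based solely on that description: it recursively applies \(k\)-means with one initial mean per class present in the current subset and keeps the centroids of homogeneous (single-class) clusters as prototypes. The function name `rhc`, the fixed iteration count, and the degenerate-split guard are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def rhc(X, y, iters=10):
    """Sketch of reduction through homogeneous clusters (RHC).

    Recursively splits the training set with k-means, where k equals the
    number of classes present in the current subset and the initial means
    are the per-class centroids. Each homogeneous (single-class) cluster
    contributes its centroid to the condensed set.
    """
    prototypes, labels = [], []
    queue = [(X, y)]
    while queue:
        Xc, yc = queue.pop()
        classes = np.unique(yc)
        if classes.size == 1:
            prototypes.append(Xc.mean(axis=0))  # homogeneous: keep centroid
            labels.append(classes[0])
            continue
        # Seed one mean per class with that class's centroid.
        means = np.array([Xc[yc == c].mean(axis=0) for c in classes])
        for _ in range(iters):  # iteration count is an assumption of the sketch
            dists = ((Xc[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            for j in range(classes.size):
                if np.any(assign == j):
                    means[j] = Xc[assign == j].mean(axis=0)
        nonempty = [j for j in range(classes.size) if np.any(assign == j)]
        if len(nonempty) < 2:
            # Degenerate split (guard added for this sketch): keep the
            # centroid labelled with the majority class and stop recursing.
            counts = [(yc == c).sum() for c in classes]
            prototypes.append(Xc.mean(axis=0))
            labels.append(classes[int(np.argmax(counts))])
            continue
        for j in nonempty:
            queue.append((Xc[assign == j], yc[assign == j]))
    return np.vstack(prototypes), np.array(labels)
```

The returned prototype set can then stand in for the full training set in any \(k\)-NN classifier, which is where the speed and storage savings come from.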


Notes

  1. Data reduction techniques (DRTs) can be approached from two points of view: (1) item reduction and (2) dimensionality reduction. We consider them from the first point of view.

  2. One should not confuse the \(k\) of \(k\)-NN with the \(k\) of \(k\)-Means.

  3. http://sci2s.ugr.es/keel/datasets.php.

  4. Many datasets distributed by the KEEL repository have been preprocessed to remove duplicates.


Acknowledgments

We are grateful to the anonymous reviewers for their valuable comments on an earlier version of the paper.

Author information


Correspondence to Stefanos Ougiaroglou.

Additional information

S. Ougiaroglou is supported by a scholarship from the state scholarships foundation of Greece (I.K.Y.).


About this article


Cite this article

Ougiaroglou, S., Evangelidis, G. RHC: a non-parametric cluster-based data reduction for efficient \(k\)-NN classification. Pattern Anal Applic 19, 93–109 (2016). https://doi.org/10.1007/s10044-014-0393-7

