Automatic detection of outliers and the number of clusters in k-means clustering via Chebyshev-type inequalities

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

We address two key challenges of k-means clustering. In the first part of the paper, we show that when a dataset is partitioned with an appropriate number of clusters (k), not more than 1/9 of the entries of D will exceed twice its standard deviation (2 s.d.), and not more than 4/9 will exceed one standard deviation (1 s.d.), where D is the vector of distances from each point to its cluster centroid. Our bounds assume unimodal symmetric clusters, a generalization of the Gaussian assumption underlying k-means. In the second part of the paper, we show that a non-outlier will lie no further from its cluster centroid than 14.826 times the median of absolute deviations from the median of D. Interestingly, D is already available from the k-means process. The first insight leads to an enhanced k-means algorithm (named Automatic k-means) which efficiently estimates k. Unlike popular techniques, ours eliminates the need to supply a search range for k. Since practical datasets may deviate from the ideal distribution, the 1 and 2 s.d. tests may yield different estimates of k; these estimates constitute effective lower and upper bounds. Thus, our algorithm also provides a general way to speed up and automate existing techniques, via an automatically determined narrow search range. We demonstrate this by presenting enhanced versions of the popular silhouette and gap statistic techniques (Auto-Silhouette and Auto-Gap). We apply the second theoretical insight to incorporate automatic outlier detection into k-means. Our outlier-aware algorithm (named k-means#) is identical to standard k-means in the absence of outliers. In the presence of outliers, it is identical to a known outlier-aware algorithm, k-means−−, except for the crucial difference that k-means−− relies on the user to supply the number of outliers, whereas our algorithm is automated. Our technique resolves a puzzle described by the authors of k-means−− regarding the difficulty of complete automation, which was considered an open problem.
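
For concreteness, the first result can be turned directly into a stopping rule for k. The sketch below is a minimal illustration, not the authors' reference implementation: the scikit-learn dependency, all function names, and the reading of "exceeds n s.d." as a deviation of more than n·std(D) from mean(D) are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def centroid_distances(X, km):
    # D: distance from each point to its assigned cluster centroid
    return np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

def passes_bound(D, n_sd, max_frac):
    # Chebyshev/Camp-Meidell-type test: at most max_frac of the entries of D
    # may deviate from mean(D) by more than n_sd standard deviations
    # (max_frac = 4/9 for 1 s.d., 1/9 for 2 s.d.)
    return np.mean(np.abs(D - D.mean()) > n_sd * D.std()) <= max_frac

def automatic_k(X, k_max=50, seed=0):
    # Smallest k passing each test; when the two estimates differ, they
    # bracket the true k and can seed a narrow search range for
    # silhouette- or gap-statistic-based selection (Auto-Silhouette, Auto-Gap).
    k_1sd = k_2sd = None
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        D = centroid_distances(X, km)
        if k_1sd is None and passes_bound(D, 1, 4 / 9):
            k_1sd = k
        if k_2sd is None and passes_bound(D, 2, 1 / 9):
            k_2sd = k
        if k_1sd is not None and k_2sd is not None:
            break
    return k_1sd, k_2sd
```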
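
The second result gives a parameter-free outlier cutoff computed from the same vector D. The following is likewise a hedged sketch under our own assumptions (the alternating refit loop and every name are ours, not the published k-means# pseudocode):

```python
import numpy as np
from sklearn.cluster import KMeans

def mad_cutoff(D):
    # The paper's bound: a non-outlier lies no further from its centroid than
    # 14.826 times the median absolute deviation from the median of D
    return 14.826 * np.median(np.abs(D - np.median(D)))

def kmeans_sharp(X, k, max_rounds=20, seed=0):
    # Alternate between fitting k-means on the current inliers and
    # re-flagging outliers by the MAD rule, until the inlier set stabilizes.
    # With no outliers, no point is flagged and this reduces to plain k-means.
    inlier = np.ones(len(X), dtype=bool)
    for _ in range(max_rounds):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[inlier])
        # distance of every point (including flagged ones) to its nearest centroid
        D = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2).min(axis=1)
        new_inlier = D <= mad_cutoff(D)
        if np.array_equal(new_inlier, inlier):
            break
        inlier = new_inlier
    return km, inlier
```

Unlike k-means−−, no outlier count is supplied; the MAD cutoff determines the number of outliers automatically.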

Acknowledgements

This work was funded by the University of Johannesburg Global Excellence and Stature Doctoral Scholarship.

Author information

Corresponding author

Correspondence to Peter Olukanmi.

Ethics declarations

Conflict of interest

We have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Olukanmi, P., Nelwamondo, F., Marwala, T. Automatic detection of outliers and the number of clusters in k-means clustering via Chebyshev-type inequalities. Neural Comput & Applic 34, 5939–5958 (2022). https://doi.org/10.1007/s00521-021-06689-x
