Abstract
We address two key challenges of k-means clustering. In the first part of the paper, we show that when a dataset is partitioned with an appropriate number of clusters (k), no more than 1/9 of D will exceed twice its standard deviation (2 s.d.) and no more than 4/9 of D will exceed one standard deviation (1 s.d.), where D is the vector of distances from each point to its cluster centroid. Our bounds assume unimodal, symmetrical clusters (a generalization of k-means’ Gaussian assumption). In the second part of the paper, we show that a non-outlier will lie no farther from its cluster centroid than 14.826 times the median of absolute deviations from the median of D. Interestingly, D is already available from the k-means process. The first insight leads to an enhanced k-means algorithm (named Automatic k-means) that efficiently estimates k. Unlike popular techniques, ours eliminates the need to supply a search range for k. Since practical datasets may deviate from the ideal distribution, the 1 s.d. and 2 s.d. tests may yield different estimates of k; the two estimates constitute effective lower and upper bounds. Thus, our algorithm also provides a general way to speed up and automate existing techniques via an automatically determined, narrow search range. We demonstrate this by presenting enhanced versions of the popular silhouette and gap statistic techniques (Auto-Silhouette and Auto-Gap). We apply the second theoretical insight to incorporate automatic outlier detection into k-means. Our outlier-aware algorithm (named k-means#) is identical to standard k-means in the absence of outliers. In the presence of outliers, it is identical to a known outlier-aware algorithm, k-means−−, except for the crucial difference that k-means−− relies on the user to supply the number of outliers, whereas ours is fully automated. This resolves a puzzle described by the authors of k-means−− regarding the difficulty of complete automation, which they considered an open problem.
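The 1/9 and 4/9 figures are consistent with the Chebyshev-type bound 4/(9t^2) for unimodal symmetrical distributions, evaluated at t = 2 and t = 1 standard deviations, respectively. As a rough illustration only (not the authors' implementation), the sketch below shows how the two criteria stated in the abstract could be wired around an off-the-shelf k-means. It assumes Euclidean distances, scikit-learn's KMeans, a stopping rule that accepts the first k satisfying the chosen bound, and a deviation-from-the-mean reading of the s.d. test; the names centroid_distances, estimate_k and flag_outliers are illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans  # standard Lloyd-type k-means

def centroid_distances(X, labels, centers):
    # D: Euclidean distance of each point to its assigned centroid
    return np.linalg.norm(X - centers[labels], axis=1)

def estimate_k(X, sd_multiple=2.0, bound=1/9, k_max=50, seed=0):
    # Increase k until the fraction of points whose distance deviates
    # from the mean of D by more than sd_multiple standard deviations
    # falls within the bound (2 s.d. with 1/9, or 1 s.d. with 4/9).
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        D = centroid_distances(X, km.labels_, km.cluster_centers_)
        frac = np.mean(np.abs(D - D.mean()) > sd_multiple * D.std())
        if frac <= bound:
            return k
    return k_max

def flag_outliers(X, labels, centers, factor=14.826):
    # Flag points farther from their centroid than factor * MAD(D),
    # per the non-outlier bound stated in the abstract.
    D = centroid_distances(X, labels, centers)
    mad = np.median(np.abs(D - np.median(D)))
    return D > factor * mad
```

Under the same assumptions, a typical use would be to call estimate_k(X) to obtain k, fit KMeans with that k, and then pass the fitted labels and centers to flag_outliers to obtain a Boolean outlier mask.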







Acknowledgements
This work was funded by the University of Johannesburg Global Excellence and Stature Doctoral Scholarship.
Ethics declarations
Conflict of interest
We have no conflicts of interest.
Cite this article
Olukanmi, P., Nelwamondo, F., Marwala, T. et al. Automatic detection of outliers and the number of clusters in k-means clustering via Chebyshev-type inequalities. Neural Comput & Applic 34, 5939–5958 (2022). https://doi.org/10.1007/s00521-021-06689-x