Abstract
We address two key challenges of k-means clustering. In the first part of the paper, we show that when a dataset is partitioned with an appropriate number of clusters (k), no more than 1/9 of D will exceed twice its standard deviation (2 s.d.) and no more than 4/9 of D will exceed one standard deviation (1 s.d.), where D is the vector of distances from each point to its cluster centroid. Our bounds assume unimodal, symmetrical clusters (a generalization of k-means’ Gaussian assumption). In the second part of the paper, we show that a non-outlier will lie no farther from its cluster centroid than 14.826 times the median of absolute deviations from the median of D. Interestingly, D is already available from the k-means process. The first insight leads to an enhanced k-means algorithm (named Automatic k-means) that efficiently estimates k. Unlike popular techniques, ours eliminates the need to supply a search range for k. Since practical datasets may deviate from the ideal distribution, the 1 s.d. and 2 s.d. tests may yield different estimates of k; the two estimates constitute effective lower and upper bounds. Thus, our algorithm also provides a general way to speed up and automate existing techniques via an automatically determined, narrow search range. We demonstrate this by presenting enhanced versions of the popular silhouette and gap statistic techniques (Auto-Silhouette and Auto-Gap). We apply the second theoretical insight to incorporate automatic outlier detection into k-means. Our outlier-aware algorithm (named k-means#) is identical to standard k-means in the absence of outliers. In the presence of outliers, it is identical to a known outlier-aware algorithm, k-means−−, except for the crucial difference that k-means−− relies on the user to supply the number of outliers, whereas ours is fully automated. This resolves a puzzle described by the authors of k-means−− regarding the difficulty of complete automation, which they considered an open problem.
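The 1/9 and 4/9 figures are consistent with the Chebyshev-type bound 4/(9t^2) for unimodal symmetrical distributions, evaluated at t = 2 and t = 1 standard deviations, respectively. As a rough illustration only (not the authors' implementation), the sketch below shows how the two criteria stated in the abstract could be wired around an off-the-shelf k-means. It assumes Euclidean distances, scikit-learn's KMeans, a stopping rule that accepts the first k satisfying the chosen bound, and a deviation-from-the-mean reading of the s.d. test; the names centroid_distances, estimate_k and flag_outliers are illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans  # standard Lloyd-type k-means

def centroid_distances(X, labels, centers):
    # D: Euclidean distance of each point to its assigned centroid
    return np.linalg.norm(X - centers[labels], axis=1)

def estimate_k(X, sd_multiple=2.0, bound=1/9, k_max=50, seed=0):
    # Increase k until the fraction of points whose distance deviates
    # from the mean of D by more than sd_multiple standard deviations
    # falls within the bound (2 s.d. with 1/9, or 1 s.d. with 4/9).
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        D = centroid_distances(X, km.labels_, km.cluster_centers_)
        frac = np.mean(np.abs(D - D.mean()) > sd_multiple * D.std())
        if frac <= bound:
            return k
    return k_max

def flag_outliers(X, labels, centers, factor=14.826):
    # Flag points farther from their centroid than factor * MAD(D),
    # per the non-outlier bound stated in the abstract.
    D = centroid_distances(X, labels, centers)
    mad = np.median(np.abs(D - np.median(D)))
    return D > factor * mad
```

Under the same assumptions, a typical use would be to call estimate_k(X) to obtain k, fit KMeans with that k, and then pass the fitted labels and centers to flag_outliers to obtain a Boolean outlier mask.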







Acknowledgements
This work was funded by the University of Johannesburg Global Excellence and Stature Doctoral Scholarship.
Ethics declarations
Conflict of interest
We have no conflicts of interest.
Cite this article
Olukanmi, P., Nelwamondo, F., Marwala, T. et al. Automatic detection of outliers and the number of clusters in k-means clustering via Chebyshev-type inequalities. Neural Comput & Applic 34, 5939–5958 (2022). https://doi.org/10.1007/s00521-021-06689-x