Abstract
Data distribution has a significant impact on clustering results. This study focuses on the effect of cluster size distribution on clustering, namely the uniform effect of k-means and fuzzy c-means (FCM) clustering. We first review related work on k-means and FCM clustering. We then present a structure decomposition analysis of the objective functions of k-means and FCM. Extensive experiments are conducted on synthetic two-dimensional and three-dimensional data sets and on real-world data sets from the UCI machine learning repository. The results demonstrate that FCM has a stronger uniform effect than k-means clustering. They also reveal that the fuzzifier value m = 2, which has been widely adopted in FCM applications, is not a good choice, particularly for data sets with great variation in cluster sizes. Therefore, for data sets with significantly uneven cluster size distributions, a smaller fuzzifier value is preferred for FCM clustering, and k-means clustering is a better choice than FCM clustering.
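The uniform effect refers to the tendency of an algorithm to produce clusters of relatively similar sizes even when the true cluster sizes differ widely. The following is a minimal NumPy sketch of how such an effect could be probed. It is not the authors' code. The function names (`fcm`, `kmeans`, `size_cv`), the synthetic data, the seeds, and the use of the coefficient of variation of cluster sizes (sample standard deviation divided by mean) as the measure are all assumptions made for illustration.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means. Returns (centers, U) with U of shape (n, c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # memberships of each point sum to 1
    for _ in range(n_iter):
        W = U ** m                               # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)               # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))           # equals d ** (-2 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means. Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # An empty cluster keeps its previous center.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

def size_cv(labels, c):
    """Coefficient of variation of cluster sizes (assumed measure, ddof=1)."""
    sizes = np.bincount(labels, minlength=c)
    return sizes.std(ddof=1) / sizes.mean()

if __name__ == "__main__":
    # Hypothetical imbalanced data: one large and one small Gaussian cluster.
    rng = np.random.default_rng(42)
    big = rng.normal([0.0, 0.0], 1.0, size=(1000, 2))
    small = rng.normal([5.0, 0.0], 1.0, size=(100, 2))
    X = np.vstack([big, small])
    true = np.r_[np.zeros(1000, int), np.ones(100, int)]
    _, km_labels = kmeans(X, 2)
    print("CV true   :", size_cv(true, 2))
    print("CV k-means:", size_cv(km_labels, 2))
    for m in (1.5, 2.0, 3.0):
        _, U = fcm(X, 2, m=m)
        print(f"CV FCM m={m}:", size_cv(U.argmax(axis=1), 2))
```

A clustering whose size CV is well below the CV of the true sizes has evened out the cluster sizes, which is one way to read the uniform effect. Running the sketch with several fuzzifier values gives a rough feel for how m might influence that gap on a particular data set, although a single synthetic example does not substitute for the experiments reported in the paper.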
Acknowledgements
The authors would like to thank the anonymous reviewers very much for their valuable comments and suggestions for improving the quality of the paper. This work was supported by the National Natural Science Foundation of China under Grant Nos. 71822104, 71501056 and 71690235, Anhui Science and Technology Major Project under Grant No. 17030901024, China Postdoctoral Science Foundation under Grant No. 2017M612072, and Hong Kong Scholars Program under Grant No. 2017-167.
About this article
Cite this article
Zhou, K., Yang, S. Effect of cluster size distribution on clustering: a comparative study of k-means and fuzzy c-means clustering. Pattern Anal Applic 23, 455–466 (2020). https://doi.org/10.1007/s10044-019-00783-6
DOI: https://doi.org/10.1007/s10044-019-00783-6