Effect of cluster size distribution on clustering: a comparative study of k-means and fuzzy c-means clustering

  • Theoretical advances
  • Pattern Analysis and Applications

Abstract

Data distribution has a significant impact on clustering results. This study examines how the distribution of cluster sizes affects clustering, specifically the uniform effect of k-means and fuzzy c-means (FCM) clustering, that is, their tendency to produce clusters of relatively uniform size. We first review related work on k-means and FCM clustering. We then present a structure decomposition analysis of the objective functions of k-means and FCM. Extensive experiments are then conducted on synthetic two-dimensional and three-dimensional data sets and on real-world data sets from the UCI Machine Learning Repository. The results demonstrate that FCM exhibits a stronger uniform effect than k-means. They also reveal that the fuzzifier value m = 2, which has been widely adopted in FCM applications, is not a good choice, particularly for data sets with large variation in cluster sizes. Therefore, for data sets with markedly uneven cluster sizes, a smaller fuzzifier value is preferred for FCM clustering, and k-means is a better choice than FCM.
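As a rough illustration of the kind of comparison the abstract describes, the sketch below runs a plain NumPy k-means and a standard FCM on a synthetic two-dimensional data set with one large and one small cluster, then compares the resulting cluster sizes. It is not the authors' experimental code. The data set, the random seed, the initial centers, the hardening of FCM memberships by argmax, and the use of the coefficient of variation (CV) of cluster sizes as a summary are all assumptions made for illustration.

```python
import numpy as np


def coefficient_of_variation(sizes):
    """Sample standard deviation of the cluster sizes divided by their mean."""
    sizes = np.asarray(sizes, dtype=float)
    return sizes.std(ddof=1) / sizes.mean()


def squared_distances(X, centers):
    """Squared Euclidean distances, shape (n_points, n_clusters)."""
    diff = X[:, None, :] - centers[None, :, :]
    return (diff ** 2).sum(axis=2)


def kmeans_labels(X, init_centers, n_iter=100):
    """Lloyd's algorithm from fixed initial centers; returns hard labels."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        labels = squared_distances(X, centers).argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):  # an empty cluster keeps its old center
                centers[k] = X[labels == k].mean(axis=0)
    return squared_distances(X, centers).argmin(axis=1)


def fcm_memberships(X, centers, m):
    """FCM membership update: u_ik proportional to d_ik^(-2/(m-1)), rows sum to 1."""
    d2 = np.maximum(squared_distances(X, centers), 1e-12)  # avoid division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fcm_labels(X, init_centers, m=2.0, n_iter=100):
    """Standard FCM alternating updates; memberships are hardened by argmax."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        W = fcm_memberships(X, centers, m) ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return fcm_memberships(X, centers, m).argmax(axis=1)


# Hypothetical data: one large and one small Gaussian cluster (900 vs. 100 points).
rng = np.random.default_rng(0)
true_sizes = [900, 100]
X = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(true_sizes[0], 2)),
    rng.normal([4.0, 0.0], 1.0, size=(true_sizes[1], 2)),
])
init = [[0.0, 0.0], [4.0, 0.0]]  # assumed: start from the true means

results = {
    "true": np.array(true_sizes),
    "k-means": np.bincount(kmeans_labels(X, init), minlength=2),
    "FCM m=2.0": np.bincount(fcm_labels(X, init, m=2.0), minlength=2),
    "FCM m=1.2": np.bincount(fcm_labels(X, init, m=1.2), minlength=2),
}
for name, sizes in results.items():
    print(f"{name:>10}: sizes={sizes.tolist()}, CV={coefficient_of_variation(sizes):.3f}")
```

A recovered CV that is lower than the CV of the true sizes would indicate that an algorithm has evened out the cluster sizes. The exact numbers depend on the seed, the cluster separation, and the initialization, so this sketch should be read as a way to probe the effect rather than as a reproduction of the reported results.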

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments and suggestions for improving the quality of the paper. This work was supported by the National Natural Science Foundation of China under Grant Nos. 71822104, 71501056 and 71690235, the Anhui Science and Technology Major Project under Grant No. 17030901024, the China Postdoctoral Science Foundation under Grant No. 2017M612072, and the Hong Kong Scholars Program under Grant No. 2017-167.

Author information

Corresponding author

Correspondence to Kaile Zhou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhou, K., Yang, S. Effect of cluster size distribution on clustering: a comparative study of k-means and fuzzy c-means clustering. Pattern Anal Applic 23, 455–466 (2020). https://doi.org/10.1007/s10044-019-00783-6

  • DOI: https://doi.org/10.1007/s10044-019-00783-6
