Effect of cluster size distribution on clustering: a comparative study of k-means and fuzzy c-means clustering

Zhou, Kaile; Yang, Shanlin

doi:10.1007/s10044-019-00783-6

Theoretical advances
Published: 06 March 2019

Volume 23, pages 455–466, (2020)

Pattern Analysis and Applications

Abstract

Data distribution has a significant impact on clustering results. This study focuses on the effect of cluster size distribution on clustering, namely the uniform effect of k-means and fuzzy c-means (FCM) clustering. We first review related work on k-means and FCM clustering. We then present a structure decomposition analysis of the objective functions of k-means and FCM. Afterward, we conduct extensive experiments on synthetic two-dimensional and three-dimensional data sets and on real-world data sets from the UCI machine learning repository. The results demonstrate that FCM has a stronger uniform effect than k-means clustering. They also reveal that the fuzzifier value m = 2 in FCM, which has been widely adopted in many applications, is not a good choice, particularly for data sets with great variation in cluster sizes. Therefore, for data sets with significantly uneven cluster size distributions, a smaller fuzzifier value is preferred for FCM clustering, and k-means clustering is a better choice than FCM clustering.
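The comparison described in the abstract can be explored on toy data with a short NumPy sketch. It is not the authors' experimental code: the data (two Gaussian blobs of 1000 and 50 points), the initialization at the true centers, the helper names, and the use of the coefficient of variation (CV) of cluster sizes as a dispersion measure are all assumptions made for illustration. The k-means and FCM updates are the standard textbook ones. A CV after clustering that is lower than the CV of the true sizes would be consistent with a uniform effect.

```python
import numpy as np


def cv_of_sizes(labels, k):
    """Coefficient of variation (sample std / mean) of the k cluster sizes."""
    sizes = np.bincount(labels, minlength=k).astype(float)
    return sizes.std(ddof=1) / sizes.mean()


def sq_dists(X, centers):
    """Squared Euclidean distance from every point to every center."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def kmeans(X, init_centers, n_iter=100):
    """Plain Lloyd iterations from the given centers; returns hard labels."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        labels = sq_dists(X, centers).argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):  # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return sq_dists(X, centers).argmin(axis=1)


def memberships(X, centers, m, eps=1e-12):
    """FCM membership matrix U (rows sum to 1) for a fuzzifier m > 1."""
    # u_ij is proportional to d_ij^(-2 / (m - 1)); eps avoids division by zero.
    inv = (sq_dists(X, centers) + eps) ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fcm(X, init_centers, m=2.0, n_iter=100):
    """Standard FCM alternating updates; returns labels by maximum membership."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        W = memberships(X, centers, m) ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return memberships(X, centers, m).argmax(axis=1)


def demo(seed=0):
    """Hypothetical toy setup: one large and one small Gaussian cluster."""
    rng = np.random.default_rng(seed)
    big = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2))
    small = rng.normal(loc=[4.0, 0.0], scale=1.0, size=(50, 2))
    X = np.vstack([big, small])
    true_labels = np.array([0] * len(big) + [1] * len(small))
    init = np.array([[0.0, 0.0], [4.0, 0.0]])
    return {
        "true": cv_of_sizes(true_labels, 2),
        "k-means": cv_of_sizes(kmeans(X, init), 2),
        "FCM m=1.5": cv_of_sizes(fcm(X, init, m=1.5), 2),
        "FCM m=2.0": cv_of_sizes(fcm(X, init, m=2.0), 2),
    }


if __name__ == "__main__":
    for name, cv in demo().items():
        print(f"{name:>10}: CV of cluster sizes = {cv:.3f}")
```

Varying the size ratio, the blob separation, and `m` in `demo` gives a rough feel for how the fuzzifier interacts with uneven cluster sizes. A single toy run is only an illustration and does not reproduce the paper's experiments.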

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments and suggestions, which helped improve the quality of the paper. This work was supported by the National Natural Science Foundation of China under Grant Nos. 71822104, 71501056 and 71690235, the Anhui Science and Technology Major Project under Grant No. 17030901024, the China Postdoctoral Science Foundation under Grant No. 2017M612072, and the Hong Kong Scholars Program under Grant No. 2017-167.

Author information

Authors and Affiliations

School of Management, Hefei University of Technology, Hefei, 230009, China
Kaile Zhou & Shanlin Yang
Key Laboratory of Process Optimization and Intelligent Decision-Making of Ministry of Education, Hefei University of Technology, Hefei, 230009, China
Kaile Zhou & Shanlin Yang
City University of Hong Kong, Kowloon, Hong Kong SAR, China
Kaile Zhou

Corresponding author

Correspondence to Kaile Zhou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhou, K., Yang, S. Effect of cluster size distribution on clustering: a comparative study of k-means and fuzzy c-means clustering. Pattern Anal Applic 23, 455–466 (2020). https://doi.org/10.1007/s10044-019-00783-6

Received: 26 October 2017
Accepted: 30 January 2019
Published: 06 March 2019
Issue Date: February 2020
DOI: https://doi.org/10.1007/s10044-019-00783-6
