Abstract
Clustering categorical data, an important problem in data mining, has recently attracted much attention. This paper first studies unsupervised dimensionality reduction for categorical data. Based on rough set theory, the attributes of categorical data are decomposed into a number of rough subspaces. A novel clustering ensemble algorithm based on rough subspaces is then proposed for categorical data. The algorithm clusters the data on a selection of high-quality rough subspaces and combines the resulting partitions into a robust and stable solution. We also introduce a cluster validity index for evaluating clustering solutions on categorical data. Experimental results on selected UCI data sets show that, when evaluated with cluster validity indexes, the proposed method produces better results than the other methods compared.
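The general idea of clustering categorical data on several attribute subspaces and then combining the partitions can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: the abstract does not specify how rough subspaces are built or scored, so the subspaces below are hand-picked and hypothetical, the per-subspace partition is simply the indiscernibility relation from rough set theory (objects are grouped when they agree on every attribute in the subset), and the consensus step is a plain co-association threshold chosen for illustration. All function names and the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def indiscernibility_partition(rows, attrs):
    """Group row indices that agree on every attribute index in `attrs`."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def co_association(n, partitions):
    """Fraction of partitions in which each pair of objects shares a block."""
    counts = [[0.0] * n for _ in range(n)]
    for blocks in partitions:
        for block in blocks:
            for i in block:
                for j in block:
                    counts[i][j] += 1.0
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus(n, coassoc, threshold=0.5):
    """Link pairs whose co-association exceeds `threshold` (an assumed
    cut-off) and return the connected components as clusters."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if coassoc[i][j] > threshold:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy categorical data (hypothetical): colour, size, shape.
rows = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "square"),
    ("blue", "small", "square"),
]
# Hand-picked attribute subsets standing in for selected subspaces.
subspaces = [(0,), (2,), (0, 2), (1,)]

partitions = [indiscernibility_partition(rows, s) for s in subspaces]
matrix = co_association(len(rows), partitions)
clusters = consensus(len(rows), matrix)
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

In this toy run, three of the four subspaces agree on the same two groups, so the single dissenting partition (by size) is outvoted. That is the sense in which an ensemble can be more stable than any one partition.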
Acknowledgments
The authors would like to thank the Editors for their kind help and the anonymous referees for their valuable comments and helpful suggestions. Special thanks go to Ms. Ting Zhu for her assistance in revising the paper. The work is partially supported by the National Natural Science Foundation of China (Serial No. 60970061, 61075056, 61103067, 61202170, 61203247, 61273304), the China Postdoctoral Science Foundation (Serial No. 2011M500626, 2011M500815) and the Fundamental Research Funds for the Central Universities.
Additional information
Communicated by A. Di Nola.
Cite this article
Gao, C., Pedrycz, W. & Miao, D. Rough subspace-based clustering ensemble for categorical data. Soft Comput 17, 1643–1658 (2013). https://doi.org/10.1007/s00500-012-0972-8
DOI: https://doi.org/10.1007/s00500-012-0972-8