Abstract
Clustering categorical data poses two challenges: defining an inherently meaningful similarity measure, and effectively dealing with clusters that are often embedded in different subspaces. In this paper, we propose a novel divisive hierarchical clustering algorithm for categorical data, named DHCC. We view the task of clustering categorical data from an optimization perspective, and propose effective procedures to initialize and refine the splitting of clusters. The initialization of the splitting is based on multiple correspondence analysis (MCA). We also devise a strategy for deciding when to terminate the splitting process. The proposed algorithm has five merits. First, due to its hierarchical nature, our algorithm yields a dendrogram representing nested groupings of patterns and similarity levels at different granularities. Second, it is parameter-free, fully automatic and, in particular, requires no assumption regarding the number of clusters. Third, it is independent of the order in which the data is processed. Fourth, it is scalable to large data sets. Finally, our algorithm is capable of seamlessly discovering clusters embedded in subspaces, thanks to its use of a novel data representation and Chi-square dissimilarity measures. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
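To make the MCA-based initialization more concrete, the following is a minimal sketch of how a single cluster might be bisected with correspondence analysis of an indicator (one-hot) matrix. It is not the DHCC procedure itself: the function names `indicator_matrix` and `mca_bisect`, the toy data, and the rule of splitting on the sign of the first principal coordinate are all assumptions made for illustration, and the refinement and termination steps mentioned in the abstract are not shown.

```python
import numpy as np

def indicator_matrix(rows):
    """One-hot encode categorical rows (hypothetical helper).

    Returns the 0/1 indicator matrix Z and the (attribute, value) label
    of each column.
    """
    n_attrs = len(rows[0])
    columns = [(j, value)
               for j in range(n_attrs)
               for value in sorted({row[j] for row in rows})]
    index = {col: k for k, col in enumerate(columns)}
    Z = np.zeros((len(rows), len(columns)))
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            Z[i, index[(j, value)]] = 1.0
    return Z, columns

def mca_bisect(Z):
    """Split objects by the sign of their first MCA principal coordinate.

    Assumption: this sign-based rule is an illustrative choice, not
    necessarily the rule used by DHCC.
    """
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    first_axis = U[:, 0] * s[0] / np.sqrt(r)  # row principal coordinates
    return first_axis >= 0

# Toy data (made up): two groups that agree on the first two attributes.
rows = [
    ("a", "x", "p"), ("a", "x", "p"), ("a", "x", "q"),
    ("b", "y", "q"), ("b", "y", "q"), ("b", "y", "p"),
]
Z, columns = indicator_matrix(rows)
labels = mca_bisect(Z)
print(labels)
```

On this toy input the first three rows should land on one side of the split and the last three on the other; which side is labeled `True` depends on the arbitrary sign of the singular vector.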
Additional information
Responsible editor: Charu Aggarwal.
Cite this article
Xiong, T., Wang, S., Mayers, A. et al. DHCC: Divisive hierarchical clustering of categorical data. Data Min Knowl Disc 24, 103–135 (2012). https://doi.org/10.1007/s10618-011-0221-2
DOI: https://doi.org/10.1007/s10618-011-0221-2