“Best K”: critical clustering structures in categorical datasets

Chen, Keke; Liu, Ling

doi:10.1007/s10115-008-0159-x

Regular Paper
Published: 04 September 2008

Volume 20, pages 1–33, (2009)

Knowledge and Information Systems

Keke Chen¹ &
Ling Liu²

Abstract

The demand for cluster analysis of categorical data has continued to grow over the last decade. A well-known problem in categorical clustering is determining the best number of clusters, K. Although several categorical clustering algorithms have been developed, surprisingly, none has satisfactorily addressed the problem of the best K for categorical clustering. Because categorical data has no inherent distance function to serve as a similarity measure, traditional cluster validation techniques based on geometric shapes and density distributions are not appropriate for it. In this paper, we study the entropy property between the clustering results of categorical data with different numbers of clusters K, and propose the BKPlot method to address three important cluster validation problems: (1) How can we determine whether there is significant clustering structure in a categorical dataset? (2) If there is significant clustering structure, what is the set of candidate “best Ks”? (3) If the dataset is large, how can we efficiently and reliably determine the best Ks?
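
As a rough illustration of the kind of entropy comparison the abstract refers to, the following minimal sketch computes a size-weighted entropy for a partition of categorical records and compares K = 1 against K = 2 on toy data. It is not the BKPlot method, whose definition the abstract does not give. The function names `cluster_entropy` and `expected_entropy`, the toy records, and the choice of summing per-attribute Shannon entropies are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def cluster_entropy(rows):
    """Sum of per-attribute Shannon entropies (in bits) for one cluster.

    Assumption for this sketch: attributes are treated independently,
    so the cluster entropy is the sum over columns.
    """
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate attribute by attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def expected_entropy(clusters):
    """Size-weighted average of cluster entropies for a partition."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


# Hypothetical toy data: two perfectly homogeneous groups of records.
group_a = [("a", "x")] * 4
group_b = [("b", "y")] * 4

h_k1 = expected_entropy([group_a + group_b])  # everything in one cluster
h_k2 = expected_entropy([group_a, group_b])   # split into the two groups

print(f"K=1: {h_k1:.2f} bits, K=2: {h_k2:.2f} bits")
```

On this toy data the entropy falls from 2 bits to 0 when the records are split into their two homogeneous groups. A large change of this kind between neighboring values of K is one plausible signal of clustering structure. How the paper turns such changes into candidate best Ks is not stated in the abstract.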

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Wright State University, Dayton, OH, USA
Keke Chen
College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
Ling Liu

Corresponding author

Correspondence to Keke Chen.

About this article

Cite this article

Chen, K., Liu, L. “Best K”: critical clustering structures in categorical datasets. Knowl Inf Syst 20, 1–33 (2009). https://doi.org/10.1007/s10115-008-0159-x

Received: 13 May 2007
Revised: 16 June 2008
Accepted: 13 July 2008
Published: 04 September 2008
Issue Date: July 2009
DOI: https://doi.org/10.1007/s10115-008-0159-x
