Abstract
For categorical data, there are three widely used internal validity functions: the \(k\)-modes objective function, the category utility function and the information entropy function, all of which are defined using within-cluster information only. Many clustering algorithms have been developed that use them as objective functions and search for their optimal solutions. In this paper, we study the generalization, effectiveness and normalization of the three validity functions from a solution-space perspective. First, we present a generalized validity function for categorical data. Based on it, we analyze the commonalities and differences of the three validity functions in the solution space. Furthermore, we address the question of whether between-cluster information is ignored when these validity functions are used to evaluate clustering results. Finally, we analyze the upper and lower bounds of the three validity functions for a given data set, which can help us estimate the clustering difficulty of a data set and compare the performance of a clustering algorithm across different data sets.
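To make the three functions named in the abstract concrete, here is a minimal Python sketch. It assumes their commonly cited textbook forms (simple-matching distance to cluster modes, the category utility with a 1/k factor, and the size-weighted sum of per-attribute entropies), which may differ in scaling or notation from the definitions used in the paper. All function names and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def _clusters(data, labels):
    """Group rows by cluster label (helper, hypothetical name)."""
    groups = {}
    for row, lab in zip(data, labels):
        groups.setdefault(lab, []).append(row)
    return groups


def _value_counts(rows, j):
    """Count the values taken by attribute j over the given rows."""
    return Counter(row[j] for row in rows)


def kmodes_objective(data, labels):
    """Total simple-matching distance of objects to their cluster mode (lower is better)."""
    total = 0
    for rows in _clusters(data, labels).values():
        for j in range(len(rows[0])):
            # Mismatches against the mode = cluster size - frequency of the mode value.
            total += len(rows) - max(_value_counts(rows, j).values())
    return total


def category_utility(data, labels):
    """Category utility with the 1/k factor (assumed form; higher is better)."""
    n, m = len(data), len(data[0])
    groups = _clusters(data, labels)
    # Sum of squared value probabilities over the whole data set.
    base = sum((c / n) ** 2 for j in range(m)
               for c in _value_counts(data, j).values())
    cu = 0.0
    for rows in groups.values():
        within = sum((c / len(rows)) ** 2 for j in range(m)
                     for c in _value_counts(rows, j).values())
        cu += len(rows) / n * (within - base)
    return cu / len(groups)


def expected_entropy(data, labels):
    """Size-weighted sum of per-attribute entropies within clusters (lower is better)."""
    n, m = len(data), len(data[0])
    total = 0.0
    for rows in _clusters(data, labels).values():
        h = 0.0
        for j in range(m):
            for c in _value_counts(rows, j).values():
                p = c / len(rows)
                h -= p * log2(p)
        total += len(rows) / n * h
    return total


# Toy data (made up): two attributes, two obvious groups.
data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = [0, 0, 1, 1]  # clusters coincide with the groups
bad = [0, 1, 0, 1]   # clusters mix the groups

for name, labels in (("good", good), ("bad", bad)):
    print(name,
          kmodes_objective(data, labels),
          category_utility(data, labels),
          expected_entropy(data, labels))
```

All three functions are computed from value frequencies inside each cluster only, which is the within-cluster property the abstract refers to. On the toy data the three measures agree that the first partition is better than the second.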
Acknowledgments
The authors are very grateful to the editors and reviewers for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China (Nos. 71031006, 61305073, 61432011), the National Key Basic Research and Development Program of China (973) (No. 2013CB329404), and the Foundation of Doctoral Program Research of Ministry of Education of China (No. 20131401120001).
Additional information
Responsible editor: G. Karypis.
About this article
Cite this article
Bai, L., Liang, J. Cluster validity functions for categorical data: a solution-space perspective. Data Min Knowl Disc 29, 1560–1597 (2015). https://doi.org/10.1007/s10618-014-0387-5
DOI: https://doi.org/10.1007/s10618-014-0387-5