Cluster validity functions for categorical data: a solution-space perspective

Data Mining and Knowledge Discovery

Abstract

For categorical data, there are three widely used internal validity functions: the \(k\)-modes objective function, the category utility function and the information entropy function. All three are defined using within-cluster information only. Many clustering algorithms have been developed that use them as objective functions and search for their optimal solutions. In this paper, we study the generalization, effectiveness and normalization of the three validity functions from a solution-space perspective. First, we present a generalized validity function for categorical data. Based on it, we analyze the commonalities and differences of the three validity functions in the solution space. Furthermore, we address the question of whether between-cluster information is ignored when these validity functions are used to evaluate clustering results. Finally, we analyze the upper and lower bounds of the three validity functions for a given data set, which can help estimate the clustering difficulty of a data set and compare the performance of a clustering algorithm across different data sets.
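As a rough illustration of the three functions named in the abstract, the following Python sketch evaluates a partition of a toy categorical data set. It assumes the common textbook forms (total mismatches against cluster modes for \(k\)-modes, the Gluck-Corter/Fisher form of category utility, and size-weighted attribute entropy); the paper's exact definitions and normalizations are not shown in this preview and may differ. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def _clusters(data, labels):
    """Group rows by cluster label (hypothetical helper)."""
    groups = {}
    for row, lab in zip(data, labels):
        groups.setdefault(lab, []).append(row)
    return list(groups.values())


def kmodes_cost(data, labels):
    """Sum of Hamming distances from each object to its cluster mode (lower is better)."""
    cost = 0
    for rows in _clusters(data, labels):
        for col in zip(*rows):
            # Mismatches on one attribute = cluster size minus count of the modal value.
            cost += len(col) - Counter(col).most_common(1)[0][1]
    return cost


def category_utility(data, labels):
    """Category utility in its common form (higher is better)."""
    n = len(data)
    groups = _clusters(data, labels)
    # Sum over attributes and values of P(a_j = v)^2 on the whole data set.
    base = sum((c / n) ** 2 for col in zip(*data) for c in Counter(col).values())
    total = 0.0
    for rows in groups:
        m = len(rows)
        within = sum((c / m) ** 2 for col in zip(*rows) for c in Counter(col).values())
        total += m / n * (within - base)
    return total / len(groups)


def expected_entropy(data, labels):
    """Size-weighted sum of per-attribute entropies within clusters (lower is better)."""
    n = len(data)
    total = 0.0
    for rows in _clusters(data, labels):
        m = len(rows)
        h = -sum((c / m) * log2(c / m) for col in zip(*rows) for c in Counter(col).values())
        total += m / n * h
    return total


# Toy data: two attributes, two obvious groups (assumed for illustration only).
data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = [0, 0, 1, 1]  # partition that matches the groups
bad = [0, 1, 0, 1]   # partition that mixes them

for name, labels in (("good", good), ("bad", bad)):
    print(name, kmodes_cost(data, labels),
          category_utility(data, labels), expected_entropy(data, labels))
```

On this toy data, all three functions prefer the partition that matches the two groups, even though each is computed from within-cluster counts alone.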

Acknowledgments

The authors are very grateful to the editors and reviewers for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China (Nos. 71031006, 61305073, 61432011), the National Key Basic Research and Development Program of China (973) (No. 2013CB329404), and the Foundation of Doctoral Program Research of Ministry of Education of China (No. 20131401120001).

Author information

Correspondence to Jiye Liang.

Additional information

Responsible editor: G. Karypis.

Cite this article

Bai, L., Liang, J. Cluster validity functions for categorical data: a solution-space perspective. Data Min Knowl Disc 29, 1560–1597 (2015). https://doi.org/10.1007/s10618-014-0387-5

  • DOI: https://doi.org/10.1007/s10618-014-0387-5
