Cluster validity functions for categorical data: a solution-space perspective

Bai, Liang; Liang, Jiye

doi:10.1007/s10618-014-0387-5

Published: 02 October 2014

Volume 29, pages 1560–1597, (2015)

Data Mining and Knowledge Discovery

Liang Bai¹,² & Jiye Liang¹

Abstract

For categorical data, there are three widely used internal validity functions: the \(k\)-modes objective function, the category utility function, and the information entropy function, all of which are defined using within-cluster information only. Many clustering algorithms have been developed that use them as objective functions and search for their optimal solutions. In this paper, we study the generalization, effectiveness, and normalization of the three validity functions from a solution-space perspective. First, we present a generalized validity function for categorical data. Based on it, we analyze the commonalities and differences of the three validity functions in the solution space. Furthermore, we address the question of whether between-cluster information is ignored when these validity functions are used to evaluate clustering results. Finally, we analyze the upper and lower bounds of the three validity functions for a given data set, which can help estimate the clustering difficulty of a data set and compare the performance of a clustering algorithm across different data sets.
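
As a rough illustration of the three functions named in the abstract, the sketch below computes each one for a small categorical data set and a given partition. It is not the authors' formulation: it assumes the commonly cited definitions (Hamming distance to the cluster mode for the \(k\)-modes objective, the Fisher-style category utility divided by the number of clusters, and the size-weighted sum of per-attribute Shannon entropies), and the function names `kmodes_cost`, `category_utility`, and `expected_entropy` are hypothetical. Other formulations, for example category utility without the division by the number of clusters, differ only by scaling.

```python
from collections import Counter
from math import log2


def _groups(data, labels):
    """Group the rows of `data` by their cluster label."""
    clusters = {}
    for row, label in zip(data, labels):
        clusters.setdefault(label, []).append(row)
    return list(clusters.values())


def kmodes_cost(data, labels):
    """k-modes objective (assumed form): total number of attribute
    mismatches between each object and the mode of its cluster."""
    cost = 0
    for rows in _groups(data, labels):
        for column in zip(*rows):
            # Objects that differ from the most frequent value each cost 1.
            cost += len(column) - max(Counter(column).values())
    return cost


def _sum_squared_freq(rows):
    """Sum over attributes and values of the squared relative frequency."""
    return sum(
        (count / len(rows)) ** 2
        for column in zip(*rows)
        for count in Counter(column).values()
    )


def category_utility(data, labels):
    """Category utility (assumed Fisher-style form, averaged over clusters)."""
    n = len(data)
    baseline = _sum_squared_freq(data)
    groups = _groups(data, labels)
    gain = sum(len(g) / n * (_sum_squared_freq(g) - baseline) for g in groups)
    return gain / len(groups)


def expected_entropy(data, labels):
    """Information entropy function (assumed form): cluster-size-weighted
    sum of per-attribute Shannon entropies, in bits."""
    n = len(data)
    total = 0.0
    for rows in _groups(data, labels):
        entropy = 0.0
        for column in zip(*rows):
            for count in Counter(column).values():
                p = count / len(rows)
                entropy -= p * log2(p)
        total += len(rows) / n * entropy
    return total


# Toy data: two attributes, two natural groups.
data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
pure = [0, 0, 1, 1]    # each cluster holds identical objects
mixed = [0, 1, 0, 1]   # each cluster mixes the two groups

print(kmodes_cost(data, pure), kmodes_cost(data, mixed))            # 0 4
print(category_utility(data, pure), category_utility(data, mixed))  # 0.5 0.0
print(expected_entropy(data, pure), expected_entropy(data, mixed))  # 0.0 2.0
```

On this toy data all three functions prefer the pure partition: the \(k\)-modes cost and the entropy are minimized, and the category utility is maximized. Each value is computed from within-cluster frequencies alone (plus, for category utility, a baseline that does not depend on the partition), which is the property the abstract refers to.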

Acknowledgments

The authors are very grateful to the editors and reviewers for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China (Nos. 71031006, 61305073, 61432011), the National Key Basic Research and Development Program of China (973) (No. 2013CB329404), and the Foundation of Doctoral Program Research of Ministry of Education of China (No. 20131401120001).

Author information

Authors and Affiliations

Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China
Liang Bai & Jiye Liang
Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Liang Bai

Corresponding author

Correspondence to Jiye Liang.

Additional information

Responsible editor: G. Karypis.

About this article

Cite this article

Bai, L., Liang, J. Cluster validity functions for categorical data: a solution-space perspective. Data Min Knowl Disc 29, 1560–1597 (2015). https://doi.org/10.1007/s10618-014-0387-5

Received: 26 April 2013
Accepted: 18 September 2014
Published: 02 October 2014
Issue Date: November 2015
DOI: https://doi.org/10.1007/s10618-014-0387-5
