Abstract
Cluster analysis methods are well-known multivariate techniques that have been used for many years. Their main applications concern clustering objects characterized by quantitative variables. For this case, various coefficients for evaluating clusterings and determining the number of clusters have been proposed. However, in some areas, e.g., segmentation of Internet users, the variables are often nominal or ordinal because they originate from questionnaire responses. We therefore address evaluation criteria for the case of categorical variables. Criteria based on variability measures are proposed. Instead of variance, which serves as the measure for quantitative variables, three measures for nominal variables are considered: the variability measure based on the modal frequency, Gini’s coefficient of mutability, and entropy. The proposed evaluation criteria are applied to a real dataset.
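To make the three variability measures named in the abstract concrete, the following Python sketch computes them for a single nominal variable and then averages a chosen measure within clusters. It assumes the common textbook definitions (one minus the modal relative frequency, one minus the sum of squared relative frequencies, and Shannon entropy with the natural logarithm); the paper itself may use normalized variants. All function names, the size-weighted aggregation, and the toy data are hypothetical and serve only as illustration.

```python
from collections import Counter
from math import log


def proportions(values):
    """Relative frequencies of the categories observed in `values`."""
    counts = Counter(values)
    n = len(values)
    return [c / n for c in counts.values()]


def modal_variability(values):
    """Assumed definition: 1 minus the relative frequency of the modal category."""
    return 1.0 - max(proportions(values))


def gini_mutability(values):
    """Assumed definition: 1 minus the sum of squared relative frequencies."""
    return 1.0 - sum(p * p for p in proportions(values))


def entropy(values):
    """Assumed definition: Shannon entropy with the natural logarithm."""
    return -sum(p * log(p) for p in proportions(values))


def within_cluster_variability(values, labels, measure):
    """Size-weighted mean of `measure` over clusters (an illustrative choice)."""
    n = len(values)
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    return sum(len(members) / n * measure(members) for members in clusters.values())


# Hypothetical questionnaire answers and a hypothetical two-cluster assignment.
answers = ["daily", "daily", "weekly", "never", "weekly", "daily"]
labels = [0, 0, 1, 1, 1, 0]

total_gini = gini_mutability(answers)
within_gini = within_cluster_variability(answers, labels, gini_mutability)
```

In this toy example the within-cluster value of the Gini measure is lower than the value for the whole sample, which is the kind of reduction in variability that a criterion built on these measures would reward.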
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rezankova, H., Loster, T., Husek, D. (2011). Evaluation of Categorical Data Clustering. In: Mugellini, E., Szczepaniak, P.S., Pettenati, M.C., Sokhn, M. (eds) Advances in Intelligent Web Mastering – 3. Advances in Intelligent and Soft Computing, vol 86. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-18029-3_18
DOI: https://doi.org/10.1007/978-3-642-18029-3_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-18028-6
Online ISBN: 978-3-642-18029-3
eBook Packages: Engineering (R0)