Abstract
This paper examines similarity measures for categorical data in hierarchical clustering that can handle variables with more than two categories and that aim to replace the simple matching approach commonly used in this area. These similarity measures take into account additional characteristics of a dataset, such as the frequency distribution of categories or the number of categories of a given variable. The paper has two main aims: first, to compare and evaluate the selected similarity measures with respect to the quality of the clusters they produce in hierarchical clustering; second, to propose new similarity measures for nominal variables. All the examined similarity measures are compared using the mean ranked scores of two internal evaluation coefficients. The analysis is performed on generated datasets, which makes it possible to determine the particular situations in which a given similarity measure is recommended.
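To illustrate the contrast drawn above, the following minimal sketch compares simple matching with one measure that uses the number of categories of each variable. It assumes the Eskin measure in its commonly cited form, where a mismatch on a variable with n categories scores n²/(n² + 2) instead of 0. The toy dataset and the function names `simple_matching` and `eskin` are illustrative assumptions and are not taken from the paper.

```python
# Toy dataset (assumed for illustration): 4 objects, 3 nominal variables.
data = [
    ("red", "small", "a"),
    ("red", "large", "b"),
    ("blue", "large", "c"),
    ("green", "small", "d"),
]

# Number of distinct categories observed for each variable.
n_categories = [len({row[j] for row in data}) for j in range(len(data[0]))]


def simple_matching(x, y):
    """Share of variables on which two objects take the same category."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def eskin(x, y, n_cats):
    """Eskin-style similarity (assumed form): a match scores 1, a mismatch
    on a variable with n categories scores n^2 / (n^2 + 2)."""
    total = 0.0
    for a, b, n in zip(x, y, n_cats):
        total += 1.0 if a == b else n**2 / (n**2 + 2)
    return total / len(x)


sm = simple_matching(data[0], data[1])
es = eskin(data[0], data[1], n_categories)
print(f"simple matching: {sm:.3f}, Eskin: {es:.3f}")
```

For hierarchical clustering, a similarity s of this kind would typically be converted to a dissimilarity such as 1 − s before a linkage method is applied. Unlike simple matching, the second measure gives a mismatch on a many-category variable a higher score than a mismatch on a binary variable, so it penalizes the former less.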
Funding
This paper was supported by the University of Economics, Prague under the IGA project no. F4/41/2016.
About this article
Cite this article
Šulc, Z., Řezanková, H. Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering. J Classif 36, 58–72 (2019). https://doi.org/10.1007/s00357-019-09317-5
DOI: https://doi.org/10.1007/s00357-019-09317-5