Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering

Šulc, Zdeněk; Řezanková, Hana

doi:10.1007/s00357-019-09317-5

Published: 02 April 2019

Volume 36, pages 58–72, (2019)

Journal of Classification

Abstract

This paper examines similarity measures for categorical data in hierarchical clustering that can handle variables with more than two categories and that aim to replace the simple matching approach commonly used in this area. These measures take into account additional characteristics of a dataset, such as the frequency distribution of categories or the number of categories of a given variable. The paper has two main aims: first, to compare and evaluate the selected similarity measures with regard to the quality of the clusters they produce in hierarchical clustering; second, to propose new similarity measures for nominal variables. All the examined measures are compared by cluster quality using the mean ranked scores of two internal evaluation coefficients. The analysis is performed on generated datasets, which makes it possible to determine the situations in which a particular similarity measure is recommended.
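
As a rough illustration of the contrast drawn in the abstract, the sketch below compares simple matching with a measure that uses the number of categories of each variable. The Eskin measure is used here only as a well-known example of that kind; its selection, the toy data, and all function names are assumptions made for illustration, since the abstract does not say which measures the paper examines.

```python
# Minimal sketch (illustrative assumptions throughout): simple matching
# versus the Eskin measure for objects described by nominal variables.

def simple_matching(x, y):
    """Share of variables on which the two objects take the same category."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def eskin(x, y, n_categories):
    """Eskin similarity: a match scores 1; a mismatch on a variable with
    n categories scores n^2 / (n^2 + 2), so mismatches on variables with
    many categories are penalised less."""
    total = 0.0
    for a, b, n in zip(x, y, n_categories):
        total += 1.0 if a == b else (n * n) / (n * n + 2)
    return total / len(x)

def category_counts(data):
    """Number of distinct categories observed for each variable."""
    return [len({row[k] for row in data}) for k in range(len(data[0]))]

# Hypothetical toy dataset: four objects, three nominal variables.
data = [
    ("red", "small", "a"),
    ("red", "large", "b"),
    ("blue", "small", "c"),
    ("green", "large", "d"),
]

n_cat = category_counts(data)
sm = simple_matching(data[0], data[1])
es = eskin(data[0], data[1], n_cat)
print(n_cat, round(sm, 4), round(es, 4))
```

Simple matching treats every mismatch alike, whereas the second measure lets the dataset's structure adjust the score. To feed either measure into a hierarchical clustering routine, one option (again an assumption, not taken from the paper) is to convert it to a dissimilarity such as 1 minus the similarity.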




Funding

This paper was supported by the University of Economics, Prague under the IGA project no. F4/41/2016.

Author information

Authors and Affiliations

Department of Statistics and Probability, University of Economics, Prague, W. Churchill sq. 4, 130 67, Prague 3, Czech Republic
Zdeněk Šulc & Hana Řezanková

Corresponding author

Correspondence to Zdeněk Šulc.

Appendix

Table 11 Rank orders of all combinations of similarity measures and linkage methods (PSFE)

Table 12 Rank orders of all combinations of similarity measures and linkage methods (PSFM)

About this article

Cite this article

Šulc, Z., Řezanková, H. Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering. J Classif 36, 58–72 (2019). https://doi.org/10.1007/s00357-019-09317-5

Issue Date: 15 April 2019
DOI: https://doi.org/10.1007/s00357-019-09317-5
