Abstract
In this paper, we present the second generation of the nomclust R package, which we developed for the hierarchical clustering of data containing nominal variables (nominal data). The package covers the entire hierarchical clustering process, from the calculation of a dissimilarity matrix, through the choice of a clustering method, to the evaluation of the final clusters. Throughout the clustering process, it uses similarity measures, clustering methods, and evaluation criteria developed solely for nominal data, which makes the package unique. The first part of the paper describes the theoretical background of the methods used in the package. The second part demonstrates the functionality of the package in several examples. The second generation of the package has been completely rewritten to fit more naturally into the workflow of R users. It includes new similarity measures and evaluation criteria, several graphical outputs, and support for S3 generic functions. Finally, owing to code optimizations, the time needed to compute the dissimilarity matrix has been substantially reduced.



Notes
The datasets were generated with four different numbers of variables (four, six, eight, ten) and three ranges of the number of categories (2–4, 2–6, 6–10); the number of cases varied from 300 to 700. Each dataset contained four clusters with a medium between-cluster distance. Each combination was replicated five times.
Acknowledgements
This paper was supported by the Prague University of Economics and Business under grant IGA F4/44/2018.
Cite this article
Sulc, Z., Cibulkova, J. & Rezankova, H. Nomclust 2.0: an R package for hierarchical clustering of objects characterized by nominal variables. Comput Stat 37, 2161–2184 (2022). https://doi.org/10.1007/s00180-022-01209-4
DOI: https://doi.org/10.1007/s00180-022-01209-4