Nomclust 2.0: an R package for hierarchical clustering of objects characterized by nominal variables

Sulc, Zdenek; Cibulkova, Jana; Rezankova, Hana

doi:10.1007/s00180-022-01209-4


Original paper
Published: 10 March 2022

Volume 37, pages 2161–2184, (2022)

Computational Statistics


Abstract

In this paper, we present the second generation of the nomclust R package, which we developed for the hierarchical clustering of data containing nominal variables (nominal data). The package covers the whole hierarchical clustering process, from the calculation of the dissimilarity matrix, through the choice of a clustering method, to the evaluation of the final clusters. Throughout this process, it uses similarity measures, clustering methods, and evaluation criteria developed solely for nominal data, which makes the package unique. The first part of the paper describes the theoretical background of the methods used in the package. The second part demonstrates the functionality of the package in several examples. The second generation of the package has been completely rewritten to fit the workflow of R users more naturally. It includes new similarity measures and evaluation criteria. We also added several graphical outputs and support for S3 generic functions. Finally, code optimizations substantially reduced the time needed to compute the dissimilarity matrix.
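The first two stages of the workflow described in the abstract (dissimilarity matrix, then a hierarchical clustering method) can be illustrated with a small sketch. It is not the nomclust API and is not taken from the paper: it is plain Python, it assumes the simple matching coefficient (the share of variables on which two objects differ) as the dissimilarity and average linkage as the clustering method, and all function names and the toy data are hypothetical. The evaluation stage is not covered.

```python
from itertools import combinations


def simple_matching_dissimilarity(a, b):
    """Share of variables on which two objects take different categories."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of variables")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dissimilarity_matrix(rows):
    """Symmetric matrix of pairwise dissimilarities with a zero diagonal."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = simple_matching_dissimilarity(rows[i], rows[j])
    return d


def average_linkage(d, k):
    """Naive agglomerative clustering (illustrative, not optimized).

    Repeatedly merges the two clusters with the smallest mean pairwise
    dissimilarity until only k clusters remain.
    """
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for p, q in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[p] for j in clusters[q]]
            dist = sum(d[i][j] for i, j in pairs) / len(pairs)
            if best is None or dist < best[0]:
                best = (dist, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # p < q, so index p stays valid
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: four objects described by three nominal variables.
toy = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
]

D = dissimilarity_matrix(toy)
clusters = average_linkage(D, k=2)
print(clusters)
```

In this toy example the two red objects and the two blue objects end up in separate clusters. Measures designed specifically for nominal data, such as those the package provides, would replace the simple matching step.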


Notes

The datasets were generated with four different numbers of variables (four, six, eight, ten) and three ranges of the number of categories (2–4, 2–6, 6–10), and the number of cases varied from 300 to 700. Each dataset contained four clusters with a medium between-cluster distance. Every combination was replicated five times.


Acknowledgements

This paper was supported by the Prague University of Economics and Business under grant IGA F4/44/2018.

Author information

Authors and Affiliations

Department of Statistics and Probability, Faculty of Informatics and Statistics, Prague University of Economics and Business, W. Churchill Sq. 1938/4, 130 67 Prague 3, Prague, Czech Republic
Zdenek Sulc, Jana Cibulkova & Hana Rezankova


Corresponding author

Correspondence to Zdenek Sulc.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Sulc, Z., Cibulkova, J. & Rezankova, H. Nomclust 2.0: an R package for hierarchical clustering of objects characterized by nominal variables. Comput Stat 37, 2161–2184 (2022). https://doi.org/10.1007/s00180-022-01209-4


Received: 02 July 2020
Accepted: 09 February 2022
Published: 10 March 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s00180-022-01209-4
