A method for k-means-like clustering of categorical data

Nguyen, Thu-Hien Thi; Dinh, Duy-Tai; Sriboonchitta, Songsak; Huynh, Van-Nam

doi:10.1007/s12652-019-01445-5

A method for k-means-like clustering of categorical data

Original Research
Published: 06 September 2019

Volume 14, pages 15011–15021, (2023)
Cite this article

Journal of Ambient Intelligence and Humanized Computing Aims and scope Submit manuscript

Thu-Hien Thi Nguyen¹,
Duy-Tai Dinh¹,
Songsak Sriboonchitta² &
…
Van-Nam Huynh ORCID: orcid.org/0000-0002-3860-7815¹

1421 Accesses
14 Citations
Explore all metrics

Abstract

Despite recent efforts, the challenge in clustering categorical and mixed data in the context of big data still remains due to the lack of inherently meaningful measure of similarity between categorical objects and the high computational complexity of existing clustering techniques. While k-means method is well known for its efficiency in clustering large data sets, working only on numerical data prohibits it from being applied for clustering categorical data. In this paper, we aim to develop a novel extension of k-means method for clustering categorical data, making use of an information theoretic-based dissimilarity measure and a kernel-based method for representation of cluster means for categorical objects. Such an approach allows us to formulate the problem of clustering categorical data in the fashion similar to k-means clustering, while a kernel-based definition of centers also provides an interpretation of cluster means being consistent with the statistical interpretation of the cluster means for numerical data. In order to demonstrate the performance of the new clustering method, a series of experiments on real datasets from UCI Machine Learning Repository are conducted and the obtained results are compared with several previously developed algorithms for clustering categorical data.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A k-Means-Like Algorithm for Clustering Categorical Data Using an Information Theoretic-Based Dissimilarity Measure

The Performance of Objective Functions for Clustering Categorical Data

Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

Notes

This paper is a significantly revised and extended version of Nguyen and Huynh (2016).

References

Aitchison J, Aitken CGG (1976) Multivariate binary discrimination by the kernel method. Biometrika 63(3):413–420. https://doi.org/10.1093/biomet/63.3.413
Article MathSciNet Google Scholar
Berkhin P (2002) Survey of clustering data mining techniques. Technical report
Blake CL, Merz CJ (1998) UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA. http://www.ics.uci.edu/~mlearn/MLRepository.html
Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the SIAM international conference on data mining, SDM—2008, pp 243–254. https://doi.org/10.1137/1.9781611972788.22
Chen L, Wang S (2013) Central clustering of categorical data with automated feature weighting. In: Proceedings of the twenty-third international joint conference on artificial intelligence, pp 1260–1266. https://www.ijcai.org/Proceedings/13/Papers/190.pdf
Fahad et al (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279. https://doi.org/10.1109/TETC.2014.2330519
Article Google Scholar
Ganti V, Gehrke J, Ramakrishnan R (1999) CATUS—clustering categorical data using summaries. In: Proceedings of the international conference on knowledge discovery and data mining, (San Diego, USA), pp 73–83. https://doi.org/10.1145/312129.312201
Gibson D, Kleinberg J, Raghavan P (2000) Clustering categorical data: an approach based on dynamic systems. VLDB J 8:222–236. https://doi.org/10.1007/s007780050005
Article Google Scholar
Guha S, Rastogi R, Shim K (2000) ROCK: a robust clustering algorithm for categorical attributes. Inf Syst 25(5):345–366. https://doi.org/10.1016/S0306-4379(00)00022-3
Article Google Scholar
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD international conference on management of data, New York, pp 73–84. https://doi.org/10.1145/276304.276312
Han J, Kamber M (2001) Data mining: concepts and techniques. Morgan Kaufmann Publishers, San Francisco
Google Scholar
Huang Z (1997) Clustering large data sets with mixed numeric and categorical values. In: Lu H, Motoda H, Liu H (eds) KDD: techniques and applications. World Scientific, Singapore, pp 21–34
Google Scholar
Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2:283–304. https://doi.org/10.1023/A:1009769707641
Article Google Scholar
Huang Z, Ng MK, Rong H, Li Z (2005) Automated variable weighting in \(k\)-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(5):657–668. https://doi.org/10.1109/TPAMI.2005.95
Article PubMed Google Scholar
Hubert L, Arabie P (1995) Comparing partitions. J Classif 2(1):193–218. https://doi.org/10.1007/BF01908075
Article Google Scholar
Ienco D, Pensa RG, Meo R (2012) From context to distance: learning dissimilarity for categorical data clustering. ACM Trans Knowl Discov Data 6(1):1–25. https://doi.org/10.1145/2133360.2133361
Article Google Scholar
Ienco D, Pensa RG, Meo R (2009) Context-based distance learning for categorical data clustering. In: Advances in intelligent data analysis viii: 8th international symposium. Springer, pp 83–94. https://doi.org/10.1007/978-3-642-03915-7_8
Kogan J, Teboulle M, Nicholas C (2005) Data driven similarity measures for \(k\)-means like clustering algorithms. Inf Retr 8(2):331–349. https://doi.org/10.1007/s10791-005-5666-8
Article Google Scholar
Kushwaha N, Pant M (2018) Fuzzy magnetic optimization clustering algorithm with its application to health care. J Ambient Intell Humaniz Comput. https://doi.org/10.1007/s12652-018-0941-x
Article Google Scholar
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
Article CAS PubMed ADS Google Scholar
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on machine learning, pp 296–304. http://dl.acm.org/citation.cfm?id=645527.657297
MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth symposium on mathematical statistics and probability, Berkeley, CA, 1967, vol 1, no. AD 669871, pp 281–297
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Book Google Scholar
Ng MK, Li MJ, Huang JZ, He Z (2007) On the impact of dissimilarity measure in \(k\)-modes clustering algorithm. IEEE Trans Pattern Anal Mach Intell 29:503–507. https://doi.org/10.1109/TPAMI.2007.53
Article PubMed Google Scholar
Nguyen TTH, Huynh VN (2016) A \(k\)-means like algorithm for clustering categorical data using an information theoretic-based dissimilarity measure. In: Foundations of information and knowledge systems—9th international symposium, FoIKS-2016. Springer, pp 115–130. https://doi.org/10.1007/978-3-319-30024-5_7
San OM, Huynh VN, Nakamori Y (2004) An alternative extension of the \(k\)-means algorithm for clustering categorical data. Int J Appl Math Comput Sci 14(2):241–247. http://matwbn.icm.edu.pl/ksiazki/amc/amc14/amc14212.pdf
Selim SZ, Ismail MA (1984) k-Means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6:81–87. https://doi.org/10.1109/TPAMI.1984.4767478
Article CAS PubMed Google Scholar
Shirkhorshidi A, Aghabozorgi S, Wah T, Herawan T, (2014) Big data clustering: a review. In: Computational science and its applications—ICCSA (2014) 14th international conference, Guimaraes, Portugal, Proceedings, part V, pp 707–720: https://doi.org/10.1007/978-3-319-09156-3_49
Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617. https://doi.org/10.1162/153244303321897735
Article MathSciNet Google Scholar
Sumangali K, Aswani Kumar Ch (2019) Concept lattice simplification in formal concept analysis using attribute clustering. J Ambient Intell Humaniz Comput 10:2327–2343. https://doi.org/10.1007/s12652-018-0831-2
Article Google Scholar
Tellaroli P, Bazzi M, Donato M, Brazzale AR, Draghici S (2016) Cross-clustering: a partial clustering algorithm with automatic estimation of the number of clusters. PLoS One 11(3):e0152333. https://doi.org/10.1371/journal.pone.0152333
Article CAS PubMed PubMed Central Google Scholar
Titterington DM (1980) A comparative study of kernel-based density estimates for categorical data. Technometrics 22(2):259–268. https://doi.org/10.1080/00401706.1980.10486142
Article Google Scholar
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193. https://doi.org/10.1007/s40745-015-0040-1
Article Google Scholar
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678. https://doi.org/10.1109/TNN.2005.845141
Article PubMed Google Scholar

Download references

Acknowledgements

This paper is based upon work supported in part by the Asian Office of Aerospace R&D (AOARD), Air Force Office of Scientific Research (Grant no. FA2386-17-1-4046). We would also like to thank the Associate Editor and the anonymous reviewers for their careful reading of our manuscript and thoughtful comments, which have helped us to improve our manuscript significantly.

Author information

Authors and Affiliations

Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Thu-Hien Thi Nguyen, Duy-Tai Dinh & Van-Nam Huynh
Faculty of Economics, Centre of Excellence in Econometrics, Chiang Mai University, Chiang Mai, Thailand
Songsak Sriboonchitta

Authors

Thu-Hien Thi Nguyen
View author publications
You can also search for this author in PubMed Google Scholar
Duy-Tai Dinh
View author publications
You can also search for this author in PubMed Google Scholar
Songsak Sriboonchitta
View author publications
You can also search for this author in PubMed Google Scholar
Van-Nam Huynh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Van-Nam Huynh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Nguyen, TH.T., Dinh, DT., Sriboonchitta, S. et al. A method for k-means-like clustering of categorical data. J Ambient Intell Human Comput 14, 15011–15021 (2023). https://doi.org/10.1007/s12652-019-01445-5

Download citation

Received: 15 February 2019
Accepted: 29 August 2019
Published: 06 September 2019
Issue Date: November 2023
DOI: https://doi.org/10.1007/s12652-019-01445-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A method for k-means-like clustering of categorical data

Abstract

Access this article

Similar content being viewed by others

A k-Means-Like Algorithm for Clustering Categorical Data Using an Information Theoretic-Based Dissimilarity Measure

The Performance of Objective Functions for Clustering Categorical Data

Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A method for k-means-like clustering of categorical data

Abstract

Access this article

Similar content being viewed by others

A k-Means-Like Algorithm for Clustering Categorical Data Using an Information Theoretic-Based Dissimilarity Measure

The Performance of Objective Functions for Clustering Categorical Data

Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation