Abstract
Despite recent efforts, clustering categorical and mixed data in the context of big data remains a challenge, owing to the lack of an inherently meaningful measure of similarity between categorical objects and the high computational complexity of existing clustering techniques. While the k-means method is well known for its efficiency in clustering large data sets, it works only on numerical data and therefore cannot be applied directly to categorical data. In this paper, we develop a novel extension of the k-means method for clustering categorical data, making use of an information-theoretic dissimilarity measure and a kernel-based method for representing cluster means for categorical objects. This approach allows us to formulate the problem of clustering categorical data in a fashion similar to k-means clustering, while the kernel-based definition of centers provides an interpretation of cluster means that is consistent with the statistical interpretation of cluster means for numerical data. To demonstrate the performance of the new clustering method, a series of experiments is conducted on real datasets from the UCI Machine Learning Repository, and the results are compared with those of several previously developed algorithms for clustering categorical data.
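As a rough illustration of what a kernel-based cluster center for a single categorical attribute can look like, the sketch below smooths the relative category frequencies within a cluster using a kernel in the style of Aitchison and Aitken (1976). The function name `kernel_center`, the bandwidth `lam`, and the toy data are assumptions made for illustration only; the abstract does not state which kernel or parameter values the method uses.

```python
from collections import Counter


def kernel_center(values, categories, lam):
    """Kernel-smoothed probability for each category of one attribute.

    Hypothetical helper, not taken from the paper. Each observed value
    contributes weight (1 - lam) to its own category and lam / (c - 1)
    to every other category, where c is the number of categories.
    """
    n = len(values)
    c = len(categories)
    if n == 0 or c < 2:
        raise ValueError("need at least one value and two categories")
    if not 0.0 <= lam <= (c - 1) / c:
        raise ValueError("lam must lie in [0, (c - 1) / c]")

    counts = Counter(values)
    center = {}
    for v in categories:
        f = counts[v] / n  # relative frequency of category v in the cluster
        center[v] = (1 - lam) * f + (lam / (c - 1)) * (1 - f)
    return center


# Toy cluster: one attribute with three possible categories.
center = kernel_center(
    ["red", "red", "blue", "red"],
    ["red", "blue", "green"],
    lam=0.3,
)
print(center)
```

With `lam = 0` the center reduces to the raw relative frequencies, and with `lam = (c - 1) / c` it becomes uniform over the categories, so the bandwidth controls how strongly the center is smoothed. In either case the entries sum to one, which is what allows the center to be read as a probability distribution over categories.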
Notes
This paper is a significantly revised and extended version of Nguyen and Huynh (2016).
References
Aitchison J, Aitken CGG (1976) Multivariate binary discrimination by the kernel method. Biometrika 63(3):413–420. https://doi.org/10.1093/biomet/63.3.413
Berkhin P (2002) Survey of clustering data mining techniques. Technical report
Blake CL, Merz CJ (1998) UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA. http://www.ics.uci.edu/~mlearn/MLRepository.html
Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the SIAM international conference on data mining, SDM—2008, pp 243–254. https://doi.org/10.1137/1.9781611972788.22
Chen L, Wang S (2013) Central clustering of categorical data with automated feature weighting. In: Proceedings of the twenty-third international joint conference on artificial intelligence, pp 1260–1266. https://www.ijcai.org/Proceedings/13/Papers/190.pdf
Fahad et al (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279. https://doi.org/10.1109/TETC.2014.2330519
Ganti V, Gehrke J, Ramakrishnan R (1999) CACTUS—clustering categorical data using summaries. In: Proceedings of the international conference on knowledge discovery and data mining, (San Diego, USA), pp 73–83. https://doi.org/10.1145/312129.312201
Gibson D, Kleinberg J, Raghavan P (2000) Clustering categorical data: an approach based on dynamical systems. VLDB J 8:222–236. https://doi.org/10.1007/s007780050005
Guha S, Rastogi R, Shim K (2000) ROCK: a robust clustering algorithm for categorical attributes. Inf Syst 25(5):345–366. https://doi.org/10.1016/S0306-4379(00)00022-3
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD international conference on management of data, New York, pp 73–84. https://doi.org/10.1145/276304.276312
Han J, Kamber M (2001) Data mining: concepts and techniques. Morgan Kaufmann Publishers, San Francisco
Huang Z (1997) Clustering large data sets with mixed numeric and categorical values. In: Lu H, Motoda H, Liu H (eds) KDD: techniques and applications. World Scientific, Singapore, pp 21–34
Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2:283–304. https://doi.org/10.1023/A:1009769707641
Huang Z, Ng MK, Rong H, Li Z (2005) Automated variable weighting in \(k\)-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(5):657–668. https://doi.org/10.1109/TPAMI.2005.95
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218. https://doi.org/10.1007/BF01908075
Ienco D, Pensa RG, Meo R (2012) From context to distance: learning dissimilarity for categorical data clustering. ACM Trans Knowl Discov Data 6(1):1–25. https://doi.org/10.1145/2133360.2133361
Ienco D, Pensa RG, Meo R (2009) Context-based distance learning for categorical data clustering. In: Advances in intelligent data analysis VIII: 8th international symposium. Springer, pp 83–94. https://doi.org/10.1007/978-3-642-03915-7_8
Kogan J, Teboulle M, Nicholas C (2005) Data driven similarity measures for \(k\)-means like clustering algorithms. Inf Retr 8(2):331–349. https://doi.org/10.1007/s10791-005-5666-8
Kushwaha N, Pant M (2018) Fuzzy magnetic optimization clustering algorithm with its application to health care. J Ambient Intell Humaniz Comput. https://doi.org/10.1007/s12652-018-0941-x
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on machine learning, pp 296–304. http://dl.acm.org/citation.cfm?id=645527.657297
MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth symposium on mathematical statistics and probability, Berkeley, CA, 1967, vol 1, no. AD 669871, pp 281–297
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Ng MK, Li MJ, Huang JZ, He Z (2007) On the impact of dissimilarity measure in \(k\)-modes clustering algorithm. IEEE Trans Pattern Anal Mach Intell 29:503–507. https://doi.org/10.1109/TPAMI.2007.53
Nguyen TTH, Huynh VN (2016) A \(k\)-means like algorithm for clustering categorical data using an information theoretic-based dissimilarity measure. In: Foundations of information and knowledge systems—9th international symposium, FoIKS-2016. Springer, pp 115–130. https://doi.org/10.1007/978-3-319-30024-5_7
San OM, Huynh VN, Nakamori Y (2004) An alternative extension of the \(k\)-means algorithm for clustering categorical data. Int J Appl Math Comput Sci 14(2):241–247. http://matwbn.icm.edu.pl/ksiazki/amc/amc14/amc14212.pdf
Selim SZ, Ismail MA (1984) k-Means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6:81–87. https://doi.org/10.1109/TPAMI.1984.4767478
Shirkhorshidi A, Aghabozorgi S, Wah T, Herawan T (2014) Big data clustering: a review. In: Computational science and its applications—ICCSA 2014, 14th international conference, Guimaraes, Portugal, Proceedings, part V, pp 707–720. https://doi.org/10.1007/978-3-319-09156-3_49
Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617. https://doi.org/10.1162/153244303321897735
Sumangali K, Aswani Kumar Ch (2019) Concept lattice simplification in formal concept analysis using attribute clustering. J Ambient Intell Humaniz Comput 10:2327–2343. https://doi.org/10.1007/s12652-018-0831-2
Tellaroli P, Bazzi M, Donato M, Brazzale AR, Draghici S (2016) Cross-clustering: a partial clustering algorithm with automatic estimation of the number of clusters. PLoS One 11(3):e0152333. https://doi.org/10.1371/journal.pone.0152333
Titterington DM (1980) A comparative study of kernel-based density estimates for categorical data. Technometrics 22(2):259–268. https://doi.org/10.1080/00401706.1980.10486142
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193. https://doi.org/10.1007/s40745-015-0040-1
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678. https://doi.org/10.1109/TNN.2005.845141
Acknowledgements
This paper is based upon work supported in part by the Asian Office of Aerospace R&D (AOARD), Air Force Office of Scientific Research (Grant no. FA2386-17-1-4046). We would also like to thank the Associate Editor and the anonymous reviewers for their careful reading of our manuscript and thoughtful comments, which have helped us to improve our manuscript significantly.
Cite this article
Nguyen, TH.T., Dinh, DT., Sriboonchitta, S. et al. A method for k-means-like clustering of categorical data. J Ambient Intell Human Comput 14, 15011–15021 (2023). https://doi.org/10.1007/s12652-019-01445-5
DOI: https://doi.org/10.1007/s12652-019-01445-5