k-CCM: A Center-Based Algorithm for Clustering Categorical Data with Missing Values

Dinh, Duy-Tai; Huynh, Van-Nam

doi:10.1007/978-3-030-00202-2_22

Conference paper
First Online: 16 September 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11144)

Abstract

This paper addresses the problem of clustering categorical data with missing values. Specifically, we design a new framework that imputes missing values and assigns objects to appropriate clusters. For the imputation step, we use a decision tree-based method to fill in missing values. For the clustering step, we use a kernel density estimation approach to define cluster centers and an information theoretic-based dissimilarity measure to quantify the differences between objects. We then propose k-CCM, a center-based algorithm for clustering categorical data with missing values. An experimental evaluation on real-life datasets with missing values compares the clustering quality of the proposed algorithm with that of other popular clustering algorithms. Overall, the experimental results show that the proposed algorithm performs comparably to the other algorithms on all datasets.
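
The two-step idea in the abstract (impute, then cluster around probabilistic centers) can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions made for brevity: mode imputation stands in for the decision tree-based imputation, the centers use an Aitchison-Aitken-style kernel with a hypothetical bandwidth `lam`, and the dissimilarity (one minus the probability the center gives to each value) is a simple stand-in for the information theoretic-based measure. All function names are hypothetical.

```python
from collections import Counter
import random


def mode_impute(rows):
    """Stand-in imputation (NOT decision tree-based): fill None with the column mode.

    Assumes every column has at least one observed value.
    """
    cols = list(zip(*rows))
    modes = [Counter(v for v in col if v is not None).most_common(1)[0][0]
             for col in cols]
    return [tuple(m if v is None else v for v, m in zip(row, modes))
            for row in rows]


def smoothed_center(members, domains, lam=0.1):
    """Cluster center as one smoothed probability table per attribute.

    Assumed Aitchison-Aitken-style kernel: weight 1 - lam on a matching
    category, lam / (c - 1) on each of the other c - 1 categories.
    """
    n = len(members)
    center = []
    for j, domain in enumerate(domains):
        counts = Counter(row[j] for row in members)
        c = len(domain)
        probs = {}
        for v in domain:
            f = counts[v] / n  # raw relative frequency in the cluster
            probs[v] = f if c == 1 else (1 - lam) * f + lam / (c - 1) * (1 - f)
        center.append(probs)
    return center


def dissimilarity(row, center):
    """Stand-in measure: sum over attributes of 1 - P_center(value)."""
    return sum(1.0 - probs[v] for v, probs in zip(row, center))


def cluster(rows, k, iters=20, seed=0):
    """k-means-style loop: assign to the nearest center, then update centers."""
    rows = mode_impute(rows)
    domains = [sorted(set(col)) for col in zip(*rows)]
    rng = random.Random(seed)
    centers = [smoothed_center([r], domains) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dissimilarity(r, centers[c]))
                      for r in rows]
        if new_labels == labels:  # assignments are stable, stop
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = smoothed_center(members, domains)
    return labels, centers


# Toy data; None marks a missing value.
rows = [("a", "x", "p"), ("a", "x", None), ("a", None, "p"),
        ("b", "y", "q"), ("b", "y", "q"), (None, "y", "q")]
labels, centers = cluster(rows, k=2)
```

Representing each center as a probability table rather than a single mode lets the distance to a center vary smoothly with how common a value is in the cluster, which is the motivation for kernel-smoothed centers in this kind of algorithm.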

Acknowledgment

This paper is based upon work supported in part by the Air Force Office of Scientific Research/Asian Office of Aerospace Research and Development (AFOSR/AOARD) under award number FA2386-17-1-4046.

Author information

Authors and Affiliations

School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Duy-Tai Dinh & Van-Nam Huynh

Corresponding author

Correspondence to Duy-Tai Dinh.

Editor information

Editors and Affiliations

Maynooth University, Maynooth, Ireland
Vicenç Torra
Department Management Science, Tamagawa University, Tokyo, Japan
Yasuo Narukawa
University of the Balearic Islands, Palma de Mallorca, Spain
Isabel Aguiló
University of the Balearic Islands, Palma de Mallorca, Baleares, Spain
Manuel González-Hidalgo

About this paper

Cite this paper

Dinh, DT., Huynh, VN. (2018). k-CCM: A Center-Based Algorithm for Clustering Categorical Data with Missing Values. In: Torra, V., Narukawa, Y., Aguiló, I., González-Hidalgo, M. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2018. Lecture Notes in Computer Science, vol 11144. Springer, Cham. https://doi.org/10.1007/978-3-030-00202-2_22

DOI: https://doi.org/10.1007/978-3-030-00202-2_22
Published: 16 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00201-5
Online ISBN: 978-3-030-00202-2
eBook Packages: Computer Science, Computer Science (R0)
