A Unified Metric for Categorical and Numerical Attributes in Data Clustering

Cheung, Yiu-ming; Jia, Hong

doi:10.1007/978-3-642-37456-2_12

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7819)

Abstract

Most existing clustering approaches apply to purely numerical or purely categorical data, but not to both. Clustering mixed data composed of numerical and categorical attributes is nontrivial because there is an awkward gap between the similarity metrics for categorical and numerical data. This paper therefore presents a general clustering framework based on the concept of object-cluster similarity and gives a unified similarity metric that can be applied directly to data with categorical, numerical, or mixed attributes. Accordingly, an iterative clustering algorithm is developed, and its efficacy is demonstrated experimentally on different benchmark data sets.
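
The abstract does not state the metric itself, so the following is only a minimal sketch of what an object-cluster similarity on mixed data and the iterative reassignment built on it could look like. The scoring rules are assumptions made for illustration, not the paper's definitions: a categorical attribute scores the share of cluster members holding the same value, the numerical attributes together score an exponential decay of the squared distance to the cluster mean, and the scores are averaged with equal weight. The names `similarity`, `cluster`, `cat_idx`, and `num_idx` are hypothetical.

```python
import math
from collections import Counter


def similarity(obj, members, cat_idx, num_idx):
    """Illustrative object-cluster similarity in [0, 1] (assumed form, not the paper's)."""
    if not members:
        return 0.0  # an empty cluster attracts nothing
    scores = []
    for j in cat_idx:
        # Categorical attribute: share of cluster members with the same value.
        counts = Counter(m[j] for m in members)
        scores.append(counts[obj[j]] / len(members))
    if num_idx:
        # Numerical attributes, treated as one score: decays with the
        # squared distance between the object and the cluster mean.
        d2 = 0.0
        for j in num_idx:
            mean = sum(m[j] for m in members) / len(members)
            d2 += (obj[j] - mean) ** 2
        scores.append(math.exp(-0.5 * d2))
    return sum(scores) / len(scores)


def cluster(data, k, cat_idx, num_idx, max_iter=100):
    """Reassign every object to its most similar cluster until labels stop changing."""
    labels = [i % k for i in range(len(data))]  # deterministic round-robin start (assumption)
    for _ in range(max_iter):
        groups = [[x for x, lab in zip(data, labels) if lab == c] for c in range(k)]
        new_labels = [
            max(range(k), key=lambda c: similarity(x, groups[c], cat_idx, num_idx))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Two mixed-attribute groups: one categorical column (index 0), one numerical (index 1).
data = [("a", 0.0), ("a", 0.1), ("b", 5.0), ("b", 5.1)]
labels = cluster(data, k=2, cat_idx=[0], num_idx=[1])
```

Because both kinds of attribute contribute a score on the same [0, 1] scale, the same `similarity` function works unchanged when `cat_idx` or `num_idx` is empty, which is the practical appeal of a unified metric. Numerical attributes would normally need rescaling before use; that step is omitted here.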

Author information

Authors and Affiliations

Department of Computer Science and Institute of Computational and Theoretical Studies, Hong Kong Baptist University, Hong Kong SAR, China
Yiu-ming Cheung & Hong Jia
United International College, Beijing Normal University - Hong Kong Baptist University, Zhuhai, China
Yiu-ming Cheung

Editor information

Editors and Affiliations

School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
Dept. of Computer Science and Information Engineering, Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Faculty of Engineering and Information Technology, University of Technology Sydney, Broadway, P.O. Box 123, 2007, Sydney, NSW, Australia
Longbing Cao & Guandong Xu
Asian Office of Aerospace Research and Development (AOARD), Air Force Office of Scientific Research (AFOSR), Air Force Research Laboratory USA, Osaka University, 7-23-17 Roppongi, 106-0032, Minato-ku, Tokyo, Japan
Hiroshi Motoda

About this paper

Cite this paper

Cheung, Ym., Jia, H. (2013). A Unified Metric for Categorical and Numerical Attributes in Data Clustering. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37456-2_12

DOI: https://doi.org/10.1007/978-3-642-37456-2_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37455-5
Online ISBN: 978-3-642-37456-2
eBook Packages: Computer Science, Computer Science (R0)
