Clustering with Domain Value Dissimilarity for Categorical Data

Lee, Jeonghoon; Lee, Yoon-Joon; Park, Minho

doi:10.1007/978-3-642-03067-3_25

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5633)

Included in the following conference series:

Industrial Conference on Data Mining

Abstract

Clustering is a representative grouping process used to uncover hidden information and to understand the characteristics of a dataset as a basis for further analysis. The notion of similarity and dissimilarity between objects is fundamental to clustering, and the measure chosen largely determines the quality of the results. When the attributes of the data are categorical, it is not simple to quantify the dissimilarity of data objects that have unimportant attributes or synonymous values. We propose a new way to quantify the dissimilarity of objects using the distribution information of the data correlated with each categorical value. Our method discovers intrinsic relationships among values and measures the dissimilarity of objects effectively. Because the approach is not tightly coupled to any particular clustering algorithm, it can be applied flexibly to various algorithms. Experiments on both synthetic and real datasets show the validity and effectiveness of the method. When our method is applied only to traditional clustering algorithms, the results are considerably better than those of previous methods.
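
The abstract does not spell out the dissimilarity formula, so the following is only a minimal sketch of the general idea of comparing categorical values through the distribution of the data that co-occurs with them. It is not the authors' method. As an assumption made purely for illustration, two values of one attribute are compared by the total variation distance between their conditional distributions over each other attribute, averaged over those attributes. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_distributions(rows, attr):
    """For each value of column `attr`, estimate the distribution of
    co-occurring values in every other column (illustrative assumption)."""
    n_attrs = len(rows[0])
    counts = defaultdict(lambda: [Counter() for _ in range(n_attrs)])
    totals = Counter()
    for row in rows:
        value = row[attr]
        totals[value] += 1
        for j, other in enumerate(row):
            if j != attr:
                counts[value][j][other] += 1
    return {
        value: [{w: c / totals[value] for w, c in counter.items()}
                for counter in per_attr]
        for value, per_attr in counts.items()
    }


def value_dissimilarity(rows, attr, a, b):
    """Average total variation distance between the co-occurrence
    distributions of values `a` and `b`; both must occur in `rows`."""
    if a == b:
        return 0.0
    dists = conditional_distributions(rows, attr)
    others = [j for j in range(len(rows[0])) if j != attr]
    total = 0.0
    for j in others:
        p, q = dists[a][j], dists[b][j]
        keys = set(p) | set(q)
        total += 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
    return total / len(others)


def object_dissimilarity(rows, x, y):
    """Mean of the per-attribute value dissimilarities of two objects."""
    return sum(value_dissimilarity(rows, i, x[i], y[i])
               for i in range(len(x))) / len(x)


# Hypothetical toy data: "car" and "automobile" behave like synonyms
# because they co-occur with the same value in the second column.
rows = [
    ("car", "road"),
    ("automobile", "road"),
    ("car", "road"),
    ("boat", "sea"),
    ("ship", "sea"),
]
print(value_dissimilarity(rows, 0, "car", "automobile"))
print(value_dissimilarity(rows, 0, "car", "boat"))
```

In this toy data the synonymous pair is treated as identical, whereas simple matching would count the two values as fully different. Because the function only produces pairwise dissimilarities, its output could be passed to any clustering algorithm that accepts a precomputed dissimilarity matrix, which is in line with the loose coupling the abstract describes.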

Author information

Authors and Affiliations

School of EECS, Division of Computer Science, Korea Advanced Institute of Science and Technology, Daejeon, 350-701, Republic of Korea
Jeonghoon Lee & Yoon-Joon Lee
Information Technology Department, The Bank of Korea, Seoul, 135-080, Republic of Korea
Minho Park

Editor information

Editors and Affiliations

Institut für Bildverarbeitung und angewandte Informatik, Körnerstr. 10, 04107, Leipzig, Deutschland
Petra Perner

About this paper

Cite this paper

Lee, J., Lee, YJ., Park, M. (2009). Clustering with Domain Value Dissimilarity for Categorical Data. In: Perner, P. (eds) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2009. Lecture Notes in Computer Science, vol 5633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03067-3_25

DOI: https://doi.org/10.1007/978-3-642-03067-3_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03066-6
Online ISBN: 978-3-642-03067-3
eBook Packages: Computer Science, Computer Science (R0)