Abstract
Clustering categorical data is an important and challenging data analysis task. In this paper, we explore the use of kernel K-means to cluster categorical data. We propose a new kernel function based on Hamming distance that embeds categorical data in a constructed feature space, where the clustering is conducted. We experimentally evaluate the quality of the solutions produced by kernel K-means on real datasets. The results indicate that kernel K-means with our proposed kernel function is a feasible way to discover clusters embedded in categorical data.
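As a rough illustration of the idea, the following Python sketch runs kernel K-means on categorical records using a Hamming-based kernel. The abstract does not state the proposed kernel, so the form exp(-gamma * d_H(x, y)) used here is an assumption chosen because it is a valid positive semidefinite kernel; it should not be read as the paper's kernel function. The names `hamming_kernel`, `kernel_kmeans`, `gamma`, and `seeds`, as well as the toy records, are likewise hypothetical.

```python
import math


def hamming(x, y):
    """Number of attributes on which two equal-length records differ."""
    return sum(a != b for a, b in zip(x, y))


def hamming_kernel(x, y, gamma=1.0):
    """Assumed kernel exp(-gamma * Hamming distance); not the paper's kernel."""
    return math.exp(-gamma * hamming(x, y))


def kernel_kmeans(data, seeds, gamma=1.0, max_iter=100):
    """Cluster records with kernel K-means; `seeds` are indices of initial centers."""
    n, k = len(data), len(seeds)
    # Precompute the kernel (Gram) matrix once.
    K = [[hamming_kernel(data[i], data[j], gamma) for j in range(n)]
         for i in range(n)]
    # Initial assignment: each record joins the seed it is most similar to.
    labels = [max(range(k), key=lambda c: K[i][seeds[c]]) for i in range(n)]

    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Squared norm of each cluster mean in feature space.
        within = [
            sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # skip clusters that have become empty
                # Squared feature-space distance from record i to the mean of cluster c.
                d = K[i][i] - 2.0 * sum(K[i][j] for j in m) / len(m) + within[c]
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Toy categorical records (made up for illustration only).
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "medium", "square"),
    ("green", "large", "square"),
]
labels = kernel_kmeans(records, seeds=[0, 3])
print(labels)
```

The cluster means are never formed explicitly: every distance is computed from kernel values alone, which is what lets K-means operate in the constructed feature space. As with ordinary K-means, the result depends on the initial assignment, so in practice several initializations would typically be compared.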
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Couto, J. (2005). Kernel K-Means for Categorical Data. In: Famili, A.F., Kok, J.N., Peña, J.M., Siebes, A., Feelders, A. (eds) Advances in Intelligent Data Analysis VI. IDA 2005. Lecture Notes in Computer Science, vol 3646. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11552253_5
DOI: https://doi.org/10.1007/11552253_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28795-7
Online ISBN: 978-3-540-31926-9
eBook Packages: Computer Science, Computer Science (R0)