Abstract
Semi-supervised clustering can yield considerable improvement over unsupervised clustering. Most existing semi-supervised clustering algorithms are non-hierarchical, derived from the k-means algorithm, and designed for numeric data. Clustering categorical data is challenging because such data lack an inherently meaningful similarity measure, and semi-supervised clustering in the categorical domain remains unexplored. In this paper, we propose a novel semi-supervised divisive hierarchical algorithm for categorical data. Our algorithm is parameter-free, fully automatic, and effective in exploiting background knowledge, given as instance-level constraints, to improve the quality of the resulting dendrogram. Experiments on real-life data demonstrate the promising performance of our algorithm.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xiong, T., Wang, S., Mayers, A., Monga, E. (2011). Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_22
DOI: https://doi.org/10.1007/978-3-642-20841-6_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20840-9
Online ISBN: 978-3-642-20841-6
eBook Packages: Computer Science, Computer Science (R0)