Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data

Xiong, Tengke; Wang, Shengrui; Mayers, André; Monga, Ernest

doi:10.1007/978-3-642-20841-6_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6634)

Included in the following conference series: Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Semi-supervised clustering can yield considerable improvement over unsupervised clustering. Most existing semi-supervised clustering algorithms are non-hierarchical, derived from the k-means algorithm, and designed for numeric data. Clustering categorical data is challenging because it lacks an inherently meaningful similarity measure, and semi-supervised clustering in the categorical domain remains unexplored. In this paper, we propose a novel semi-supervised divisive hierarchical algorithm for categorical data. Our algorithm is parameter-free, fully automatic, and effective in exploiting background knowledge in the form of instance-level constraints to improve the quality of the resulting dendrogram. Experiments on real-life data demonstrate the promising performance of our algorithm.

Author information

Authors and Affiliations

Department of Computer Science, University of Sherbrooke, Sherbrooke, QC, J1K 2R1, Canada
Tengke Xiong, Shengrui Wang & André Mayers
Department of Mathematics, University of Sherbrooke, Sherbrooke, QC, J1K 2R1, Canada
Ernest Monga

Editor information

Editors and Affiliations

Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, 518055, Shenzhen, China
Joshua Zhexue Huang
Faculty of Engineering and Information Technology, Center for Quantum Computation and Intelligent Systems, Data Sciences and Knowledge Discovery Lab, University of Technology Sydney, NSW 2007, Sydney, Australia
Longbing Cao
Department of Computer Science and Engineering, University of Minnesota, MN 55455, Minneapolis, USA
Jaideep Srivastava

About this paper

Cite this paper

Xiong, T., Wang, S., Mayers, A., Monga, E. (2011). Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_22

DOI: https://doi.org/10.1007/978-3-642-20841-6_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20840-9
Online ISBN: 978-3-642-20841-6
eBook Packages: Computer Science, Computer Science (R0)