research-article

A clustering framework based on subjective and objective validity criteria

Published: 02 February 2008

Abstract

Clustering, as an unsupervised learning process, is a challenging problem, especially for high-dimensional datasets. Clustering result quality can benefit from user constraints and objective validity assessment. In this article, we propose a semisupervised framework for learning the weighted Euclidean subspace in which the best clustering can be achieved. Our approach capitalizes on: (i) user constraints; and (ii) the quality of intermediate clustering results in terms of their structural properties. The proposed framework takes the clustering algorithm and the validity measure as its parameters. We develop and discuss algorithms for learning and tuning the weights of contributing dimensions and for defining the “best” clustering obtained by satisfying user constraints. Experimental results on benchmark datasets demonstrate the superiority of the proposed approach in terms of improved clustering accuracy.
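
The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes k-means as the pluggable clustering algorithm, a simple scatter-ratio index as a stand-in validity measure, must-link and cannot-link index pairs as the user constraints, and a greedy multiplicative search over the dimension weights. All function names and parameters here are hypothetical.

```python
import math
import random


def weighted_distance(x, y, w):
    """Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - y_j)^2)."""
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y)))


def centroid(pts):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(pts) for col in zip(*pts))


def kmeans(points, k, w, iters=20, seed=0):
    """Plain k-means under the weighted distance (assumed clustering algorithm)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: weighted_distance(p, centers[c], w))
            for p in points
        ]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return labels


def constraint_accuracy(labels, must_link, cannot_link):
    """Fraction of user constraints that the clustering satisfies."""
    ok = sum(labels[i] == labels[j] for i, j in must_link)
    ok += sum(labels[i] != labels[j] for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    return ok / total if total else 1.0


def validity(points, labels, w):
    """Stand-in validity index: 1 - within-cluster scatter / total scatter."""
    mean = centroid(points)
    total = sum(weighted_distance(p, mean, w) ** 2 for p in points)
    if total == 0:
        return 0.0
    within = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        cm = centroid(members)
        within += sum(weighted_distance(p, cm, w) ** 2 for p in members)
    return 1.0 - within / total


def score(points, k, w, must_link, cannot_link):
    """Combine the subjective (constraints) and objective (validity) criteria."""
    labels = kmeans(points, k, w)
    s = constraint_accuracy(labels, must_link, cannot_link) + validity(points, labels, w)
    return s, labels


def learn_weights(points, k, must_link, cannot_link, factors=(0.5, 2.0), rounds=3):
    """Greedy search: rescale one weight at a time, keep strict improvements."""
    w = [1.0] * len(points[0])
    best, labels = score(points, k, w, must_link, cannot_link)
    for _ in range(rounds):
        for j in range(len(w)):
            for f in factors:
                trial = list(w)
                trial[j] *= f
                s, l = score(points, k, trial, must_link, cannot_link)
                if s > best:
                    w, best, labels = trial, s, l
    return w, labels, best
```

A call such as `learn_weights(points, 2, must_link, cannot_link)` returns the learned weights, the cluster labels, and the combined score. Because the search only accepts strict improvements, the returned score is never lower than the score obtained with uniform weights.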

Cited By

  • (2023) Behavioral Economics in IR. A Behavioral Economics Approach to Interactive Information Retrieval, pp. 155-180. DOI: 10.1007/978-3-031-23229-9_6. Online publication date: 18-Feb-2023
  • (2018) The Evolutionary Landscape of Localized Prostate Cancers Drives Clinical Aggression. Cell 173:4, pp. 1003-1013.e15. DOI: 10.1016/j.cell.2018.03.029. Online publication date: May-2018
  • (2018) IF-CLARANS: Intuitionistic Fuzzy Algorithm for Big Data Clustering. Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations, pp. 39-50. DOI: 10.1007/978-3-319-91476-3_4. Online publication date: 18-May-2018

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 1, Issue 4
January 2008
143 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1324172
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 February 2008
Accepted: 01 August 2007
Revised: 01 March 2007
Received: 01 August 2006
Published in TKDD Volume 1, Issue 4

Author Tags

  1. Semisupervised learning
  2. cluster validity
  3. data mining
  4. similarity measure learning
  5. space learning

Qualifiers

  • Research-article
  • Research
  • Refereed

Citations

  • (2018) Clustering Validity. Encyclopedia of Database Systems, pp. 499-505. DOI: 10.1007/978-1-4614-8265-9_616. Online publication date: 7-Dec-2018
  • (2016) Clustering Validity. Encyclopedia of Database Systems, pp. 1-7. DOI: 10.1007/978-1-4899-7993-3_616-2. Online publication date: 25-Nov-2016
  • (2015) Supervised learning using a symmetric bilinear form for record linkage. Information Fusion 26:C, pp. 144-153. DOI: 10.1016/j.inffus.2014.11.004. Online publication date: 1-Nov-2015
  • (2014) Enteromorpha Prolifera Detection with MODIS Image Using Semi-supervised Clustering. Journal of Computers 9:5. DOI: 10.4304/jcp.9.5.1259-1265. Online publication date: 1-May-2014
  • (2014) A new knowledge-based constrained clustering approach. Applied Soft Computing 24:C, pp. 316-327. DOI: 10.1016/j.asoc.2014.06.002. Online publication date: 1-Nov-2014
  • (2012) Improvement on Agglomerative Hierarchical Clustering Algorithm Based on Tree Data Structure with Bidirectional Approach. Proceedings of the 2012 Third International Conference on Intelligent Systems Modelling and Simulation, pp. 25-30. DOI: 10.1109/ISMS.2012.13. Online publication date: 8-Feb-2012
  • (2012) A New Selective Clustering Ensemble Algorithm. Proceedings of the 2012 IEEE Ninth International Conference on e-Business Engineering, pp. 45-49. DOI: 10.1109/ICEBE.2012.17. Online publication date: 9-Sep-2012
