
A clustering framework based on subjective and objective validity criteria

Authors:

M. Vazirgiannis,

C. Domeniconi

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 1, Issue 4

Article No.: 4, Pages 1 - 25

https://doi.org/10.1145/1324172.1324176

Published: 02 February 2008

Abstract

Clustering, as an unsupervised learning process, is a challenging problem, especially for high-dimensional datasets. The quality of a clustering result can benefit from user constraints and objective validity assessment. In this article, we propose a semisupervised framework for learning the weighted Euclidean subspace in which the best clustering can be achieved. Our approach capitalizes on: (i) user constraints; and (ii) the quality of intermediate clustering results in terms of their structural properties. The proposed framework takes the clustering algorithm and the validity measure as its parameters. We develop and discuss algorithms for learning and tuning the weights of the contributing dimensions and for defining the “best” clustering obtained by satisfying user constraints. Experimental results on benchmark datasets show that the proposed approach improves clustering accuracy.
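The two ingredients named in the abstract, a weighted Euclidean distance and user constraints, can be illustrated with a small sketch. This is not the article's algorithm: the distance form (weights applied to squared per-dimension differences), the must-link/cannot-link encoding of user constraints, and the helper names `weighted_euclidean`, `assign`, and `constraint_accuracy` are all assumptions made for illustration. The sketch only shows how changing the dimension weights can change which constraints a nearest-center assignment satisfies.

```python
import math

def weighted_euclidean(x, y, w):
    """Assumed form: sqrt(sum_i w_i * (x_i - y_i)^2), with all w_i >= 0."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def assign(points, centers, w):
    """Label each point with the index of its nearest center under weights w."""
    return [
        min(range(len(centers)), key=lambda k: weighted_euclidean(p, centers[k], w))
        for p in points
    ]

def constraint_accuracy(labels, must_link, cannot_link):
    """Fraction of pairwise constraints (hypothetical encoding) that are satisfied."""
    satisfied = sum(labels[i] == labels[j] for i, j in must_link)
    satisfied += sum(labels[i] != labels[j] for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    return satisfied / total if total else 1.0

# Toy data: the first dimension carries the grouping, the second is noise.
points = [(0.0, 0.0), (0.2, 5.0), (3.0, 0.1), (3.1, 5.2)]
centers = [(0.0, 0.0), (3.0, 5.0)]
must_link = [(0, 1), (2, 3)]   # pairs the user says belong together
cannot_link = [(0, 2)]         # pairs the user says must be separated

for w in [(1.0, 1.0), (1.0, 0.0)]:
    labels = assign(points, centers, w)
    print(w, labels, constraint_accuracy(labels, must_link, cannot_link))
```

With equal weights the noisy second dimension dominates the assignment and none of the constraints hold; setting its weight to zero satisfies all of them. A full framework of the kind the abstract describes would also score each candidate weighting with a cluster validity measure, which this sketch leaves out.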

Index Terms

A clustering framework based on subjective and objective validity criteria
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

ACM Transactions on Knowledge Discovery from Data Volume 1, Issue 4

January 2008

143 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/1324172

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 February 2008

Accepted: 01 August 2007

Revised: 01 March 2007

Received: 01 August 2006

Published in TKDD Volume 1, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Sixth Framework Programme
