Abstract
Most structured data in real-life applications are stored in relational databases containing multiple semantically linked relations. Unlike clustering in a single table, clustering objects in a relational database usually involves a large number of features conveying very different semantic information, and using all features indiscriminately is unlikely to generate meaningful results. Because the user knows her goal of clustering, we propose a new approach called CrossClus, which performs multi-relational clustering under user guidance. Unlike semi-supervised clustering, which requires the user to provide a training set, we minimize the user’s effort by using a very simple form of user guidance. The user is only required to select one or a small set of features that are pertinent to the clustering goal, and CrossClus searches for other pertinent features in multiple relations. Each feature is evaluated by whether it clusters objects in a similar way to the user-specified features. We design efficient and accurate approaches for both feature selection and object clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of CrossClus.
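To make the feature-evaluation idea concrete, the following is a minimal sketch of one way to score a candidate feature by how similarly it groups objects compared with a user-specified feature. It assumes categorical features and a simple pair-agreement measure compared by cosine similarity; this measure, the function names, and the toy data are illustrative assumptions and are not taken from the abstract, which does not state the measure CrossClus actually uses.

```python
from itertools import combinations
from math import sqrt


def pair_agreement(values):
    """Return a 0/1 vector with one entry per object pair:
    1 if the feature gives both objects the same value, else 0."""
    return [1 if values[i] == values[j] else 0
            for i, j in combinations(range(len(values)), 2)]


def feature_similarity(values_a, values_b):
    """Cosine similarity between the pair-agreement vectors of two features.
    (Assumed measure for illustration only.)"""
    a = pair_agreement(values_a)
    b = pair_agreement(values_b)
    dot = sum(x * y for x, y in zip(a, b))
    # Entries are 0/1, so each sum equals the squared norm of the vector.
    norm = sqrt(sum(a) * sum(b))
    return dot / norm if norm else 0.0


def rank_features(user_feature, candidates):
    """Sort candidate features by similarity to the user-specified feature."""
    scores = {name: feature_similarity(user_feature, vals)
              for name, vals in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: six objects, one user-specified feature,
# and two candidate features drawn from other relations.
research_area = ["db", "db", "ai", "ai", "th", "th"]
candidates = {
    "advisor_group": ["g1", "g1", "g2", "g2", "g3", "g3"],
    "birth_month": ["jan", "feb", "jan", "mar", "feb", "mar"],
}

ranking = rank_features(research_area, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy setting, a candidate that groups the objects exactly as the user-specified feature does scores highest, while a candidate whose groups never coincide with it scores zero, so only the former would be treated as pertinent.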
Additional information
Responsible editor: Eamonn Keogh.
The work was supported in part by the U.S. National Science Foundation NSF IIS-03-13678 and NSF BDI-05-15813, and an IBM Faculty Award. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect views of the funding agencies.
About this article
Cite this article
Yin, X., Han, J. & Yu, P.S. CrossClus: user-guided multi-relational clustering. Data Min Knowl Disc 15, 321–348 (2007). https://doi.org/10.1007/s10618-007-0072-z
DOI: https://doi.org/10.1007/s10618-007-0072-z