CrossClus: user-guided multi-relational clustering

Data Mining and Knowledge Discovery

Abstract

Most structured data in real-life applications are stored in relational databases containing multiple semantically linked relations. Unlike clustering in a single table, clustering objects in a relational database usually involves a large number of features that convey very different semantic information, and using all features indiscriminately is unlikely to generate meaningful results. Because the user knows her clustering goal, we propose a new approach called CrossClus, which performs multi-relational clustering under the user’s guidance. Unlike semi-supervised clustering, which requires the user to provide a training set, we minimize the user’s effort by using a very simple form of guidance. The user is only required to select one feature, or a small set of features, pertinent to the clustering goal, and CrossClus searches for other pertinent features in multiple relations. Each feature is evaluated by whether it clusters objects in a way similar to the user-specified features. We design efficient and accurate approaches for both feature selection and object clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of CrossClus.
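
The idea of scoring a candidate feature by how similarly it groups objects to a user-chosen feature can be illustrated with a small sketch. The abstract does not state which agreement measure CrossClus uses, so the measure below (cosine similarity between the 0/1 "same value" indicators over all object pairs) is an assumption made for illustration only. The function names, the single-table simplification, and the toy data are likewise hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pair_similarities(values):
    """For every unordered pair of objects, return 1.0 if the feature
    assigns both objects the same value and 0.0 otherwise."""
    return [1.0 if a == b else 0.0 for a, b in combinations(values, 2)]


def feature_agreement(f, g):
    """Cosine similarity between the pairwise indicators of two features.
    (Assumed measure; the abstract does not specify one.)"""
    u, v = pair_similarities(f), pair_similarities(g)
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_candidates(user_feature, candidates):
    """Sort candidate features by agreement with the user-chosen feature,
    most similar first."""
    scores = {name: feature_agreement(user_feature, vals)
              for name, vals in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: six objects described by three categorical features.
user_feature = ["db", "db", "ai", "ai", "db", "ai"]
candidates = {
    "conference": ["vldb", "vldb", "icml", "icml", "sigmod", "icml"],
    "year_parity": ["odd", "even", "odd", "even", "odd", "even"],
}

ranking = rank_candidates(user_feature, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data, "conference" groups objects much as the user-chosen feature does and so ranks above "year_parity", which cuts across the user's groups. A real multi-relational setting would also need to propagate feature values across joined relations, which this sketch leaves out.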

Author information

Corresponding author

Correspondence to Xiaoxin Yin.

Additional information

Responsible editor: Eamonn Keogh.

The work was supported in part by the U.S. National Science Foundation under grants NSF IIS-03-13678 and NSF BDI-05-15813, and by an IBM Faculty Award. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

About this article

Cite this article

Yin, X., Han, J. & Yu, P.S. CrossClus: user-guided multi-relational clustering. Data Min Knowl Disc 15, 321–348 (2007). https://doi.org/10.1007/s10618-007-0072-z

  • DOI: https://doi.org/10.1007/s10618-007-0072-z