Density-based semi-supervised clustering

Ruiz, Carlos; Spiliopoulou, Myra; Menasalvas, Ernestina

doi:10.1007/s10618-009-0157-y

Published: 21 November 2009

Volume 21, pages 345–370 (2010)

Data Mining and Knowledge Discovery

Abstract

Semi-supervised clustering methods guide the data partitioning and grouping process by exploiting background knowledge, for instance in the form of constraints. In this study, we propose a semi-supervised density-based clustering method. Density-based algorithms are traditionally used in applications where the anticipated groups are expected to assume non-spherical shapes and/or differ in cardinality or density. Many such applications, among them those on GIS, lend themselves to constraint-based clustering, because there is a priori knowledge on the group membership of some records. In fact, constraints might be the only way to prevent the formation of clusters that do not conform to the applications’ semantics. For example, geographical objects, e.g. houses, separated by a borderline or a river may not be assigned to the same cluster, regardless of their physical proximity. We first provide an overview of constraint-based clustering for different families of clustering algorithms. Then, we concentrate on the family of density-based algorithms and select the algorithm DBSCAN, which we enhance with Must-Link and Cannot-Link constraints. Our enhancement is seamless: we allow DBSCAN to build temporary clusters, which we then split or merge according to the constraints. Our experiments on synthetic and real datasets show that our approach improves the performance of the algorithm.
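
The split-or-merge idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details are not given here: a plain brute-force DBSCAN builds temporary clusters, and a hypothetical helper `apply_constraints` then splits a cluster that contains both ends of a Cannot-Link pair and merges the clusters of a Must-Link pair. The split rule (each member joins the nearer end of the pair), the handling of noise, and all function and parameter names are assumptions made for illustration only.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Brute-force eps-neighborhood (includes the point itself).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reach)
    return labels

def apply_constraints(points, labels, must_link, cannot_link):
    """Hypothetical post-processing: split on Cannot-Link, then merge on Must-Link."""
    labels = list(labels)
    # Split: if a temporary cluster holds both ends of a Cannot-Link pair (a, b),
    # move the members that lie closer to b into a fresh cluster (assumed rule).
    for a, b in cannot_link:
        if labels[a] == labels[b] != -1:
            old, new = labels[a], max(labels) + 1
            for i in range(len(points)):
                if labels[i] == old and dist(points[i], points[b]) < dist(points[i], points[a]):
                    labels[i] = new
    # Merge: join the clusters of a Must-Link pair unless the merged cluster
    # would contain both ends of some Cannot-Link pair.
    for a, b in must_link:
        keep, drop = labels[a], labels[b]
        if keep == drop or -1 in (keep, drop):
            continue                  # already together, or noise (skipped in this sketch)
        merged = {keep, drop}
        if any(labels[c] in merged and labels[d] in merged for c, d in cannot_link):
            continue
        labels = [keep if lab == drop else lab for lab in labels]
    return labels

# Two groups on a line: indices 0-3 and 4-6.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (12, 0)]
temp = dbscan(points, eps=1.5, min_pts=2)
final = apply_constraints(points, temp, must_link=[(3, 4)], cannot_link=[(0, 3)])
print(temp)   # [0, 0, 0, 0, 1, 1, 1]
print(final)  # [0, 0, 2, 2, 2, 2, 2]
```

In this toy run, the Cannot-Link pair (0, 3) splits the first temporary cluster, and the Must-Link pair (3, 4) then joins the split-off half with the distant second group, which distance alone would never connect. A practical implementation would likely replace the quadratic neighborhood search with a spatial index.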

Author information

Authors and Affiliations

Facultad de Informática, Universidad Politecnica de Madrid, Madrid, Spain
Carlos Ruiz & Ernestina Menasalvas
Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
Myra Spiliopoulou

Corresponding author

Correspondence to Myra Spiliopoulou.

Additional information

Responsible editor: Eamonn Keogh.

Part of Ernestina Menasalvas' work was funded by the Spanish ministry under project grant TIN2008-05924.

About this article

Cite this article

Ruiz, C., Spiliopoulou, M. & Menasalvas, E. Density-based semi-supervised clustering. Data Min Knowl Disc 21, 345–370 (2010). https://doi.org/10.1007/s10618-009-0157-y

Received: 02 February 2008
Accepted: 31 October 2009
Published: 21 November 2009
Issue Date: November 2010
DOI: https://doi.org/10.1007/s10618-009-0157-y
