The complexity of non-hierarchical clustering with instance and cluster level constraints

Davidson, Ian; Ravi, S. S.

doi:10.1007/s10618-006-0053-7

Published: 26 January 2007

Volume 14, pages 25–61, (2007)

Data Mining and Knowledge Discovery

Ian Davidson¹ & S. S. Ravi¹


Abstract

Recent work has looked at extending clustering algorithms with instance-level must-link (ML) and cannot-link (CL) background information. Our work introduces δ and ε cluster-level constraints that influence inter-cluster distances and cluster composition. The addition of background information, though useful for providing better clustering results, raises the important feasibility question: given a collection of constraints and a set of data, does there exist at least one partition of the data set satisfying all the constraints? We study the complexity of the feasibility problem for each of the above constraints separately and also for combinations of constraints. Our results clearly delineate combinations of constraints for which the feasibility problem is computationally intractable (i.e., NP-complete) from those for which the problem is efficiently solvable (i.e., in the computational class P). We also consider the ML and CL constraints in conjunctive and disjunctive normal forms (CNF and DNF, respectively). We show that for ML constraints, the feasibility problem is intractable for CNF but efficiently solvable for DNF. Unfortunately, for CL constraints, the feasibility problem is intractable for both CNF and DNF. This effectively means that CL constraints in a non-trivial form cannot be efficiently incorporated into clustering algorithms. To overcome this, we introduce the notion of a choice-set of constraints and prove that the feasibility problem for choice-sets is efficiently solvable for both ML and CL constraints. We also present empirical results which indicate that the feasibility problem occurs extensively in real-world problems.
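The feasibility question in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it simply checks an assignment against plain (non-CNF, non-DNF) ML and CL pairs and then searches all assignments exhaustively, which takes time exponential in the number of points. The names `satisfies` and `feasible_partition` are hypothetical, and the sketch assumes that "feasible" means an assignment of `n` points to at most `k` clusters (empty clusters allowed); the paper's exact bounds on the number of clusters are not given in the abstract.

```python
from itertools import product


def satisfies(labels, must_link, cannot_link):
    """Return True if the cluster assignment honours every ML and CL pair.

    labels[i] is the cluster index of point i; constraints are (i, j) pairs.
    """
    ml_ok = all(labels[a] == labels[b] for a, b in must_link)
    cl_ok = all(labels[a] != labels[b] for a, b in cannot_link)
    return ml_ok and cl_ok


def feasible_partition(n, k, must_link, cannot_link):
    """Brute-force feasibility check (illustrative assumption, exponential in n).

    Tries every assignment of n points to at most k clusters and returns the
    first one that satisfies all constraints, or None if none exists.
    """
    for labels in product(range(k), repeat=n):
        if satisfies(labels, must_link, cannot_link):
            return labels
    return None


if __name__ == "__main__":
    # Points 0 and 1 must share a cluster; points 1 and 2 must not.
    print(feasible_partition(3, 2, [(0, 1)], [(1, 2)]))
    # Three mutually cannot-linked points cannot fit into two clusters.
    print(feasible_partition(3, 2, [], [(0, 1), (1, 2), (0, 2)]))
```

Checking a single assignment is cheap, while the search over assignments is what becomes expensive. With only CL pairs, the question amounts to asking whether a graph can be coloured with `k` colours, which is consistent with the intractability the abstract reports for CL constraints.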


References

Bansal N, Blum A, Chawla S (2002) Correlation clustering. In: Proc. 43rd annual IEEE symposium on Foundations of Computer Science (FOCS-2002), pp 238–246
Basu S, Banerjee A, Mooney R (2002) Semi-supervised learning by seeding. In: Proc. 19th Intl. Conf. on Machine Learning (ICML-2002). Sydney, Australia, pp 19–26
Basu S, Bilenko M, Mooney R (2004a) A probabilistic framework for semi-supervised clustering. In: Proc. 10th ACM SIGKDD intl. conf. on knowledge discovery and data mining (KDD-2004). Seattle, WA, pp 59–68
Basu S, Bilenko M, Mooney R (2004b) Active semi-supervision for pairwise constrained clustering. In: Proc. 4th SIAM intl. conf. on data mining (SDM-2004), pp 333–344
Bilenko M, Basu S, Mooney R (2004) Integrating constraints and metric learning in semi-supervised clustering. In: Proc. 21st international conference on machine learning (ICML-2004), pp 11–18
Bradley P, Fayyad U (1998) Refining initial points for K-Means clustering. In: Proc. 15th intl. conf. on machine learning (ICML-1998), pp 91–99
Campers G, Henkes O, Leclerq P (1987) Graph coloring heuristics: a survey, some new propositions and computational experiences on random and Leighton’s graphs. In: Proc. Operational Research ’87. Buenos Aires, pp 917–932
Charikar M, Guruswami V, Wirth A (2003) Clustering with qualitative information. In: Proc. 44th Annual IEEE symposium on foundations of computer science (FOCS-2003), pp 524–533
Cooper GF (1990) The computational complexity of probabilistic inference using Bayesian belief networks. Artif Intell 42(2–3):393–405
Cormen T, Leiserson C, Rivest R, Stein C (2001) Introduction to algorithms, 2nd edn. MIT Press and McGraw-Hill, Cambridge, MA
Davidson I, Ravi SS (2005a) Clustering with constraints feasibility issues and the k-Means algorithm. In: Proc. 2005 SIAM International Conference on Data Mining (SDM’05). Newport Beach, CA, pp 138–149
Davidson I, Ravi SS (2005b) Hierarchical clustering with constraints: theory and practice. In: Proc. 9th European principles and practice of KDD (PKDD’05). Porto, Portugal, pp 59–70
Dyer M, Frieze A (1986) Planar 3DM is NP-Complete. J Algorithms :174–184
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd intl. conf. on knowledge discovery and data mining (KDD-96). Portland, OR, pp 226–231
Feige U, Kilian J (1998) Zero knowledge and the chromatic number. J Comput Syst Sci 57:187–199
Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. W H Freeman and Co., San Francisco, CA
Gonzalez T (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38(2–3):293–306
Hansen P, Jaumard B (1997) Cluster analysis and mathematical programming. Math Program 79:191–215
Hertz A, de Werra D (1987) Using Tabu search techniques for graph coloring. Computing 39:345–351
Klein D, Kamvar S, Manning C (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proc. 19th intl. conf. on machine learning (ICML 2002). Sydney, Australia, July, pp 307–314
Pelleg D, Moore A (1999) Accelerating exact k-means algorithms with geometric reasoning. In: Proc. ACM SIGKDD Intl. conf. on knowledge discovery and data mining. San Diego, CA, pp 277–281
Tamassia R, Tollis I (1989) Planar grid embedding in linear time. IEEE Trans Circuits Syst CAS-36(9):1230–1234
Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proc. 17th intl. conf. on machine learning (ICML 2000). Stanford, CA, pp 1103–1110
Wagstaff K, Cardie C, Rogers S, Schroedl S (2001) Constrained K-means clustering with background knowledge. In: Proc. 18th intl. conf. on machine learning (ICML 2001). Williamstown, MA, pp 577–584
Wagstaff K (2002) Intelligent clustering with instance-level constraints. Ph.D. thesis, Department of Computer Science, Cornell University, Ithaca, NY, Chapter 3, pp 50–51
West DB (2001) Introduction to Graph Theory, 2nd edn. Prentice Hall, Inc., Englewood Cliffs, NJ
Wijsen J, Meersman R (1998) On the complexity of mining quantitative association rules. J Data Mining Knowl Discovery 2(3):263–281

Author information

Authors and Affiliations

Department of Computer Science, University at Albany - State University of New York, Albany, NY, 12222, USA
Ian Davidson & S. S. Ravi

Corresponding author

Correspondence to Ian Davidson.

Additional information

Responsible editor: Charu Aggarwal.

A conference version containing some of the results in this paper appeared as Davidson and Ravi (2005a).

About this article

Cite this article

Davidson, I., Ravi, S.S. The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Min Knowl Disc 14, 25–61 (2007). https://doi.org/10.1007/s10618-006-0053-7

Received: 24 October 2005
Accepted: 13 June 2006
Published: 26 January 2007
Issue Date: February 2007
DOI: https://doi.org/10.1007/s10618-006-0053-7
