Skip to main content
Log in

Overlapping correlation clustering

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

We introduce a new approach for finding overlapping clusters given pairwise similarities of objects. In particular, we relax the problem of correlation clustering by allowing an object to be assigned to more than one cluster. At the core of our approach is an optimization problem in which each data point is mapped to a small set of labels, representing membership in different clusters. The objective is to find a mapping so that the given similarities between objects agree as much as possible with similarities taken over their label sets. The number of labels can vary across objects. To define a similarity between label sets, we consider two measures: (i) a 0–1 function indicating whether the two label sets have non-zero intersection and (ii) the Jaccard coefficient between the two label sets. The algorithm we propose is an iterative local-search method. The definitions of label set similarity give rise to two non-trivial optimization problems, which, for the measures of set-intersection and Jaccard, we solve using a greedy strategy and non-negative least squares, respectively. We also develop a distributed version of our algorithm based on the BSP model and implement it using a Pregel framework. Our algorithm uses as input pairwise similarities of objects and can thus be applied when clustering structured objects for which feature vectors are not available. As a proof of concept, we apply our algorithms on three different and complex application domains: trajectories, amino-acid sequences, and textual documents.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Ailon N, Charikar M, Newman A (2005) Aggregating inconsistent information: ranking and clustering. In: Proceedings of the ACM symposium on theory of computing (STOC)

  2. Ailon N, Liberty E (2009) Correlation clustering revisited: the “true“ cost of error minimization problems. In: Automata, languages and programming, 36th international colloquium (ICALP)

  3. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3): 403–410

    Google Scholar 

  4. Arabie P, Carroll JD, DeSarbo W, Wind J (1981) Overlapping clustering: a new method for product positioning. J Mark Res 18(3):310–317

    Google Scholar 

  5. Banerjee A, Krumpelman C, Ghosh J, Basu S, Mooney RJ (2005) Model-based overlapping clustering. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD)

  6. Bansal N, Blum A, Chawla S (2004) Correlation clustering. Mach Learn 56(1–3):89–113

    Google Scholar 

  7. Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. In: Proceedings of the Fourth SIAM international conference on data mining (SDM)

  8. Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD)

  9. Battle A, Segal E, Koller D (2004) Probabilistic discovery of overlapping cellular processes and their regulation. In: Proceedings of the 8th international conference on research in computational molecular biology (RECOMB)

  10. Bezdek, JC, Pal, SK (eds) (1992) Fuzzy models for pattern recognition—methods that search for structures in data. IEEE Press, New York

    Google Scholar 

  11. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw 30(1–7): 107–117

    Google Scholar 

  12. Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations. In: Proceedings of the 13th annual ACM symposium on theory of computing (STOC)

  13. Charikar M, Guruswami V, Wirth A (2003) Clustering with qualitative information. In: Proceedings of the IEEE symposium on foundations of computer science (FOCS)

  14. Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data (SIGMOD’05)

  15. Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the eighth international conference on intelligent systems for molecular biology (ISMB)

  16. Chierichetti F, Kumar R, Pandey S, Vassilvitskii S (2010) Finding the jaccard median. In: Proceedings of the 21st annual ACM-SIAM symposium on discrete algorithms (SODA)

  17. Coe PK, Johnson BK, Stewart KM, Kie JG (2004) Spatial and temporal interactions of elk, mule deer, and cattle. In: Transactions of the 69th North American wildlife and natural resources conference, pp 656–669

  18. Davidson I, Ravi SS (2005) Clustering with constraints: feasibility issues and the k-means algorithm. In: Proceedings of the Fifth SIAM international conference on data mining (SDM)

  19. Davidson I, Ravi SS (2007) Intractability and clustering with constraints. In: Proceedings of the 24th international conference on machine learning (ICML)

  20. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1): 107–113

    Article  Google Scholar 

  21. Demaine ED, Emanuel D, Fiat A, Immorlica N (2006) Correlation clustering in general weighted graphs. Theor Comput Sci 361:172–187

    Google Scholar 

  22. Ding C, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the SIAM data mining conference

  23. Fortunato S (2010) Community detection in graphs. Phys Rep 486:75–174

    Google Scholar 

  24. Fu Q, Banerjee A (2008) Multiplicative mixture models for overlapping clustering. In: Proceedings of the 8th IEEE international conference on data mining (ICDM)

  25. Fu Q, Banerjee A (2009) Bayesian overlapping subspace clustering. In: Proceedings of the 9th IEEE international conference on data mining (ICDM)

  26. Gaffney S, Smyth P (1999) Trajectory clustering with mixtures of regression models. In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’99

  27. Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman & Co., San Francisco

    MATH  Google Scholar 

  28. Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. TKDD 1(1):Article 4

  29. Giotis I, Guruswami V (2006) Correlation clustering with a fixed number of clusters. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms (SODA)

  30. Hasan M, Salem S, Zaki M (2011) Simclus: an effective algorithm for clustering with a lower bound on similarity. Knowl Inf Syst 28: 665–685

    Article  Google Scholar 

  31. Hathaway RJ, Davenport JW, Bezdek JC (1989) Relational duals of the c-means clustering algorithms. Pattern Recognit 22(2): 205–212

    Article  MathSciNet  MATH  Google Scholar 

  32. Hathaway RJ, Hu Y (2009) Density-weighted fuzzy c-means clustering. IEEE T Fuzzy Syst 17(1): 243–252

    Article  Google Scholar 

  33. He Z, Xie S, Zdunek R, Zhou G, Cichocki A (2011) Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering. IEEE Trans Neural Netw 22(12): 2117–2131

    Article  Google Scholar 

  34. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 13th annual ACM symposium on theory of computing (STOC)

  35. Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the nineteenth international conference on machine learning (ICML)

  36. Kobayashi M, Aono M (2006) Exploring overlapping clusters using dynamic re-scaling and sampling. Knowl Inf Syst 10: 295–313

    Article  Google Scholar 

  37. Lee DD, Seung HS (2001) Algorithms for Non-negative Matrix Factorization. In: Advances in neural information processing systems 13:556–562

  38. Lee JG, Han J, Whang KY (2007) Trajectory clustering: a partition-and-group framework. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, SIGMOD ’07

  39. Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1(1):24–45

    Google Scholar 

  40. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: SIGMOD conference, pp 135–146

  41. Mei JP, Chen L (2010) Fuzzy clustering with weighted medoids for relational data. Pattern Recognit 43(5): 1964–1974

    Article  MATH  Google Scholar 

  42. Miettinen P (2008) On the positive-negative partial set cover problem. Inf Process Lett 108(4):219–221

    Google Scholar 

  43. Murzin A, Brenner S, Hubbard T, Chothia C (1995) Scop—a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4): 536–540

    Google Scholar 

  44. Nanni M, Pedreschi D (2006) Time-focused clustering of trajectories of moving objects. J Intell Inf Syst 27(3): 267–289

    Article  Google Scholar 

  45. Nepusz T, Sasidharan R, Paccanaro A (2010) Scps: a fast implementation of a spectral method for detecting protein families on a genome-wide scale. BMC Bioinform 11(1): 120

    Article  Google Scholar 

  46. Paccanaro A, Casbon JA, Saqi MAS (2006) Spectral clustering of protein sequences. Nucleic Acids Res 34(5): 1571–1580

    Article  Google Scholar 

  47. Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature

  48. Scheinerman ER, Tucker K (2010) Modeling graphs using dot product representations. Comput Stat 25(1):1–16

    Google Scholar 

  49. Scripps J, Tan PN (2006) Clustering in the presence of bridge-nodes. In: Proceedings of the sixth SIAM international conference on data mining (SDM)

  50. Segal E, Battle A, Koller D (2003) Decomposing gene expression into cellular processes. In: Proceedings of the 8th Pacific symposium on biocomputing (PSB)

  51. Shafiei MM, Milios EE (2006) Model-based overlapping co-clustering. In: Proceedings of the fourth workshop on text mining

  52. Shepard RN, Arabie P (1979) Additive clustering: representation of similarities as combinations of discrete overlapping properties. Psychol Rev 86(2):87–123

    Google Scholar 

  53. Swamy C (2004) Correlation clustering: maximizing agreements via semidefinite programming. In: Proceedings of the ACM-SIAM symposium on discrete algorithms (SODA)

  54. Tang L, Liu H (2009) Scalable learning of collective behavior based on sparse social dimensions. In: Proceedings of the 18th ACM conference on information and knowledge management (CIKM)

  55. Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8): 103–111

    Article  Google Scholar 

  56. Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning (ICML)

  57. Wagstaff K, Cardie C, Rogers S, Schrödl S (2001) Constrained k-means clustering with background knowledge. In: Proceedings of the 18th international conference on machine learning (ICML)

  58. Wang X, Tang L, Gao H, Liu H (2010) Discovering overlapping groups in social media. In: The 10th IEEE international conference on data mining (ICDM)

  59. Xiong H, Steinbach M, Ruslim A, Kumar V (2009) Characterizing pattern preserving clustering. Knowl Inf Syst 19: 311–336

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Antti Ukkonen.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bonchi, F., Gionis, A. & Ukkonen, A. Overlapping correlation clustering. Knowl Inf Syst 35, 1–32 (2013). https://doi.org/10.1007/s10115-012-0522-9

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-012-0522-9

Keywords

Navigation