
Scalability of correlation clustering

  • Theoretical Advances
  • Published in: Pattern Analysis and Applications

Abstract

The problem of scalability of correlation clustering (CC) is addressed by reducing the number of variables involved in its semidefinite programming (SDP) formulation. In place of the SDP, a nonlinear programming formulation with far fewer variables is obtained and solved with the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method. We demonstrate the potential of the nonlinear formulation on large graph datasets having more than ten thousand vertices and nine million edges, and show experimentally that the scalable formulation does not compromise the quality of the obtained clusters. We compare the results of the scalable formulation with those of the original CC formulation, with a constrained spectral clustering method that uses the edge labels of the graph and differs only in how clusters are obtained by defining a cut on the given graph, and with a variant of constrained spectral clustering known as self-taught spectral clustering.
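The variable reduction described in the abstract can be sketched with a toy low-rank reformulation in the spirit of the Burer–Monteiro approach: the SDP matrix variable \(X \succeq 0\) with unit diagonal is replaced by \(X = VV^{\top}\), where each row of \(V\) lies on the unit sphere, and the resulting nonlinear objective is passed to an L-BFGS routine. This is a minimal illustrative sketch, not the authors' implementation; the maximize-agreements objective, the rank parameter `k`, and the use of SciPy's `minimize` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def cc_lowrank(n, pos_edges, neg_edges, k=5, seed=0):
    """Low-rank (Burer-Monteiro style) relaxation of the correlation
    clustering SDP: X = V V^T with unit-norm rows, solved by L-BFGS.

    pos_edges / neg_edges: lists of (i, j) pairs labelled '+' / '-'.
    Returns an n x k matrix V whose rows lie on the unit sphere.
    """
    rng = np.random.default_rng(seed)
    v0 = rng.standard_normal((n, k))

    def normalize(V):
        # Project each row onto the unit sphere (the ||v_i|| = 1 constraint).
        return V / np.linalg.norm(V, axis=1, keepdims=True)

    def neg_agreements(flat):
        V = normalize(flat.reshape(n, k))
        x = V @ V.T                       # x[i, j] plays the role of X_ij
        agree = sum(x[i, j] for i, j in pos_edges) \
              + sum(1.0 - x[i, j] for i, j in neg_edges)
        return -agree                     # minimize the negated agreements

    # Only n*k variables instead of the n(n+1)/2 entries of X.
    res = minimize(neg_agreements, v0.ravel(), method="L-BFGS-B")
    return normalize(res.x.reshape(n, k))
```

A rounding step (e.g. k-means on the rows of `V`) would then recover clusters from the optimized low-rank factor.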



Notes

  1. For a maximization problem, a \(\lambda\)-approximation algorithm A for the problem with cost function c is an algorithm such that, for every input I, the solution A(I) satisfies \(c(A(I))\ge \lambda \max_S c(S)\) for some \(\lambda \le 1\).

  2. The approximation value \(\lambda\) is the ratio of the cost of the solution produced by the approximation algorithm to the cost of an optimal solution; for a maximization problem \(\lambda \le 1\), and for a minimization problem \(\lambda \ge 1.\)

  3. By a large dataset we mean a graph with a large number of nodes as well as edges.
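As a worked instance of note 2 (the numbers are assumed, purely for illustration): an algorithm that achieves 80 agreements on an input whose optimum is 100 agreements has approximation value

\[
\lambda \;=\; \frac{c(A(I))}{\max_S c(S)} \;=\; \frac{80}{100} \;=\; 0.8 \;\le\; 1,
\]

consistent with the guarantee \(c(A(I))\ge \lambda \max_S c(S)\) of note 1.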


Author information


Corresponding author

Correspondence to Mamata Samal.


About this article


Cite this article

Samal, M., Saradhi, V.V. & Nandi, S. Scalability of correlation clustering. Pattern Anal Applic 21, 703–719 (2018). https://doi.org/10.1007/s10044-017-0598-7

