Cluster ranking with an application to mining mailbox networks

  • Regular Paper
Knowledge and Information Systems

Abstract

We initiate the study of a new clustering framework, called cluster ranking. Rather than simply partitioning a network into clusters, a cluster ranking algorithm also orders the clusters by their strength. To this end, we introduce a novel strength measure for clusters, the integrated cohesion, which is applicable to arbitrary weighted networks. We then present a new cluster ranking algorithm, called C-Rank. We provide extensive theoretical and empirical analysis of C-Rank and show that it is likely to have high precision and recall. A main component of C-Rank is a heuristic algorithm for finding sparse vertex separators. At the core of this algorithm is a new connection between vertex betweenness and multicommodity flow. Our experiments focus on mining mailbox networks. A mailbox network is an egocentric social network, consisting of the contacts with whom an individual exchanges email. Edges between contacts represent the frequency of their co-occurrence on message headers. C-Rank is well suited to mining such networks, since they abound with overlapping communities of highly variable strengths. We demonstrate the effectiveness of C-Rank on the Enron data set, consisting of 130 mailbox networks.
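
To make the notion of a mailbox network concrete, the following is a minimal Python sketch of how such a weighted network might be assembled from message headers. It is an illustration under stated assumptions, not the construction used in the paper: the function name `build_mailbox_network`, the representation of a header as a list of addresses, the exclusion of the mailbox owner from the vertex set, and the use of raw co-occurrence counts as edge weights are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def build_mailbox_network(messages, owner):
    """Count how often each pair of contacts appears on the same header.

    messages: iterable of headers, each a list of e-mail addresses
              (sender and recipients together) -- an assumed input format.
    owner:    address of the mailbox owner, excluded here by assumption.
    Returns a Counter mapping a sorted (contact_a, contact_b) pair to its
    co-occurrence count, used as the edge weight.
    """
    weights = Counter()
    owner = owner.lower()
    for header in messages:
        # Deduplicate addresses within one header and drop the owner.
        contacts = sorted({addr.lower() for addr in header} - {owner})
        # Every unordered pair of contacts on this header co-occurs once.
        for a, b in combinations(contacts, 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy mailbox with three messages.
messages = [
    ["me@example.com", "ann@example.com", "bob@example.com"],
    ["ann@example.com", "bob@example.com", "me@example.com", "cy@example.com"],
    ["me@example.com", "cy@example.com"],
]
network = build_mailbox_network(messages, "me@example.com")
for (a, b), w in sorted(network.items()):
    print(a, b, w)
```

In this toy mailbox, the pair that shares two headers receives a heavier edge than the pairs that share only one, which is the kind of variation in strength that a cluster ranking is meant to surface.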

Author information

Corresponding author

Correspondence to Ido Guy.

About this article

Cite this article

Bar-Yossef, Z., Guy, I., Lempel, R. et al. Cluster ranking with an application to mining mailbox networks. Knowl Inf Syst 14, 101–139 (2008). https://doi.org/10.1007/s10115-007-0096-0

  • DOI: https://doi.org/10.1007/s10115-007-0096-0
