Abstract
We initiate the study of a new clustering framework, called cluster ranking. Rather than simply partitioning a network into clusters, a cluster ranking algorithm also orders the clusters by their strength. To this end, we introduce a novel strength measure for clusters, the integrated cohesion, which is applicable to arbitrary weighted networks. We then present a new cluster ranking algorithm, called C-Rank. We provide extensive theoretical and empirical analysis of C-Rank and show that it is likely to have high precision and recall. A main component of C-Rank is a heuristic algorithm for finding sparse vertex separators. At the core of this algorithm is a new connection between vertex betweenness and multicommodity flow. Our experiments focus on mining mailbox networks. A mailbox network is an egocentric social network, consisting of the contacts with whom an individual exchanges email. Edges between contacts are weighted by the frequency of their co-occurrence in message headers. C-Rank is well suited to mining such networks, since they abound with overlapping communities of highly variable strength. We demonstrate the effectiveness of C-Rank on the Enron data set, consisting of 130 mailbox networks.
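As an illustration of the mailbox network described above, the following is a minimal sketch of how such a network might be built from message headers. It is not the authors' implementation: the function name `build_mailbox_network`, the input format (one list of participants per message), the exclusion of the mailbox owner, and the use of a raw co-occurrence count as the edge weight are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def build_mailbox_network(messages, owner):
    """Return a Counter mapping frozenset({a, b}) to a co-occurrence count.

    Assumptions (not from the paper): `messages` is an iterable of
    participant lists (sender plus recipients of one message), and the
    mailbox owner is left out of the network.
    """
    weights = Counter()
    for participants in messages:
        # Deduplicate participants and drop the owner of the mailbox.
        contacts = sorted(set(participants) - {owner})
        # Every pair of contacts on the same header co-occurs once.
        for a, b in combinations(contacts, 2):
            weights[frozenset((a, b))] += 1
    return weights


# Hypothetical toy mailbox with three messages.
headers = [
    ["me", "alice", "bob"],
    ["me", "alice", "bob", "carol"],
    ["me", "carol"],
]
network = build_mailbox_network(headers, owner="me")
```

In this toy mailbox, `alice` and `bob` appear together on two headers, so their edge is the strongest. The resulting weighted graph is the kind of input a cluster ranking algorithm would take.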
About this article
Cite this article
Bar-Yossef, Z., Guy, I., Lempel, R. et al. Cluster ranking with an application to mining mailbox networks. Knowl Inf Syst 14, 101–139 (2008). https://doi.org/10.1007/s10115-007-0096-0
DOI: https://doi.org/10.1007/s10115-007-0096-0