Cluster ranking with an application to mining mailbox networks

Bar-Yossef, Ziv; Guy, Ido; Lempel, Ronny; Maarek, Yoëlle S.; Soroka, Vladimir

doi:10.1007/s10115-007-0096-0

Regular Paper
Published: 25 August 2007

Knowledge and Information Systems, Volume 14, pages 101–139 (2008)

Ziv Bar-Yossef^1,4,
Ido Guy^2,3,
Ronny Lempel^3,
Yoëlle S. Maarek^4 &
Vladimir Soroka^3

Abstract

We initiate the study of a new clustering framework, called cluster ranking. Rather than simply partitioning a network into clusters, a cluster ranking algorithm also orders the clusters by their strength. To this end, we introduce a novel strength measure for clusters, the integrated cohesion, which is applicable to arbitrary weighted networks. We then present a new cluster ranking algorithm, called C-Rank. We provide extensive theoretical and empirical analysis of C-Rank and show that it is likely to have high precision and recall. A main component of C-Rank is a heuristic algorithm for finding sparse vertex separators. At the core of this algorithm is a new connection between vertex betweenness and multicommodity flow. Our experiments focus on mining mailbox networks. A mailbox network is an egocentric social network, consisting of the contacts with whom an individual exchanges email. Edges between contacts represent the frequency of their co-occurrence in message headers. C-Rank is well suited to mining such networks, since they abound in overlapping communities of highly variable strengths. We demonstrate the effectiveness of C-Rank on the Enron data set, consisting of 130 mailbox networks.
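The mailbox network described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' construction: the function name `build_mailbox_network`, the use of a raw co-occurrence count as the edge weight, the choice to leave the mailbox owner out of the graph, and the example addresses are all assumptions made for illustration. The abstract says only that edges represent co-occurrence frequency, so the paper may normalize or filter the weights differently.

```python
from collections import Counter
from itertools import combinations


def build_mailbox_network(messages, owner):
    """Return {(a, b): weight}, counting header co-occurrences of contacts.

    `messages` is an iterable of participant lists (sender plus recipients)
    taken from message headers. The mailbox owner is dropped (an assumption):
    in an egocentric network the owner is implicitly tied to every contact.
    """
    weights = Counter()
    for participants in messages:
        # Deduplicate and sort so each pair has one canonical key (a < b).
        contacts = sorted(set(participants) - {owner})
        for a, b in combinations(contacts, 2):
            weights[(a, b)] += 1  # raw count as weight (assumption)
    return dict(weights)


# Hypothetical headers from one mailbox.
messages = [
    ["me@example.com", "ann@example.com", "bob@example.com"],
    ["ann@example.com", "me@example.com", "bob@example.com", "cat@example.com"],
    ["me@example.com", "cat@example.com"],
]
network = build_mailbox_network(messages, owner="me@example.com")
```

In this toy mailbox, Ann and Bob appear together on two headers, so their edge has weight 2, while the third message contributes no edge because it has a single contact besides the owner.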

Author information

Authors and Affiliations

Department of Electrical Engineering, Technion, Haifa, Israel
Ziv Bar-Yossef
Department of Computer Science, Technion, Haifa, Israel
Ido Guy
IBM Research Lab in Haifa, 31905, Haifa, Israel
Ido Guy, Ronny Lempel & Vladimir Soroka
Google, Haifa Engineering Center, Haifa, Israel
Ziv Bar-Yossef & Yoëlle S. Maarek

Corresponding author

Correspondence to Ido Guy.

About this article

Cite this article

Bar-Yossef, Z., Guy, I., Lempel, R. et al. Cluster ranking with an application to mining mailbox networks. Knowl Inf Syst 14, 101–139 (2008). https://doi.org/10.1007/s10115-007-0096-0

Received: 04 March 2007
Accepted: 28 April 2007
Published: 25 August 2007
Issue Date: January 2008
DOI: https://doi.org/10.1007/s10115-007-0096-0
