ABSTRACT
Graph clustering research has largely focused on undirected graphs; however, the graphs from a number of important domains are inherently directed, e.g. networks of web pages, research papers, and Twitter users. This paper investigates various ways of symmetrizing a directed graph into an undirected graph so that previous work on clustering undirected graphs can then be leveraged. Recent work on clustering directed graphs has generalized objective functions such as conductance to directed graphs and minimized them using spectral methods. We show that more meaningful clusters (as measured by an external ground-truth criterion) can be obtained by symmetrizing the graph using measures that capture in-link and out-link similarity, such as bibliographic coupling and co-citation strength. However, applying these similarity measures directly to modern large-scale power-law networks is problematic because of hub nodes, which become connected to the vast majority of the network in the transformed undirected graph. We carefully analyze this problem and propose a Degree-discounted similarity measure that is much better suited to large-scale networks. We validate the approach with extensive experiments.
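The symmetrizations named in the abstract can be illustrated with a minimal sketch. With a binary adjacency matrix A, bibliographic coupling is conventionally A·Aᵀ (shared out-links) and co-citation is Aᵀ·A (shared in-links). The abstract does not give the formula for the Degree-discounted measure, so the version below is an assumption for illustration only: each shared neighbor is down-weighted by its degree, and each similarity is normalized by the degrees of the two endpoints, using hypothetical exponents `alpha` and `beta` that default to 0.5. The function names and the toy graph are likewise illustrative.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(m, row_w, col_w):
    # Multiply entry (i, j) by row_w[i] * col_w[j], i.e. diag(row_w) @ m @ diag(col_w).
    return [[row_w[i] * v * col_w[j] for j, v in enumerate(row)]
            for i, row in enumerate(m)]

def bibliometric(adj):
    # Bibliographic coupling (A A^T) plus co-citation (A^T A).
    at = transpose(adj)
    return add(matmul(adj, at), matmul(at, adj))

def degree_discounted(adj, alpha=0.5, beta=0.5):
    # ASSUMED form, not taken from the abstract:
    #   Do^-alpha A Di^-beta A^T Do^-alpha  +  Di^-beta A^T Do^-alpha A Di^-beta
    n = len(adj)
    at = transpose(adj)
    out_deg = [sum(row) for row in adj]
    in_deg = [sum(row) for row in at]
    do = [d ** -alpha if d > 0 else 0.0 for d in out_deg]
    di = [d ** -beta if d > 0 else 0.0 for d in in_deg]
    ones = [1.0] * n
    coupling = scale(matmul(scale(adj, ones, di), at), do, do)
    cocitation = scale(matmul(scale(at, ones, do), adj), di, di)
    return add(coupling, cocitation)

# Toy directed graph: edges 0->2, 0->3, 1->2, 1->3, 2->3.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

for row in bibliometric(adj):
    print(row)
for row in degree_discounted(adj):
    print([round(v, 3) for v in row])
```

In this toy graph, nodes 0 and 1 share out-links to nodes 2 and 3. Under the assumed discounting, the shared link to node 3 (in-degree 3) contributes less to their similarity than the shared link to node 2 (in-degree 2), which is the kind of hub down-weighting the abstract motivates. The diagonal entries are left in place here; a clustering pipeline would typically ignore or zero them.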
Index Terms
- Symmetrizations for clustering directed graphs