Abstract
In this paper, we describe PeGaSus, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, finding the connected components, and computing the importance score of nodes. As graphs grow to several giga-, tera-, or petabytes, the need for such a library grows too. To the best of our knowledge, PeGaSus is the first such library implemented on top of the Hadoop platform, the open source version of MapReduce. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially repeated matrix-vector multiplications. In this paper, we describe a key primitive for PeGaSus, called GIM-V (generalized iterated matrix-vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of edges, and (c) performance more than 5 times faster than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges.
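To make the idea of a generalized, iterated matrix-vector multiplication concrete, the following is a minimal single-machine sketch in Python. It is not the PeGaSus implementation and does not use Hadoop. The function names (`gim_v`, `iterate`) and the decomposition of the multiplication into three pluggable operations (`combine2`, `combine_all`, `assign`) are assumptions made for illustration, since the abstract does not spell them out. The sketch shows how connected components and PageRank can both be written as the same loop with different operations.

```python
from collections import defaultdict

def gim_v(edges, v, combine2, combine_all, assign):
    """One generalized matrix-vector step (illustrative, single machine).

    edges: iterable of (i, j, m_ij), i.e. the nonzero entries M[i][j].
    v:     dict mapping node id -> current vector value.
    """
    partial = defaultdict(list)
    for i, j, m_ij in edges:
        # Pair each matrix entry with the matching vector element.
        partial[i].append(combine2(m_ij, v[j]))
    # Reduce the partial results per row, then fold into the old value.
    return {i: assign(v[i], combine_all(partial.get(i, []))) for i in v}

def iterate(edges, v, ops, converged, max_iters=100):
    """Repeat gim_v until converged(old, new) holds or max_iters is hit."""
    for _ in range(max_iters):
        nxt = gim_v(edges, v, *ops)
        if converged(v, nxt):
            return nxt
        v = nxt
    return v

# Connected components: propagate the minimum node id along edges.
# Hypothetical toy graph with components {0, 1, 2} and {3, 4};
# each undirected edge is stored in both directions.
undirected = [(0, 1), (1, 2), (3, 4)]
cc_edges = [(a, b, 1) for a, b in undirected] + [(b, a, 1) for a, b in undirected]
cc_ops = (
    lambda m, vj: m * vj,                        # combine2
    lambda xs: min(xs) if xs else float("inf"),  # combine_all
    lambda old, new: min(old, new),              # assign
)
components = iterate(cc_edges, {n: n for n in range(5)}, cc_ops,
                     converged=lambda a, b: a == b)

# PageRank on a directed 3-cycle 0 -> 1 -> 2 -> 0 with damping c = 0.85.
# M[i][j] = 1 / outdegree(j) for every edge j -> i.
c, n = 0.85, 3
pr_edges = [(1, 0, 1.0), (2, 1, 1.0), (0, 2, 1.0)]
pr_ops = (
    lambda m, vj: m * vj,                        # combine2
    lambda xs: (1 - c) / n + c * sum(xs),        # combine_all
    lambda old, new: new,                        # assign
)
ranks = iterate(pr_edges, {i: 1.0 / n for i in range(n)}, pr_ops,
                converged=lambda a, b: max(abs(a[k] - b[k]) for k in a) < 1e-12)
```

In this sketch only the three small operations differ between the two tasks, while the loop over edges stays the same. That is why a single optimized primitive can serve several graph mining operations.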
Cite this article
Kang, U., Tsourakakis, C.E. & Faloutsos, C. PEGASUS: mining peta-scale graphs. Knowl Inf Syst 27, 303–325 (2011). https://doi.org/10.1007/s10115-010-0305-0
DOI: https://doi.org/10.1007/s10115-010-0305-0