Mining Tera-Scale Graphs with “Pegasus”: Algorithms and Discoveries

Kang, U; Faloutsos, Christos

doi:10.1007/978-1-4614-9242-9_3

Abstract

How do we find patterns and anomalies in graphs with billions of nodes and edges, which do not fit in memory? How can we use parallelism for such Tera- or Peta-scale graphs? We propose a carefully selected set of fundamental operations that help answer those questions, including diameter estimation, connected components, and eigenvalues. We package all these operations in Pegasus, which, to the best of our knowledge, is the first such library implemented on top of the Hadoop platform, the open source version of MapReduce. One of the key observations in this work is that many graph mining operations are essentially repeated matrix-vector multiplications. We describe a very important primitive for Pegasus, called GIM-V (Generalized Iterative Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of edges, and (c) more than nine times faster performance than the non-optimized version of GIM-V. Finally, we run experiments on real graphs. Our experiments run on M45, one of the largest Hadoop clusters available to academia. We report our findings on several real graphs, including one of the largest publicly available Web graphs with 6.7 billion edges. Some of our most impressive findings are (a) the discovery of adult advertisers in the who-follows-whom graph on Twitter, and (b) the 7 degrees of separation in the Web graph.
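To make the "repeated matrix-vector multiplication" observation concrete, here is a minimal single-machine sketch in Python. It is not the Pegasus implementation and does not use Hadoop. The function names (gim_v, matvec, connected_components) and the three pluggable operations (combine2, combine_all, assign) are assumptions chosen for illustration, since the abstract does not spell out the GIM-V interface. The idea shown is that swapping the three operations turns one generic multiplication step into different graph algorithms.

```python
from collections import defaultdict


def gim_v(edges, v, combine2, combine_all, assign):
    """One generalized matrix-vector step (illustrative, in-memory only).

    edges: iterable of (i, j, m_ij), the nonzero entries of a matrix M
    v:     dict mapping node id -> current vector value
    """
    partial = defaultdict(list)
    for i, j, m_ij in edges:
        # Combine one matrix entry with the matching vector entry.
        partial[i].append(combine2(m_ij, v[j]))
    # Reduce the partial results per row, then merge with the old value.
    return {i: assign(v[i], combine_all(partial[i])) for i in v}


def matvec(edges, v):
    """Ordinary M x v: multiply, sum, overwrite."""
    return gim_v(
        edges, v,
        combine2=lambda m, vj: m * vj,
        combine_all=sum,
        assign=lambda old, new: new,
    )


def connected_components(undirected_edges, nodes):
    """Label each node with the smallest node id in its component."""
    # Store both directions so labels can flow either way along an edge.
    edges = [(a, b, 1) for a, b in undirected_edges]
    edges += [(b, a, 1) for a, b in undirected_edges]
    v = {n: n for n in nodes}
    while True:
        nxt = gim_v(
            edges, v,
            combine2=lambda m, vj: vj,  # pass the neighbor's label along
            combine_all=lambda xs: min(xs, default=float("inf")),
            assign=min,                 # keep the smaller label
        )
        if nxt == v:                    # fixed point: labels stopped changing
            return v
        v = nxt
```

In this sketch, matvec reproduces the usual product, while connected_components reuses the same step with "take the minimum" in place of "multiply and sum" and iterates until the labels stop changing. A distributed version would have to partition the edge list and vector across machines, which this example deliberately leaves out.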

Author information

Authors and Affiliations

Department of Computer Science, KAIST University, 291 Daehak-ro, Yuseong-gu, Daejeon, 305-701, Republic of Korea
U Kang
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213, USA
Christos Faloutsos

Corresponding author

Correspondence to U Kang.

Editor information

Editors and Affiliations

IBM Research - Ireland, Mulhuddart, Ireland
Aris Gkoulalas-Divanis
IBM Research - Zurich, Rüschlikon, Switzerland
Abderrahim Labbi

About this chapter

Cite this chapter

Kang, U., Faloutsos, C. (2014). Mining Tera-Scale Graphs with “Pegasus”: Algorithms and Discoveries. In: Gkoulalas-Divanis, A., Labbi, A. (eds) Large-Scale Data Analytics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9242-9_3

DOI: https://doi.org/10.1007/978-1-4614-9242-9_3
Published: 28 November 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-9241-2
Online ISBN: 978-1-4614-9242-9
eBook Packages: Computer Science, Computer Science (R0)
