DOI: 10.1145/3366423.3380035
research-article

Scaling PageRank to 100 Billion Pages

Published: 20 April 2020

Abstract

Distributed graph processing frameworks formulate tasks as sequences of supersteps within which communication is performed asynchronously by sending messages over the graph edges. PageRank’s communication pattern is identical across all its supersteps, since each vertex sends a message along every one of its edges. We exploit this pattern to develop a new communication paradigm that allows us to exchange messages that include only edge payloads, dramatically reducing bandwidth requirements. Experiments on a web graph of 38 billion vertices and 3.1 trillion edges yield execution times of 34.4 seconds per iteration, suggesting more than an order of magnitude improvement over the state of the art.
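
The idea of payload-only messages can be illustrated with a small single-process sketch. This is not the authors' implementation; it only assumes, from the abstract, that because every vertex sends along the same edges in every superstep, sender and receiver can agree on an edge order once and afterwards exchange bare payload arrays whose positions identify the targets. The names `partition_of`, `build_channels`, and `pagerank`, the modulo partitioning, and the handling of dangling vertices are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def partition_of(vertex, num_parts):
    # Hypothetical placement rule: vertex id modulo the number of partitions.
    return vertex % num_parts


def build_channels(edges, num_parts):
    """One-time setup shared by both ends of every (sender, receiver) pair.

    For each pair of partitions the edges are put in one fixed order.
    The sender keeps only the source ids, the receiver keeps only the
    target ids, so later messages need no vertex ids at all.
    """
    send_sources = defaultdict(list)
    recv_targets = defaultdict(list)
    for src, dst in sorted(edges):
        key = (partition_of(src, num_parts), partition_of(dst, num_parts))
        send_sources[key].append(src)
        recv_targets[key].append(dst)
    return send_sources, recv_targets


def pagerank(num_vertices, edges, num_parts=2, damping=0.85, iterations=20):
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    send_sources, recv_targets = build_channels(edges, num_parts)
    rank = [1.0 / num_vertices] * num_vertices

    for _ in range(iterations):
        # Rank held by vertices without out-edges is spread uniformly
        # (an assumption of this sketch, not taken from the paper).
        dangling = sum(r for r, d in zip(rank, out_degree) if d == 0)
        incoming = [0.0] * num_vertices

        for key, sources in send_sources.items():
            # The "message" between two partitions: payloads only.
            payloads = [rank[src] / out_degree[src] for src in sources]
            # The receiver recovers each target from the payload's position.
            for dst, value in zip(recv_targets[key], payloads):
                incoming[dst] += value

        base = (1.0 - damping) / num_vertices + damping * dangling / num_vertices
        rank = [base + damping * incoming[v] for v in range(num_vertices)]
    return rank
```

In this sketch a conventional message would carry a (target id, value) pair per edge, whereas `payloads` carries one value per edge and nothing else. How the paper orders, batches, or compresses payloads at scale is not described in the abstract and is not modeled here.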

Cited By

  • (2020) Computing PageRank Scores of Web Crawl Data Using DGX A100 Clusters. 2020 IEEE High Performance Extreme Computing Conference (HPEC), pages 1-4. DOI: 10.1109/HPEC43674.2020.9286216. Online publication date: 22-Sep-2020

Published In

WWW '20: Proceedings of The Web Conference 2020
April 2020
3143 pages
ISBN: 9781450370233
DOI: 10.1145/3366423

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. graph processing
2. implicit targets
3. pagerank

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

WWW '20: The Web Conference 2020
April 20 - 24, 2020
Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%