A Semi-clustering Scheme for High Performance PageRank on Hadoop

  • Conference paper
Advances in Conceptual Modeling (ER 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8823)

Abstract

As global Internet business evolves, large-scale graphs are becoming increasingly common. Computing PageRank on such graphs with Hadoop's default data partitioning method performs poorly, because Hadoop scatters even directly connected vertices across arbitrary nodes. In this paper we propose a semi-clustering scheme that addresses this problem and improves the performance of PageRank on Hadoop. Our scheme divides a graph into a set of semi-clusters, each consisting of connected vertices, and assigns each semi-cluster to a single data partition in order to reduce the cost of data exchange between nodes during the PageRank computation. Before the PageRank computation, the semi-clusters are merged and split so that a large-scale graph is distributed evenly across the data partitions. Our semi-clustering scheme drastically improves performance: total elapsed time, including the cost of the semi-clustering computation, is reduced by up to 36%. Furthermore, the effectiveness of our scheme increases with the size of the graph.
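The partitioning idea in the abstract can be illustrated with a small sketch. It is not the authors' algorithm, which this abstract does not describe in detail. It is a minimal single-machine illustration that assumes semi-clusters can be approximated by connected groups of vertices, that oversized groups are split by a hypothetical `capacity` limit, and that small pieces are merged onto the least-loaded partition. All function names and the toy graph are assumptions made for illustration. The sketch compares the number of edges that cross partitions, a rough proxy for data exchanged between nodes, against hash-style placement by vertex id.

```python
from collections import defaultdict, deque


def connected_groups(edges):
    """Group vertices into connected sets, treating edges as undirected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:  # breadth-first traversal of one connected group
            u = queue.popleft()
            group.append(u)
            for w in sorted(adj[u]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        groups.append(group)
    return groups


def assign_partitions(groups, num_partitions, capacity):
    """Split groups larger than `capacity` (an assumed limit), then place
    each piece on the currently least-loaded partition, which merges
    small pieces onto shared partitions."""
    pieces = []
    for g in groups:
        for i in range(0, len(g), capacity):
            pieces.append(g[i:i + capacity])
    pieces.sort(key=len, reverse=True)  # place the largest pieces first
    loads = [0] * num_partitions
    owner = {}
    for piece in pieces:
        p = loads.index(min(loads))
        for v in piece:
            owner[v] = p
        loads[p] += len(piece)
    return owner


def cut_edges(edges, owner):
    """Count edges whose endpoints live on different partitions."""
    return sum(1 for u, v in edges if owner[u] != owner[v])


# Toy graph (hypothetical): two chains, 0-1-2-3 and 4-5-6-7.
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]
vertices = sorted({v for e in edges for v in e})

# Hash-style placement by vertex id, ignoring connectivity.
hash_owner = {v: v % 2 for v in vertices}
# Connectivity-aware placement with two partitions of four vertices each.
cluster_owner = assign_partitions(connected_groups(edges), 2, 4)

print(cut_edges(edges, hash_owner))     # 6 cross-partition edges
print(cut_edges(edges, cluster_owner))  # 0 cross-partition edges
```

On this toy input both placements put four vertices on each partition, but the hash-style placement cuts every edge while the connectivity-aware placement cuts none. Real graphs rarely split this cleanly, so the gap would usually be smaller.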

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Hong, S., Lee, J., Chang, J., Choi, D.H. (2014). A Semi-clustering Scheme for High Performance PageRank on Hadoop. In: Indulska, M., Purao, S. (eds) Advances in Conceptual Modeling. ER 2014. Lecture Notes in Computer Science, vol 8823. Springer, Cham. https://doi.org/10.1007/978-3-319-12256-4_4

  • DOI: https://doi.org/10.1007/978-3-319-12256-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12255-7

  • Online ISBN: 978-3-319-12256-4

  • eBook Packages: Computer Science, Computer Science (R0)
