A Semi-clustering Scheme for High Performance PageRank on Hadoop

Hong, Seungtae; Lee, Jeonghoon; Chang, Jaewoo; Choi, Dong Hoon

doi:10.1007/978-3-319-12256-4_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8823)

Included in the following conference series:

International Conference on Conceptual Modeling

Abstract

As global Internet business evolves, large-scale graphs are becoming increasingly common. PageRank computation on such graphs using Hadoop with its default data partitioning method performs poorly, because Hadoop scatters even directly connected vertices across arbitrary nodes. In this paper, we propose a semi-clustering scheme that addresses this problem and improves the performance of PageRank on Hadoop. Our scheme divides a graph into a set of semi-clusters, each consisting of connected vertices, and assigns each semi-cluster to a single data partition in order to reduce the cost of data exchange between nodes during PageRank computation. Before the PageRank computation, the semi-clusters are merged and split so that the graph is distributed evenly across the data partitions. Our semi-clustering scheme substantially improves performance: the total elapsed time, including the cost of the semi-clustering computation, is reduced by up to 36%. Furthermore, the effectiveness of our scheme increases with the size of the graph.
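
The idea in the abstract can be illustrated with a small sketch. The code below is not the authors' algorithm: the abstract does not say how semi-clusters are formed, so connected components (via union-find) are used here as a stand-in, a modulo hash stands in for Hadoop's default partitioning, and all function names are hypothetical. It groups connected vertices, splits and merges the groups into evenly sized partitions, and counts how many edges cross partition boundaries, a rough proxy for data exchanged between nodes.

```python
from collections import defaultdict

def find(parent, v):
    # Union-find lookup with path halving.
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v

def semi_clusters(vertices, edges):
    # Assumption: connected components stand in for semi-clusters.
    parent = {v: v for v in vertices}
    for u, w in edges:
        ru, rw = find(parent, u), find(parent, w)
        if ru != rw:
            parent[ru] = rw
    groups = defaultdict(list)
    for v in vertices:
        groups[find(parent, v)].append(v)
    return list(groups.values())

def balance(clusters, num_partitions):
    # Split clusters larger than the per-partition capacity, then merge
    # the pieces greedily into the currently smallest partition.
    total = sum(len(c) for c in clusters)
    cap = -(-total // num_partitions)  # ceiling division
    pieces = [c[i:i + cap] for c in clusters for i in range(0, len(c), cap)]
    pieces.sort(key=len, reverse=True)
    partitions = [[] for _ in range(num_partitions)]
    for piece in pieces:
        min(partitions, key=len).extend(piece)
    return partitions

def cut_edges(edges, assignment):
    # Count edges whose endpoints live in different partitions.
    return sum(1 for u, w in edges if assignment[u] != assignment[w])

# Toy graph: two disjoint triangles.
vertices = list(range(6))
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]

# Stand-in for default partitioning: vertex id modulo partition count.
hash_assign = {v: v % 2 for v in vertices}

partitions = balance(semi_clusters(vertices, edges), 2)
semi_assign = {v: i for i, part in enumerate(partitions) for v in part}

print("cut edges, hash partitioning:", cut_edges(edges, hash_assign))
print("cut edges, semi-clustering:  ", cut_edges(edges, semi_assign))
```

On this toy graph the modulo assignment separates the endpoints of four of the six edges, while the clustered assignment keeps each triangle in one partition and cuts none. Real graphs rarely decompose this cleanly, which is why the splitting step matters for balance.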

Author information

Authors and Affiliations

Dept. of Computer Engineering, Chonbuk National University, Jeonju, South Korea
Seungtae Hong & Jaewoo Chang
Korea Institute of Science and Technology Information (KISTI), Daejeon, South Korea
Jeonghoon Lee & Dong Hoon Choi

Editor information

Editors and Affiliations

UQ Business School, The University of Queensland, 4072, St Lucia, QLD, Australia
Marta Indulska
316B Information Sciences and Technology Building, Penn State University, 16802, University Park, PA, USA
Sandeep Purao

About this paper

Cite this paper

Hong, S., Lee, J., Chang, J., Choi, D.H. (2014). A Semi-clustering Scheme for High Performance PageRank on Hadoop. In: Indulska, M., Purao, S. (eds) Advances in Conceptual Modeling. ER 2014. Lecture Notes in Computer Science, vol 8823. Springer, Cham. https://doi.org/10.1007/978-3-319-12256-4_4

DOI: https://doi.org/10.1007/978-3-319-12256-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12255-7
Online ISBN: 978-3-319-12256-4
eBook Packages: Computer Science, Computer Science (R0)
