Abstract
As global Internet business evolves, large-scale graphs are becoming increasingly common. PageRank computation on large-scale graphs using Hadoop with its default data partitioning method performs poorly, because Hadoop scatters even directly connected vertices across arbitrary nodes. In this paper we propose a semi-clustering scheme that addresses this problem and improves the performance of PageRank on Hadoop. Our scheme divides a graph into a set of semi-clusters, each consisting of connected vertices, and assigns each semi-cluster to a single data partition in order to reduce the cost of data exchange between nodes during the PageRank computation. Before the PageRank computation, the semi-clusters are merged and split so that a large-scale graph is distributed evenly over the data partitions. Our semi-clustering scheme substantially improves performance: total elapsed time, including the cost of the semi-clustering computation, is reduced by up to 36%. Furthermore, the effectiveness of our scheme increases with the size of the graph.
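The partitioning idea can be illustrated with a small Python sketch. It is not the authors' algorithm, which the abstract does not specify: the breadth-first growth of clusters, the size cap `max_size` (standing in for splitting), the greedy packing into the least-loaded partition (standing in for merging), and all function names are assumptions made for illustration only. The sketch counts the edges whose endpoints land in different partitions, a rough proxy for the data exchanged between nodes in each PageRank iteration.

```python
from collections import deque


def semi_clusters(edges, max_size):
    """Grow connected groups of at most max_size vertices by BFS (illustrative)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, clusters = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        cluster, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            cluster.append(u)
            for w in sorted(adj[u]):
                # Claim a neighbour only while the cluster still has room,
                # so every claimed vertex ends up in this cluster.
                if w not in seen and len(cluster) + len(queue) < max_size:
                    seen.add(w)
                    queue.append(w)
        clusters.append(cluster)
    return clusters


def assign(clusters, num_partitions):
    """Place whole clusters, largest first, on the least-loaded partition."""
    loads = [0] * num_partitions
    part = {}
    for cluster in sorted(clusters, key=len, reverse=True):
        p = loads.index(min(loads))
        loads[p] += len(cluster)
        for v in cluster:
            part[v] = p
    return part


def cut_edges(edges, part):
    """Count edges whose endpoints fall in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
vertices = sorted({v for e in edges for v in e})

hashed = {v: v % 2 for v in vertices}  # stand-in for hash-style partitioning
clustered = assign(semi_clusters(edges, max_size=3), num_partitions=2)

print(cut_edges(edges, hashed), cut_edges(edges, clustered))  # prints "5 1"
```

On this toy graph the modulo assignment cuts five of the seven edges, while keeping each triangle in one partition cuts only the bridging edge. The real scheme runs on Hadoop and balances partitions for much larger graphs, so the sketch shows only the direction of the effect, not its magnitude.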
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Hong, S., Lee, J., Chang, J., Choi, D.H. (2014). A Semi-clustering Scheme for High Performance PageRank on Hadoop. In: Indulska, M., Purao, S. (eds) Advances in Conceptual Modeling. ER 2014. Lecture Notes in Computer Science, vol 8823. Springer, Cham. https://doi.org/10.1007/978-3-319-12256-4_4
DOI: https://doi.org/10.1007/978-3-319-12256-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12255-7
Online ISBN: 978-3-319-12256-4
eBook Packages: Computer Science (R0)