DOI: 10.1145/1830252.1830263
research-article

Design patterns for efficient graph algorithms in MapReduce

Published: 24 July 2010

Abstract

Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the World Wide Web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
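To make the message-passing formulation mentioned in the abstract concrete, here is a minimal single-machine sketch of one way PageRank is commonly expressed as map and reduce steps. It is an illustration only, not the paper's implementation and not one of its three design patterns, which the abstract does not describe. The function names (`map_phase`, `shuffle`, `reduce_phase`, `pagerank`), the damping factor of 0.85, and the toy graph are all assumptions made for this example, and dangling nodes (nodes with no out-links) are not handled.

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Emit the graph structure and one rank 'message' per outgoing edge."""
    for node, neighbors in graph.items():
        # Pass the adjacency list along so the reducer can rebuild the graph.
        yield node, ("structure", neighbors)
        if neighbors:
            share = ranks[node] / len(neighbors)
            for dest in neighbors:
                # Message: this node's rank, split evenly across its out-links.
                yield dest, ("mass", share)


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, num_nodes, damping=0.85):
    """Sum incoming messages per node and recover its adjacency list."""
    graph, ranks = {}, {}
    for node, values in grouped.items():
        incoming = 0.0
        neighbors = []
        for kind, payload in values:
            if kind == "structure":
                neighbors = payload
            else:
                incoming += payload
        graph[node] = neighbors
        ranks[node] = (1 - damping) / num_nodes + damping * incoming
    return graph, ranks


def pagerank(graph, iterations=20, damping=0.85):
    """Run a fixed number of map/shuffle/reduce rounds (hypothetical driver)."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        messages = map_phase(graph, ranks)
        graph, ranks = reduce_phase(shuffle(messages), n, damping)
    return ranks


# Toy graph chosen for illustration; every node has at least one out-link.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_graph)
```

Note that in this basic formulation the adjacency lists travel through the shuffle on every iteration alongside the rank messages. That is one place where the serialization and distribution costs named in the abstract can arise.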



Published In

MLG '10: Proceedings of the Eighth Workshop on Mining and Learning with Graphs
July 2010
185 pages
ISBN: 9781450302142
DOI: 10.1145/1830252

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

KDD '10