research-article

Design patterns for efficient graph algorithms in MapReduce

Authors:

Michael Schatz

MLG '10: Proceedings of the Eighth Workshop on Mining and Learning with Graphs

Pages 78 - 85

https://doi.org/10.1145/1830252.1830263

Published: 24 July 2010

Abstract

Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the World Wide Web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
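To make the message-passing formulation concrete, the following is a minimal single-process sketch of PageRank written as map, shuffle, and reduce steps. It illustrates only the basic approach in which each vertex sends a share of its rank to its neighbors; it does not implement the three design patterns, which the abstract does not describe in detail. The function names (`map_phase`, `shuffle`, `reduce_phase`, `pagerank`), the damping factor of 0.85, and the toy graph are assumptions made for illustration, and the sketch assumes every vertex has at least one outgoing edge.

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Emit (vertex, message) pairs: graph structure plus rank contributions."""
    for vertex, neighbors in graph.items():
        # Pass the adjacency list along so the reducer can rebuild the graph.
        yield vertex, ("structure", neighbors)
        if neighbors:
            share = ranks[vertex] / len(neighbors)
            for neighbor in neighbors:
                # Message passing: send an equal share of rank along each edge.
                yield neighbor, ("mass", share)


def shuffle(pairs):
    """Group messages by destination vertex, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, num_vertices, damping=0.85):
    """Sum incoming rank mass per vertex and recover its adjacency list."""
    graph, ranks = {}, {}
    for vertex, messages in grouped.items():
        incoming = 0.0
        neighbors = []
        for kind, payload in messages:
            if kind == "structure":
                neighbors = payload
            else:
                incoming += payload
        graph[vertex] = neighbors
        ranks[vertex] = (1 - damping) / num_vertices + damping * incoming
    return graph, ranks


def pagerank(graph, iterations=20, damping=0.85):
    """Run a fixed number of map/shuffle/reduce iterations."""
    n = len(graph)
    ranks = {vertex: 1.0 / n for vertex in graph}
    for _ in range(iterations):
        grouped = shuffle(map_phase(graph, ranks))
        graph, ranks = reduce_phase(grouped, n, damping)
    return ranks


# Hypothetical toy graph with no dangling vertices.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_graph)
```

Note that the mapper re-emits every adjacency list on every iteration, so the whole graph passes through the shuffle each time. This is one way in which serializing and distributing the graph can come to dominate running time in a naive implementation.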

Index Terms

Design patterns for efficient graph algorithms in MapReduce
1. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models

Published In

MLG '10: Proceedings of the Eighth Workshop on Mining and Learning with Graphs

July 2010

185 pages

ISBN: 9781450302142

DOI: 10.1145/1830252

Conference Chairs:
Ulf Brefeld,
Lise Getoor,
Sofus A. Macskassy

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '10

Sponsor:

KDD '10: The 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

July 24 - 25, 2010

Washington, D.C.
