skip to main content
10.1145/1806596.1806638acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article

FlumeJava: easy, efficient data-parallel pipelines

Published: 05 June 2010 Publication History

Abstract

MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.

References

[1]
Cascading. http://www.cascading.org.
[2]
Hadoop. http://hadoop.apache.org.
[3]
Pig. http://hadoop.apache.org/pig.
[4]
R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1 (2), 2008.
[5]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
[6]
J. Dean. Experiences with MapReduce, an abstraction for large-scale computation. In PACT, 2006.
[7]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51, no. 1, 2008.
[8]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[9]
S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, 2003.
[10]
R. H. Halstead Jr. New ideas in parallel Lisp: Language design, implementation, and programming tools. In Workshop on Parallel Lisp, 1989.
[11]
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.
[12]
J. R. Larus. C**: A large-grain, object-oriented, data-parallel programming language. In LCPC, 1992.
[13]
C. Lasser and S. M. Omohundro. The essential Star-lisp manual. Technical Report 86.15, Thinking Machines, Inc., 1986.
[14]
E. Meijer, B. Beckman, and G. Bierman. LINQ: reconciling objects, relations and XML in the .NET framework. In SIGMOD Conference, 2006.
[15]
R. S. Nikhil and Arvind. Implicit Parallel Programming in pH. Academic Press, 2001.
[16]
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD Conference, 2008.
[17]
R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13 (4), 2005.
[18]
J. R. Rose and G. L. Steele Jr. C*: An extended C language. In C Workshop, 1987.
[19]
H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD Conference, 2007.
[20]
Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, 2008.

Cited By

View all
  • (2024)An Overview of Continuous Querying in (Modern) Data SystemsCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654679(605-612)Online publication date: 9-Jun-2024
  • (2024)Bridging Between Active Objects: Multitier Programming for Distributed, Concurrent SystemsActive Object Languages: Current Research Trends10.1007/978-3-031-51060-1_4(92-122)Online publication date: 29-Jan-2024
  • (2023)TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge GraphsProceedings of the ACM on Management of Data10.1145/36173411:3(1-27)Online publication date: 13-Nov-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2010
514 pages
ISBN:9781450300193
DOI:10.1145/1806596
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 45, Issue 6
    PLDI '10
    June 2010
    496 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/1809028
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 June 2010

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. data-parallel programming
  2. java
  3. mapreduce

Qualifiers

  • Research-article

Conference

PLDI '10
Sponsor:

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)106
  • Downloads (Last 6 weeks)10
Reflects downloads up to 17 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)An Overview of Continuous Querying in (Modern) Data SystemsCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654679(605-612)Online publication date: 9-Jun-2024
  • (2024)Bridging Between Active Objects: Multitier Programming for Distributed, Concurrent SystemsActive Object Languages: Current Research Trends10.1007/978-3-031-51060-1_4(92-122)Online publication date: 29-Jan-2024
  • (2023)TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge GraphsProceedings of the ACM on Management of Data10.1145/36173411:3(1-27)Online publication date: 13-Nov-2023
  • (2023)Robust and Efficient Sorting with Offset-value CodingACM Transactions on Database Systems10.1145/357095648:1(1-23)Online publication date: 13-Mar-2023
  • (2023)Massively Parallel Tree Embeddings for High Dimensional SpacesProceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3558481.3591096(77-88)Online publication date: 17-Jun-2023
  • (2022)StarsProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3601830(21470-21481)Online publication date: 28-Nov-2022
  • (2022)A Review of Multi-Modal Learning from the Text-Guided Visual Processing ViewpointSensors10.3390/s2218681622:18(6816)Online publication date: 8-Sep-2022
  • (2021)Handling Iterations in Distributed Dataflow SystemsACM Computing Surveys10.1145/347760254:9(1-38)Online publication date: 8-Oct-2021
  • (2021)ClamorProceedings of the ACM Symposium on Cloud Computing10.1145/3472883.3486996(654-669)Online publication date: 1-Nov-2021
  • (2021)Massively Parallel Computation via Remote Memory AccessACM Transactions on Parallel Computing10.1145/34706318:3(1-25)Online publication date: 20-Sep-2021
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media