demonstration

Iterative parallel data processing with Stratosphere: an inside look

Authors:

Volker Markl

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 1053 - 1056

https://doi.org/10.1145/2463676.2463693

Published: 22 June 2013

Abstract

Iterative algorithms occur in many domains of data analysis, such as machine learning and graph analysis. With increasing interest in running these algorithms on very large data sets, we see a need for new techniques to execute iterations in a massively parallel fashion. In prior work, we have shown how to extend and use a parallel data flow system to efficiently run iterative algorithms in a shared-nothing environment. Our approach supports the incremental processing nature of many of these algorithms.
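To make the incremental processing idea concrete, here is a minimal single-machine Python sketch. It is not Stratosphere code: it computes connected components with a workset, so that each round only propagates labels from vertices whose label changed in the previous round. The function name, the data layout, and the per-iteration update counter are illustrative assumptions.

```python
from collections import defaultdict


def connected_components(edges, vertices):
    """Label every vertex with the smallest vertex id in its component.

    Incremental iteration: only vertices whose label changed in the
    previous round (the workset) propagate their label in the next round.
    Returns (labels, updates_per_iteration).
    """
    # Build an undirected adjacency list.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Solution set: current label of each vertex, initially its own id.
    labels = {v: v for v in vertices}
    # Workset: vertices whose label changed and still has to be propagated.
    workset = set(vertices)
    updates_per_iteration = []

    while workset:
        # Collect the smallest improving label offered to each neighbor.
        candidates = {}
        for v in workset:
            for n in neighbors[v]:
                if labels[v] < candidates.get(n, labels[n]):
                    candidates[n] = labels[v]
        # Apply the updates; the changed vertices form the next workset.
        for n, label in candidates.items():
            labels[n] = label
        workset = set(candidates)
        updates_per_iteration.append(len(candidates))

    return labels, updates_per_iteration
```

On a small graph the number of updates typically shrinks from round to round, which is the property an incremental iteration exploits: a bulk iteration would instead recompute every label in every round.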

In this demonstration proposal, we illustrate the process of implementing, compiling, optimizing, and executing iterative algorithms on Stratosphere, using examples from graph analysis and machine learning. In the first step, we show the algorithm's code and a visualization of the resulting data flow programs. The second step shows the optimizer's execution plan choices, and the last step monitors the execution of the program, visualizing the state of the operators and additional metrics such as per-iteration runtime and number of updates.

To show that the data flow abstraction supports easy creation of custom programming APIs, we also present programs written against a lightweight Pregel API that is layered on top of our system with a small programming effort.
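As a rough illustration of what a vertex-centric, Pregel-style interface looks like, the following Python sketch runs a user-supplied compute function in synchronized supersteps. It is a toy single-process model under stated assumptions, not the lightweight Pregel API of the system: `run_pregel`, `propagate_max`, and the message convention are hypothetical names chosen for this example.

```python
def run_pregel(values, out_edges, compute, max_supersteps=50):
    """Run a vertex-centric computation in synchronized supersteps.

    values:    dict vertex -> initial value
    out_edges: dict vertex -> list of target vertices (all keys of values)
    compute:   function(vertex, value, messages, superstep)
               -> (new_value, message_or_None); a non-None message is
               sent along every outgoing edge of the vertex.
    A vertex runs in superstep 0 and afterwards only when it has messages.
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        active = [v for v in values if superstep == 0 or inbox[v]]
        if not active:
            break  # no vertex received a message: computation has converged
        outbox = {v: [] for v in values}
        for v in active:
            values[v], message = compute(v, values[v], inbox[v], superstep)
            if message is not None:
                for target in out_edges.get(v, []):
                    outbox[target].append(message)
        inbox = outbox  # messages become visible in the next superstep
    return values


def propagate_max(vertex, value, messages, superstep):
    """Each vertex adopts the largest value it has seen so far."""
    new_value = max([value] + messages)
    changed = superstep == 0 or new_value > value
    return new_value, (new_value if changed else None)
```

For example, `run_pregel({"a": 3, "b": 6, "c": 2, "d": 1}, {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}, propagate_max)` spreads the maximum value along the chain. Vertices that receive no messages stay idle, which mirrors the incremental behavior described in the abstract.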

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN: 9781450320375

DOI: 10.1145/2463676

General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)

Program Chair: Dimitris Papadias (HKUST)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Demonstration

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate: 76 of 372 submissions, 20%

Overall Acceptance Rate: 785 of 4,003 submissions, 20%

Article Metrics

Total Citations: 15
Total Downloads: 454
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 2

Reflects downloads up to 30 Jan 2025

Cited By

Cao Q, She P, Chai X (2023). Flink Task Scheduling Based on LBGA. 2023 4th International Conference on Information Science, Parallel and Distributed Systems (ISPDS), 216-221. Online publication date: 14-Jul-2023.
https://doi.org/10.1109/ISPDS58840.2023.10235469
Gévay G, Soto J, Markl V (2021). Handling Iterations in Distributed Dataflow Systems. ACM Computing Surveys 54:9, 1-38. Online publication date: 8-Oct-2021.
https://dl.acm.org/doi/10.1145/3477602
Romero O, Wrembel R (2020). Data Engineering for Data Science: Two Sides of the Same Coin. Big Data Analytics and Knowledge Discovery, 157-166. Online publication date: 11-Sep-2020.
https://doi.org/10.1007/978-3-030-59065-9_13
Heidari S, Simmhan Y, Calheiros R, Buyya R (2018). Scalable Graph Processing Frameworks. ACM Computing Surveys 51:3, 1-53. Online publication date: 12-Jun-2018.
https://dl.acm.org/doi/10.1145/3199523
To Q, Soto J, Markl V (2018). A survey of state management in big data processing systems. The VLDB Journal 27:6, 847-872. Online publication date: 1-Dec-2018.
https://dl.acm.org/doi/10.1007/s00778-018-0514-9
Schwarzenberg R, Hennig L, Hemsen H (2018). In-Memory Distributed Training of Linear-Chain Conditional Random Fields with an Application to Fine-Grained Named Entity Recognition. Language Technologies for the Challenges of the Digital Age, 155-167. Online publication date: 6-Jan-2018.
https://doi.org/10.1007/978-3-319-73706-5_13
Salmon L, Ray C (2017). Design principles of a stream-based framework for mobility analysis. Geoinformatica 21:2, 237-261. Online publication date: 1-Apr-2017.
https://dl.acm.org/doi/10.1007/s10707-016-0256-z
Junghanns M, Petermann A, Neumann M, Rahm E (2017). Management and Analysis of Big Graph Data: Current Systems and Open Challenges. Handbook of Big Data Technologies, 457-505. Online publication date: 26-Feb-2017.
https://doi.org/10.1007/978-3-319-49340-4_14
Chatzimilioudis G, Costa C, Zeinalipour-Yazti D, Lee W, Pitoura E (2016). Distributed In-Memory Processing of All k Nearest Neighbor Queries. IEEE Transactions on Knowledge and Data Engineering 28:4, 925-938. Online publication date: 1-Apr-2016.
https://dl.acm.org/doi/10.1109/TKDE.2015.2503768
Alexandre da Silva V, Julio C.S. dos A, Edison Pignaton d, Thomas J. L, Claudio F. G (2016). Strategies for Big Data Analytics through Lambda Architectures in Volatile Environments. IFAC-PapersOnLine 49:30, 114-119. Online publication date: 2016.
https://doi.org/10.1016/j.ifacol.2016.11.138
