DOI: 10.1145/2463676.2463693
demonstration

Iterative parallel data processing with Stratosphere: an inside look

Published: 22 June 2013

Abstract

Iterative algorithms occur in many domains of data analysis, such as machine learning and graph analysis. With increasing interest in running these algorithms on very large data sets, we see a need for new techniques to execute iterations in a massively parallel fashion. In prior work, we have shown how to extend and use a parallel data flow system to run iterative algorithms efficiently in a shared-nothing environment. Our approach supports the incremental processing nature of many of these algorithms.
In this demonstration proposal, we illustrate the process of implementing, compiling, optimizing, and executing iterative algorithms on Stratosphere, using examples from graph analysis and machine learning. The first step shows the algorithm's code and a visualization of the produced data flow programs. The second step shows the optimizer's execution plan choices. The last step monitors the execution of the program, visualizing the state of the operators and additional metrics, such as per-iteration runtime and number of updates.
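The incremental processing mentioned above can be illustrated with a small single-machine sketch. It is not Stratosphere code and does not use its API. It assumes a connected-components computation in which only vertices whose label changed in the previous iteration (a "workset") propose updates, and it records the number of updates per iteration as one of the metrics the demonstration visualizes. The names `connected_components`, `workset`, and `updates_per_iteration` are hypothetical and chosen for this illustration only.

```python
from collections import defaultdict


def connected_components(edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Solution set: current component label per vertex (initially its own id).
    labels = {v: v for v in neighbors}
    # Workset: vertices whose label changed in the previous iteration.
    workset = set(labels)
    updates_per_iteration = []

    while workset:
        # Only vertices in the workset propose labels to their neighbors.
        candidates = {}
        for u in workset:
            for v in neighbors[u]:
                if labels[u] < candidates.get(v, labels[v]):
                    candidates[v] = labels[u]
        # Apply all changes after the scan, like a barrier between iterations.
        labels.update(candidates)
        workset = set(candidates)
        updates_per_iteration.append(len(candidates))

    return labels, updates_per_iteration


labels, updates = connected_components([(1, 2), (2, 3), (3, 4), (5, 6)])
print(labels)   # {1: 1, 2: 1, 3: 1, 4: 1, 5: 5, 6: 5}
print(updates)  # [4, 2, 1, 0]
```

The shrinking update counts show why touching only the changed part of the state can pay off: later iterations do far less work than a full recomputation over all vertices would.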
To show that the data flow abstraction supports easy creation of custom programming APIs, we also present programs written against a lightweight Pregel API that is layered on top of our system with little programming effort.
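To give a feel for what a vertex-centric, Pregel-style program looks like, the following is a minimal single-machine sketch. It is an assumption-laden illustration, not the Pregel API layered on Stratosphere: the function `run_supersteps`, the shape of the `compute` callback, and the example `sssp_compute` (single-source shortest paths) are all hypothetical names invented here.

```python
import math


def run_supersteps(values, out_edges, compute, initial_messages):
    """Run supersteps until no messages are in flight.

    values: dict of vertex id -> initial vertex value
    out_edges: dict of vertex id -> list of (target id, edge weight)
    compute: function (value, messages, edges) -> (new value, outgoing),
             where outgoing is a list of (target id, message)
    """
    values = dict(values)
    inbox = dict(initial_messages)
    supersteps = 0
    while inbox:
        outbox = {}
        # Only vertices that received messages are active in this superstep.
        for vertex, messages in inbox.items():
            new_value, outgoing = compute(
                values[vertex], messages, out_edges.get(vertex, [])
            )
            values[vertex] = new_value
            for target, message in outgoing:
                outbox.setdefault(target, []).append(message)
        inbox = outbox  # messages are delivered in the next superstep
        supersteps += 1
    return values, supersteps


def sssp_compute(value, messages, edges):
    """Keep the shortest known distance and notify neighbors on improvement."""
    best = min(messages)
    if best >= value:
        return value, []  # no improvement: send nothing
    return best, [(target, best + weight) for target, weight in edges]


graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": [("d", 1)]}
start = {v: math.inf for v in "abcd"}
distances, steps = run_supersteps(start, graph, sssp_compute, {"a": [0]})
print(distances)  # {'a': 0, 'b': 1, 'c': 3, 'd': 4}
print(steps)      # 4
```

The driver loop is short because the per-vertex logic lives entirely in the callback, which hints at why such an API can sit on top of a general iterative data flow with modest effort.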

    Published In

    SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
    June 2013
    1322 pages
    ISBN:9781450320375
    DOI:10.1145/2463676
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. graph processing
    2. iterative algorithms
    3. machine learning
    4. parallel databases
    5. query execution
    6. query optimization

    Qualifiers

    • Demonstration

    Conference

    SIGMOD/PODS'13

    Acceptance Rates

    SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
