research-article

ScootR: Scaling R Dataframes on Dataflow Systems

Authors:

Daniele Bonetta,

Sebastian Breß,

Volker Markl

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing

Pages 288 - 300

https://doi.org/10.1145/3267809.3267813

Published: 11 October 2018

Abstract

Parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed to cope with today's data volumes. They offer scalability and performance, but they require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced multi-language integrations, e.g., for Python and R. However, these integrations exchange data between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overhead. In this paper, we present ScootR, a novel approach to executing R in dataflow systems. ScootR tightly integrates the dataflow and R language runtimes by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization or inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
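The data-exchange cost described in the abstract can be illustrated with a small sketch. It is not ScootR's implementation and involves neither Flink nor R: it only contrasts calling a user-defined function in the same runtime with a simulated hand-off to an external runtime, where every record is serialized on the way in and on the way out. The function names, the record layout, and the use of `pickle` as a stand-in for inter-process communication are all assumptions made for illustration.

```python
import pickle


def udf(row):
    # Hypothetical user-defined function: add a derived column.
    return {**row, "total": row["price"] * row["qty"]}


def run_in_process(rows):
    # Engine and UDF share one runtime: rows are passed by reference.
    return [udf(row) for row in rows]


def run_with_serialization(rows):
    # Stand-in for an external language runtime: each row is serialized
    # before the call and the result is serialized after it.
    out = []
    bytes_moved = 0
    for row in rows:
        payload = pickle.dumps(row)
        bytes_moved += len(payload)
        result = pickle.dumps(udf(pickle.loads(payload)))
        bytes_moved += len(result)
        out.append(pickle.loads(result))
    return out, bytes_moved


rows = [{"price": float(i), "qty": i % 5} for i in range(1000)]
direct = run_in_process(rows)
shipped, bytes_moved = run_with_serialization(rows)
print(direct == shipped, bytes_moved)
```

Both paths compute the same result; the second additionally encodes and decodes every record twice, which is the kind of work a shared runtime avoids.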

Cited By

Grulich P, Zeuch S, Markl V (2022). Babelfish. Proceedings of the VLDB Endowment 15:2 (196-210). Online publication date: 4-Feb-2022.
https://dl.acm.org/doi/10.14778/3489496.3489501
Rehman M, Elmore A, Rigger M, Tözün P (2022). FuzzyData. Proceedings of the 2022 workshop on 9th International Workshop of Testing Database Systems (17-24). Online publication date: 17-Jun-2022.
https://dl.acm.org/doi/10.1145/3531348.3532178
Abedjan Z, Breß S, Markl V, Rabl T, Soto J (2019). Data Management Systems Research at TU Berlin. ACM SIGMOD Record 47:4 (23-28). Online publication date: 17-May-2019.
https://dl.acm.org/doi/10.1145/3335409.3335415

Index Terms

ScootR: Scaling R Dataframes on Dataflow Systems
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Record and buffer management
    2. Query languages
      1. Query languages for non-relational engines

Published In

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing

October 2018

546 pages

ISBN:9781450360111

DOI:10.1145/3267809

Copyright © 2018 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SoCC '18

SoCC '18: ACM Symposium on Cloud Computing

October 11 - 13, 2018

Carlsbad, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%
