DOI: 10.1145/3267809.3267813
research-article

ScootR: Scaling R Dataframes on Dataflow Systems

Published: 11 October 2018

Abstract

To cope with today's large scale of data, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced some forms of multi-language integrations, e.g., for Python and R. However, this results in data exchange between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to execute R in dataflow systems. ScootR tightly integrates the dataflow and R language runtime by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization and inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
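The data-exchange overhead described in the abstract can be illustrated with a minimal sketch. It does not use Flink, R, Truffle, or Graal, and it is not ScootR's implementation: the function names (`udf`, `apply_in_process`, `apply_via_serialization`) and the sample rows are hypothetical, and `pickle` merely stands in for whatever wire format an engine and an external language runtime would exchange. The sketch contrasts calling a user-defined function directly with routing every row through a serialize/deserialize round trip.

```python
import pickle


def udf(row):
    # Hypothetical user-defined function: add a derived column to a row.
    return {**row, "total": row["price"] * row["qty"]}


def apply_in_process(rows, fn):
    # Call fn directly on each row; nothing is copied across a boundary.
    return [fn(row) for row in rows]


def apply_via_serialization(rows, fn):
    # Mimic an external language runtime: each row is serialized,
    # deserialized, processed, and the result is serialized back.
    results = []
    bytes_moved = 0
    for row in rows:
        request = pickle.dumps(row)        # engine -> external runtime
        result = fn(pickle.loads(request))
        response = pickle.dumps(result)    # external runtime -> engine
        bytes_moved += len(request) + len(response)
        results.append(pickle.loads(response))
    return results, bytes_moved


rows = [{"price": 2.5, "qty": 4}, {"price": 1.0, "qty": 3}]
direct = apply_in_process(rows, udf)
exchanged, bytes_moved = apply_via_serialization(rows, udf)
```

Both paths produce the same rows, but the second one encodes and decodes every record twice. In a real multi-process integration those bytes would also cross a pipe or socket, which is the cost the abstract attributes to inter-process communication.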

Published In

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
October 2018
546 pages
ISBN: 9781450360111
DOI: 10.1145/3267809

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data Exchange
  2. Dataflow Engines
  3. Language Integration

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '18
SoCC '18: ACM Symposium on Cloud Computing
October 11 - 13, 2018
Carlsbad, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%
