ABSTRACT
To cope with today's data volumes, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced some forms of multi-language integration, e.g., for Python and R. However, this integration requires data exchange between the dataflow engine and the integrated language runtime, which relies on inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to executing R in dataflow systems. ScootR tightly integrates the dataflow and R language runtimes by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization or inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
Index Terms
- ScootR: Scaling R Dataframes on Dataflow Systems