DOI: 10.1145/3267809.3267813
research-article

ScootR: Scaling R Dataframes on Dataflow Systems

Published: 11 October 2018

ABSTRACT

To cope with today's large data volumes, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced some forms of multi-language integration, e.g., for Python and R. However, this results in data exchange between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to execute R in dataflow systems. ScootR tightly integrates the dataflow and R language runtimes by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization and inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
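
The overhead described in the abstract can be illustrated with a small sketch. It is not ScootR's implementation and does not use Flink, R, Truffle, or Graal. It is written in Python purely for illustration and assumes a hypothetical per-record function `udf` and two hypothetical helpers, `run_in_process` and `run_with_exchange`. The second helper only imitates the serialization cost of a runtime boundary with `pickle`; no real inter-process communication takes place.

```python
import pickle


def udf(row):
    # Hypothetical user function: add a derived column to one record.
    return {**row, "total": row["price"] * row["qty"]}


def run_in_process(rows, fn):
    # The engine calls the function directly on its in-memory records.
    return [fn(row) for row in rows]


def run_with_exchange(rows, fn):
    # Stand-in for a separate language runtime: every record is
    # serialized on the way in and the result serialized on the way out.
    # Only the serialization step is imitated here, not actual IPC.
    out = []
    for row in rows:
        wire_in = pickle.dumps(row)          # engine -> language runtime
        result = fn(pickle.loads(wire_in))   # function runs on a copy
        wire_out = pickle.dumps(result)      # language runtime -> engine
        out.append(pickle.loads(wire_out))
    return out


rows = [{"price": 2.5, "qty": i} for i in range(1000)]
direct = run_in_process(rows, udf)
exchanged = run_with_exchange(rows, udf)
```

Both paths produce the same records; the second one simply does extra work per record, which is the kind of cost the abstract attributes to data exchange between the engine and an external language runtime.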

        Published in

        SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
        October 2018
        546 pages
        ISBN: 9781450360111
        DOI: 10.1145/3267809

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate: 169 of 722 submissions, 23%
