Riposte: a trace-driven compiler and parallel VM for vector code in R

ABSTRACT
There is a growing utilization gap between modern hardware and modern programming languages for data analysis. Due to power and other constraints, recent processor designs have sought improved performance through increased SIMD and multi-core parallelism. At the same time, high-level, dynamically typed languages for data analysis have become popular. These languages emphasize ease of use and programmer productivity, but generally offer low performance and limited support for exploiting hardware parallelism.
In this paper, we describe Riposte, a new runtime for the R language, which bridges this gap. Riposte uses tracing, a technique commonly used to accelerate scalar code, to dynamically discover and extract sequences of vector operations from arbitrary R code. Once extracted, we can fuse traces to eliminate unnecessary memory traffic, compile them to use hardware SIMD units, and schedule them to run across multiple cores, allowing us to fully utilize the available parallelism on modern shared-memory machines. Our evaluation shows that Riposte can run vector R code near the speed of hand-optimized C, 5--50x faster than the open source implementation of R, and can also linearly scale to 32 cores for some tasks. Across 12 different workloads we achieve an overall average speed-up of over 150x without explicit programmer parallelization.
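To make the memory-traffic argument behind trace fusion concrete, the sketch below (in Python with made-up function names; Riposte itself operates on traces extracted from R code) contrasts naive evaluation of a vector expression, which materializes one temporary array per operator, with a fused single-pass evaluation of the same expression:

```python
def eval_naive(a, b, c):
    """Evaluate (a + b) * c one operator at a time.

    The intermediate result of a + b is written to memory in full,
    then read back for the multiply -- extra memory traffic that
    grows with vector length.
    """
    t = [ai + bi for ai, bi in zip(a, b)]       # temporary vector materialized
    return [ti * ci for ti, ci in zip(t, c)]    # temporary read back


def eval_fused(a, b, c):
    """Evaluate (a + b) * c in a single fused pass.

    Each element is loaded once, combined, and stored once; no
    intermediate vector ever touches memory. This per-element loop is
    also the form a compiler can map onto hardware SIMD lanes and
    partition across cores.
    """
    return [(ai + bi) * ci for ai, bi, ci in zip(a, b, c)]


a, b, c = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
assert eval_naive(a, b, c) == eval_fused(a, b, c)  # → [20.0, 36.0]
```

Both versions compute the same result; the fused form simply avoids the intermediate array, which is the traffic that trace fusion eliminates before SIMD compilation and multi-core scheduling.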