research-article

Public Access

FlashR: parallelize and scale R for machine learning using SSDs

Authors:

Joshua T. Vogelstein,

Carey E. Priebe,

Randal BurnsAuthors Info & Claims

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 183 - 194

https://doi.org/10.1145/3178487.3178501

Published: 10 February 2018 Publication History

Abstract

R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The general approach for having an efficient algorithm in R is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR performs memory hierarchy aware execution to speed up parallelized R code by (i) evaluating matrix operations lazily, (ii) performing all operations in a DAG in a single execution and with only one pass over data to increase the ratio of computation to I/O, (iii) performing two levels of matrix partitioning and reordering computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on various machine learning and statistics algorithms on inputs of up to four billion data points. Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperforms H₂O and Spark MLlib by a factor of 3 -- 20.

References

[1]

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16).

Digital Library

[2]

Jeff Bilmes. 1998. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report. International Computer Science Institute.

[3]

Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, and Shirish Tatikonda. 2016. SystemML: Declarative Machine Learning on Spark. Proc. VLDB Endow. 9, 13 (Sept. 2016), 1425--1436.

Digital Library

[4]

Matthias Boehm, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen, Yuanyuan Tian, Douglas R. Burdick, and Shivakumar Vaithyanathan. 2014. Hybrid Parallelization Strategies for Large-scale Machine Learning in SystemML. Proc. VLDB Endow. 7, 7 (March 2014), 553--564.

Digital Library

[5]

Hassan Chafi, Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Anand R. Atreya, and Kunle Olukotun. 2011. A Domain-Specific Approach to Heterogeneous Parallelism. In Proceedings of the 16th Annual Symposium on Principles and Practice of Parallel Programming.

Digital Library

[6]

Wai-Mee Ching and Da Zheng. 2012. Automatic Parallelization of Array-oriented Programs for a Multi-core Machine. International Journal of Parallel Programming 40, 5 (2012), 514--531.

[7]

Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. 2006. Map-reduce for Machine Learning on Multicore. In Proceedings of the 19th International Conference on Neural Information Processing Systems.

Digital Library

[8]

criteo Accessed 2/11/2017. Criteo's 1TB Click Prediction Dataset. https://blogs.technet.microsoft.com/machinelearning/2015/04/01/now-available-on-azure-ml-criteos-1tb-click-prediction-dataset/. (Accessed 2/11/2017).

[9]

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6 (OSDI'04). USENIX Association, Berkeley, CA, USA.

Digital Library

[10]

Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, and Berthold Reinwald. 2016. Compressed Linear Algebra for Large-scale Machine Learning. Proc. VLDB Endow. 9, 12 (Aug. 2016), 960--971.

Digital Library

[11]

Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. 2006. Sequoia: Programming the Memory Hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing.

Digital Library

[12]

Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. 2011. SystemML: Declarative Machine Learning on MapReduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA.

Digital Library

[13]

Leo J. Guibas and Douglas K. Wyatt. 1978. Compilation and Delayed Evaluation in APL. In Proceedings of the 5th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages.

Digital Library

[14]

H2O Accessed 2/7/2017. H2O machine learning library. http://www.h2o.ai/. (Accessed 2/7/2017).

[15]

Richard E. Ladner and Michael J. Fischer. 1980. Parallel Prefix Computation. J. ACM 27, 4 (Oct. 1980), 831--838.

Digital Library

[16]

D. C. Liu and J. Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming: Series A and B (1989).

[17]

Hang Liu and H. Howie Huang. 2017. Graphene: Fine-Grained IO Management for Graph Computing. In 15th USENIX Conference on File and Storage Technologies (FAST 17). Santa Clara, CA.

Digital Library

[18]

S. Lloyd. 2006. Least Squares Quantization in PCM. IEEE Trans. Inf. Theor. 28, 2 (Sept. 2006).

Digital Library

[19]

Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow. 5, 8 (2012).

Digital Library

[20]

mass Accessed 2/12/2017. Package MASS. https://cran.r-project.org/web/packages/MASS/index.html. (Accessed 2/12/2017).

[21]

Alexander Matveev, Yaron Meirovitch, Hayk Saribekyan, Wiktor Jakubiuk, Tim Kaler, Gergely Odor, David Budden, Aleksandar Zlateski, and Nir Shavit. 2017. A Multicore Path to Connectomics-on-Demand. In Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

Digital Library

[22]

Frank McSherry, Michael Isard, and Derek G. Murray. 2015. Scalability! But at what COST?. In 15th Workshop on Hot Topics in Operating Systems (HotOS XV).

Digital Library

[23]

Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. 2015. MLlib: Machine Learning in Apache Spark. The Journal of Machine Learning Research 17, 1 (2015).

Digital Library

[24]

Stavros Papadopoulos, Kushal Datta, Samuel Madden, and Timothy Mattson. 2016. The TileDB Array Data Storage Manager. Proc. VLDB Endow. 10, 4 (Nov. 2016), 349--360.

Digital Library

[25]

Karl Pearson. 1895. Notes on regression and inheritance in the case of two parents. In Proceedings of the Royal Society of London. 240--242.

[26]

Gregorio Quintana-Ortí, Francisco D. Igual, Mercedes Marqués, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. 2012. A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures. ACM Trans. Math. Softw. 38, 4 (Aug. 2012), 25:1--25:25.

Digital Library

[27]

rro Accessed 2/12/2017. Microsoft R Open. https://mran.microsoft.com/open/. (Accessed 2/12/2017).

[28]

Francis P. Russell, Michael R. Mellor, Paul H. J. Kelly, and Olav Beckmann. 2011. DESOLA: An Active Linear Algebra Library Using Delayed Evaluation and Runtime Code Generation. Sci. Comput. Program. (2011).

Digital Library

[29]

Arvind K. Sujeeth, Hyoukjoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Anand R. Atreya, Kunle Olukotun, Tiark Rompf, and Martin Odersky. 2011. OptiML: an implicitly parallel domainspecific language for machine learning. In in Proceedings of the 28th International Conference on Machine Learning.

Digital Library

[30]

J. Talbot, Z. DeVito, and P. Hanrahan. 2012. Riposte: A trace-driven compiler and parallel VM for vector code in R. In 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

Digital Library

[31]

Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li. 2015. RIPQ: Advanced Photo Caching on Flash for Facebook. In 13th USENIX Conference on File and Storage Technologies (FAST 15). Santa Clara, CA.

Digital Library

[32]

Sivan Toledo. 1999. External Memory Algorithms. Boston, MA, USA, Chapter A Survey of Out-of-core Algorithms in Numerical Linear Algebra, 161--179.

Digital Library

[33]

webgraph Accessed 4/18/2014. Web graph. http://webdatacommons.org/hyperlinkgraph/. (Accessed 4/18/2014).

[34]

Maurice V. Wilkes. 2001. The Memory Gap and the Future of High Performance Memories. SIGARCH Comput. Archit. News 29, 1 (March 2001), 2--7.

Digital Library

[35]

Eric P. Xing, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. 2015. Petuum: A New Platform for Distributed Machine Learning on Big Data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Digital Library

[36]

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX, San Jose, CA, 15--28.

Digital Library

[37]

Da Zheng, Randal Burns, and Alexander S. Szalay. 2013. Toward Millions of File System IOPS on Low-Cost, Commodity Hardware. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.

Digital Library

[38]

Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E. Priebe, and Alexander S. Szalay. 2015. FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs. In 13th USENIX Conference on File and Storage Technologies (FAST 15).

Digital Library

[39]

Da Zheng, Disa Mhembere, Vince Lyzinski, Joshua Vogelstein, Carey E. Priebe, and Randal Burns. 2016. Semi-External Memory Sparse Matrix Multiplication on Billion-node Graphs. IEEE Transactions on Parallel & Distributed Systems (2016).

Digital Library

[40]

Xiaowei Zhu, Wentao Han, and Wenguang Chen. 2015. GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. In 2015 USENIX Annual Technical Conference (USENIX ATC 15).

Digital Library

Cited By

Ernstsson AKessler C(2018)Extending smart containers for data locality‐aware skeleton programmingConcurrency and Computation: Practice and Experience10.1002/cpe.500331:5Online publication date: 22-Oct-2018
https://doi.org/10.1002/cpe.5003

Recommendations

FlashR: parallelize and scale R for machine learning using SSDs
PPoPP '18

R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The general approach for having an efficient algorithm in R is to implement it in C or FORTRAN and provide an R ...
DCS5: Diagonal Coding Scheme for Enhancing the Endurance of SSD-Based RAID-5 Systems
NAS '14: Proceedings of the 2014 9th IEEE International Conference on Networking, Architecture, and Storage

Solid-state drives (SSDs) have been widely deployed in large-scale storage systems. To guarantee high reliability for SSD-based storage systems, it still requires data redundancy schemes, e.g., RAID schemes. Traditional RAID-5 shows its benefits in load-...
ACS: an alternate coding scheme to improve degrade read performance for SSD-based RAID5 systems

To guarantee high performance and reliability, storage systems require better devices and data redundancy schemes, e.g., SSD-based RAID5. However, failures in the large-scale storage systems are common. In order to serve requests on a failed node, the SSD-...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2018

442 pages

ISBN:9781450349826

DOI:10.1145/3178487

General Chair:
Andreas Krall
Vienna University of Technology, Austria
,
Program Chair:
Thomas R. Gross
ETH Zürich, Switzerland

ACM SIGPLAN Notices Volume 53, Issue 1
PPoPP '18
January 2018
426 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3200691
Editor:
Matthew Fluet
Rodchester Institude of Technology
Issue’s Table of Contents

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 February 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

NSF

Conference

PPoPP '18

Sponsor:

PPoPP '18: 23nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
615
Total Downloads

Downloads (Last 12 months)126
Downloads (Last 6 weeks)8

Reflects downloads up to 03 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Ernstsson AKessler C(2018)Extending smart containers for data locality‐aware skeleton programmingConcurrency and Computation: Practice and Experience10.1002/cpe.500331:5Online publication date: 22-Oct-2018
https://doi.org/10.1002/cpe.5003

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten