DOI: 10.1145/3655038.3665943
research-article
Open access

Rethinking Erasure-Coding Libraries in the Age of Optimized Machine Learning

Published: 08 July 2024

Abstract

Erasure codes are critical tools for building fault-tolerant and resource-efficient storage systems. However, developing and maintaining optimized erasure-coding libraries is challenging. We make the case that the growth of fast machine-learning (ML) libraries may serve as a lifeboat for easing the development of current and future optimized erasure-coding libraries: fast erasure-coding libraries for various hardware platforms can be easily implemented by using existing optimized ML libraries. We show that the computation structure of many erasure codes mirrors that of matrix multiplication, which is heavily optimized in ML libraries. Due to this similarity, one can implement erasure codes using ML libraries in a few lines of code and with little knowledge of erasure codes, while immediately adopting the many optimizations within these libraries, without requiring expertise in high-performance programming. We develop prototypes of our proposed approach using an existing ML library. Our prototypes are up to 1.75× faster than state-of-the-art custom erasure-coding libraries.
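
The correspondence between erasure coding and matrix multiplication can be illustrated with a small sketch. This is not the authors' prototype: it assumes a toy XOR-based code with a hypothetical 2×3 parity matrix `P`, and it uses NumPy's integer matrix product reduced modulo 2 as a stand-in for an optimized ML library's matrix-multiplication kernel. The function name `encode` and all sizes are illustrative choices.

```python
import numpy as np

def encode(parity_matrix: np.ndarray, data_bits: np.ndarray) -> np.ndarray:
    """Compute parity bits as a matrix product over GF(2).

    parity_matrix: (r, k) 0/1 matrix; row i marks the data units XORed into parity i.
    data_bits:     (k, n) 0/1 matrix; row j holds the n bits of data unit j.
    """
    # The integer product counts how many selected bits are 1;
    # reducing modulo 2 turns that sum into an XOR.
    product = parity_matrix.astype(np.int64) @ data_bits.astype(np.int64)
    return (product % 2).astype(np.uint8)

# Hypothetical toy code: k = 3 data units, r = 2 parity units.
P = np.array([[1, 1, 1],
              [1, 0, 1]], dtype=np.uint8)

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(3, 16), dtype=np.uint8)
parity = encode(P, data)

# The matrix product matches a hand-written XOR of the selected data units.
expected = np.stack([data[0] ^ data[1] ^ data[2],
                     data[0] ^ data[2]])
assert np.array_equal(parity, expected)

# If data unit 1 is lost, parity 0 and the surviving units recover it.
recovered = parity[0] ^ data[0] ^ data[2]
assert np.array_equal(recovered, data[1])
```

In this sketch, changing the code only means changing the entries of `P`; the encoding routine itself stays a single matrix product, which is the kind of operation ML libraries already optimize for many hardware platforms.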


    Published In

    HotStorage '24: Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems
    July 2024
    141 pages
    ISBN:9798400706301
    DOI:10.1145/3655038
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. erasure coding
    2. machine learning
    3. redundancy

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    HOTSTORAGE '24

    Acceptance Rates

    Overall Acceptance Rate 34 of 87 submissions, 39%
