ABSTRACT
Today’s supercomputers are more heterogeneous than ever. As the share of AI workloads in data centers grows, so does the share of GPUs and AI-specific hardware. AI accelerators differ from traditional hardware in ways that affect every aspect of system design, from the data-center scale down to the single chip. For some HPC workloads, especially in AI for Science, AI accelerators are far more efficient than CPUs or GPUs, but they also add complexity to system architecture, management, and programming. Although runtime frameworks are critical to taming this complexity, there is little literature describing AI accelerator runtimes. In this paper, we introduce RDARuntime, an AI-specific OS tailored to the development and operation of SambaNova’s reconfigurable dataflow architecture. We discuss its architecture, our design decisions, and the results we have achieved, along with lessons learned while helping to deploy the Reconfigurable Dataflow Unit (RDU) in production environments.