ABSTRACT
Today’s supercomputers are more heterogeneous than ever. As the share of AI workloads in data centers grows, so does the share of GPUs and AI-specific hardware. AI accelerators differ from traditional hardware in ways that affect every aspect of system design, from the data-center scale down to the single chip. For some HPC workloads, especially in AI for Science, AI accelerators are far more efficient than CPUs or GPUs, but they also add complexity to system architecture, management, and programming. Although runtime frameworks are critical to taming this complexity, there is little literature describing AI accelerator runtimes. In this paper, we introduce RDARuntime, an AI-specific OS tailored to the development and operation of SambaNova’s reconfigurable dataflow architecture. We discuss its architecture, our design decisions, and the results we have achieved, along with lessons learned while helping to deploy the Reconfigurable Dataflow Unit (RDU) in production environments.