Research Article | Open Access
DOI: 10.1145/3620666.3651329 (ASPLOS '24)

Characterizing Power Management Opportunities for LLMs in the Cloud

Published: 27 April 2024

ABSTRACT

Recent innovation in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support these new workloads. A key bottleneck resource in datacenters is power, which LLMs are quickly saturating due to their rapidly increasing model sizes.

We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the training and inference power consumption patterns. Based on our analysis, we claim that the average and peak power utilization in LLM inference clusters should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment makes it challenging to build a reliable and robust power management framework.
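For readers who want to reproduce this kind of characterization, per-GPU power telemetry can be collected with standard tools such as nvidia-smi or DCGM. The Python sketch below shows one minimal way to record a power trace while a training or inference job runs; the sampling interval, output format, and helper names are illustrative assumptions rather than the paper's exact methodology.

```python
# Minimal sketch of GPU power-trace collection, assuming an NVIDIA server
# where nvidia-smi is available. Interval and schema are illustrative choices.
import csv
import subprocess
import time

def sample_gpu_power():
    """Return (gpu_index, power_draw_watts) for each GPU on this server."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    samples = []
    for line in out.strip().splitlines():
        idx, watts = line.split(",")
        samples.append((int(idx), float(watts)))
    return samples

def trace_power(duration_s=60, interval_s=0.1, path="power_trace.csv"):
    """Record a per-GPU power trace while a workload runs on the server."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_s", "gpu_index", "power_w"])
        start = time.time()
        while time.time() - start < duration_s:
            t = time.time() - start
            for idx, watts in sample_gpu_power():
                writer.writerow([f"{t:.2f}", idx, f"{watts:.1f}"])
            time.sleep(interval_s)

if __name__ == "__main__":
    trace_power()
```

Aggregating such traces across servers is what exposes the gap between provisioned power and observed average or peak draw.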

We leverage the insights from our characterization to identify opportunities for better power management. As a detailed use case, we propose a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in existing clusters with minimal performance loss.
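To make the oversubscription mechanism concrete, the following sketch outlines a simple control loop of the kind such a framework might build on: it compares aggregate power against a provisioned budget and applies or releases per-GPU power caps through nvidia-smi. The budget, thresholds, and read_group_power callback are hypothetical illustrations; POLCA's actual policies and actuation mechanisms are described in the full paper, not here.

```python
# Hedged sketch of a power-oversubscription control loop; not the authors'
# implementation. Assumes a hypothetical provisioned budget for a server
# group and uses nvidia-smi's power-limit knob (-pl, typically root-only)
# as the throttling mechanism.
import subprocess
import time

PROVISIONED_BUDGET_W = 5000.0   # illustrative budget for a group of servers
CAP_THRESHOLD = 0.95            # start capping at 95% of the budget
UNCAP_THRESHOLD = 0.85          # release caps once safely below the budget
CAPPED_LIMIT_W = 300            # illustrative per-GPU cap in watts
DEFAULT_LIMIT_W = 400           # illustrative default per-GPU limit

def set_gpu_power_limit(watts):
    """Apply a power limit to every GPU on this server via nvidia-smi."""
    subprocess.run(["nvidia-smi", "-pl", str(watts)], check=True)

def control_loop(read_group_power, interval_s=1.0):
    """Throttle GPUs when aggregate power nears the provisioned budget.

    read_group_power: hypothetical callback returning aggregate watts
    across the oversubscribed server group.
    """
    capped = False
    while True:
        group_power = read_group_power()
        if not capped and group_power > CAP_THRESHOLD * PROVISIONED_BUDGET_W:
            set_gpu_power_limit(CAPPED_LIMIT_W)
            capped = True
        elif capped and group_power < UNCAP_THRESHOLD * PROVISIONED_BUDGET_W:
            set_gpu_power_limit(DEFAULT_LIMIT_W)
            capped = False
        time.sleep(interval_s)
```

The intuition is that caps are applied rarely and briefly, which is why inference clusters can host noticeably more servers under the same provisioned power with minimal performance impact.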


Published in

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
April 2024, 1106 pages
ISBN: 9798400703867
DOI: 10.1145/3620666
Publisher: Association for Computing Machinery, New York, NY, United States
Published: 27 April 2024

This work is licensed under a Creative Commons Attribution 4.0 International License.
