DOI: 10.1145/3627535.3638466
Research Article

Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference

Published: 20 February 2024

Abstract

Distributed large model inference still faces a dilemma in balancing cost and effectiveness. Online scenarios demand intra-operator parallelism to achieve low latency, but its intensive communication makes it costly. Conversely, inter-operator parallelism can achieve high throughput with far less communication, yet it does little to reduce latency.
In this paper, we present Liger, a distributed large model inference runtime system capable of achieving low latency at high throughput on multi-GPU architectures. The key idea is a novel interleaved parallelism, which interleaves computation and communication across requests. Liger enables this parallelism by carefully scheduling computation and communication kernels across requests onto multiple streams of multiple GPUs. It achieves precise and efficient control of kernel execution order by combining CPU-GPU synchronization with inter-stream synchronization. To prevent scheduling failures caused by resource contention, Liger introduces a contention factor strategy that anticipates the penalty of contention. It enables a higher degree of overlap by decomposing lengthy kernels into smaller, more manageable units at runtime.
Extensive evaluations show that Liger, in most cases, outperforms existing parallelism approaches across models and devices, delivering the best latency and throughput results. In a 4-device case, Liger reduces average latency by 36.0% while maintaining the same throughput as the inter-operator approach. Meanwhile, it improves throughput by 1.34× with improved average latency compared to the intra-operator approach.
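
To make the scheduling idea above concrete, here is a minimal CUDA sketch of interleaving across two requests on one GPU. It is our illustration under assumed names (compute_kernel, req_a, req_b), not Liger's implementation: request A's computation runs on one stream while request B's communication (modeled as an asynchronous device-to-device copy standing in for a collective) overlaps it on another stream, and a CUDA event enforces that request A's own communication starts only after its computation completes.

#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for one operator's computation (e.g., part of a transformer layer).
__global__ void compute_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 20;
    float *req_a, *req_a_out, *req_b, *req_b_out;   // hypothetical per-request buffers
    cudaMalloc(&req_a, N * sizeof(float));
    cudaMalloc(&req_a_out, N * sizeof(float));
    cudaMalloc(&req_b, N * sizeof(float));
    cudaMalloc(&req_b_out, N * sizeof(float));

    cudaStream_t compute_stream, comm_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&comm_stream);
    cudaEvent_t a_compute_done;
    cudaEventCreateWithFlags(&a_compute_done, cudaEventDisableTiming);

    // Request A: computation on the compute stream.
    compute_kernel<<<(N + 255) / 256, 256, 0, compute_stream>>>(req_a, N);
    cudaEventRecord(a_compute_done, compute_stream);

    // Request B: its "communication" (an async copy standing in for an
    // all-reduce) overlaps request A's computation on the other stream.
    cudaMemcpyAsync(req_b_out, req_b, N * sizeof(float),
                    cudaMemcpyDeviceToDevice, comm_stream);

    // Inter-stream synchronization: request A's communication may start only
    // after its computation has finished.
    cudaStreamWaitEvent(comm_stream, a_compute_done, 0);
    cudaMemcpyAsync(req_a_out, req_a, N * sizeof(float),
                    cudaMemcpyDeviceToDevice, comm_stream);

    // CPU-GPU synchronization: the host waits for both streams before reuse.
    cudaStreamSynchronize(compute_stream);
    cudaStreamSynchronize(comm_stream);
    printf("interleaved schedule finished\n");

    cudaFree(req_a); cudaFree(req_a_out); cudaFree(req_b); cudaFree(req_b_out);
    cudaStreamDestroy(compute_stream); cudaStreamDestroy(comm_stream);
    cudaEventDestroy(a_compute_done);
    return 0;
}

A real scheduler would issue such compute/communication pairs for a whole batch of requests and, as the paper describes, split long kernels into smaller pieces at runtime so that more of this overlap becomes available.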


Cited By

• (2024) Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision. ACM Transactions on Embedded Computing Systems 24, 1 (2024), 1-100. DOI: 10.1145/3701728. Online publication date: 24 October 2024.

Published In

PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
March 2024, 498 pages
ISBN: 9798400704352
DOI: 10.1145/3627535

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. distributed large model inference
2. GPU scheduling
3. multi-GPU architecture


Conference

PPoPP '24

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%
