DOI: 10.1145/3603269.3610856

Poster: PipeLLM: Pipeline LLM Inference on Heterogeneous Devices with Sequence Slicing

Published: 01 September 2023

Abstract

The rise of Large Language Models (LLMs) has fostered innovative application requirements. Locally deployed LLMs allow micro-enterprises to mitigate issues such as privacy infringement and sluggish response times, but such deployments are hampered by the limited computing capability and memory of the devices at hand. We introduce PipeLLM, which partitions the model across devices commensurate with their computing capabilities and enables parallel execution of layers by slicing the input sequence along the token dimension. PipeLLM demonstrates the potential to accelerate LLM inference on heterogeneous devices, offering a practical solution for deploying LLMs on micro-enterprise hardware.
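To make the abstract's two mechanisms concrete, the sketch below illustrates proportional layer allocation and token-dimension sequence slicing. It is a minimal illustration under assumed interfaces, not the authors' implementation: the names allocate_layers, pipeline_infer, device_flops, and slice_len are hypothetical, and the per-device FLOPS figures are made up.

```python
# Minimal sketch (not the paper's code) of: (1) assigning contiguous
# layer blocks to devices in proportion to their compute capability,
# and (2) pipelining inference by slicing the input sequence along
# the token dimension.

def allocate_layers(num_layers, device_flops):
    """Split num_layers into contiguous blocks sized proportionally
    to each device's estimated throughput (hypothetical FLOPS)."""
    total = sum(device_flops)
    sizes = [round(num_layers * f / total) for f in device_flops]
    sizes[-1] = num_layers - sum(sizes[:-1])  # absorb rounding drift
    blocks, start = [], 0
    for size in sizes:
        blocks.append(range(start, start + size))
        start += size
    return blocks

def pipeline_infer(stages, tokens, slice_len):
    """Run token slices through the pipeline stages. Shown sequentially
    for clarity; on real hardware, slice i+1 enters stage 0 as soon as
    slice i moves on to stage 1, so stages execute in parallel. A real
    causal LLM must also forward the KV cache of earlier slices so
    later tokens can attend to them."""
    outputs = []
    for start in range(0, len(tokens), slice_len):
        hidden = tokens[start:start + slice_len]
        for stage in stages:  # each stage would live on its own device
            hidden = stage(hidden)
        outputs.extend(hidden)
    return outputs

if __name__ == "__main__":
    # Toy example: a fast device (30 TFLOPS) and a slow one (10 TFLOPS)
    # share a 32-layer model 24/8.
    print(allocate_layers(32, [30e12, 10e12]))  # [range(0, 24), range(24, 32)]
    # Two toy stages that just transform values, fed 4-token slices.
    stages = [lambda xs: [x * 2 for x in xs], lambda xs: [x + 1 for x in xs]]
    print(pipeline_infer(stages, list(range(8)), slice_len=4))
```

The point of slicing along the token dimension is utilization: every pipeline stage stays busy on a long prompt, whereas pipelining whole sequences would leave downstream devices idle until the first device finishes its layers.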




      Published In

      ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
      September 2023
      1217 pages
ISBN: 9798400702365
DOI: 10.1145/3603269
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. LLM inference acceleration
      2. pipeline inference
      3. model deployment

      Qualifiers

      • Poster


      Conference

      ACM SIGCOMM '23
      Sponsor:
      ACM SIGCOMM '23: ACM SIGCOMM 2023 Conference
      September 10, 2023
New York, NY, USA

      Acceptance Rates

      Overall Acceptance Rate 462 of 3,389 submissions, 14%


      Article Metrics

• Downloads (last 12 months): 693
• Downloads (last 6 weeks): 36
      Reflects downloads up to 01 Mar 2025


      Cited By

• (2024) Proposal of User Interface Based on Heavy User Usage Analysis in LLM Service. Archives of Design Research 37(4), 287-313. DOI: 10.15187/adr.2024.08.37.4.287. Online publication date: 31-Aug-2024.
• (2024) GestureGPT: Toward Zero-Shot Free-Form Hand Gesture Understanding with Large Language Model Agents. Proceedings of the ACM on Human-Computer Interaction 8(ISS), 462-499. DOI: 10.1145/3698145. Online publication date: 24-Oct-2024.
• (2024) Poster: Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks. Proceedings of the ACM SIGCOMM 2024 Conference: Posters and Demos, 60-62. DOI: 10.1145/3672202.3673744. Online publication date: 4-Aug-2024.
• (2024) Resource Allocation for Stable LLM Training in Mobile Edge Computing. Proceedings of the Twenty-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 81-90. DOI: 10.1145/3641512.3686358. Online publication date: 14-Oct-2024.
• (2024) SLoB: Suboptimal Load Balancing Scheduling in Local Heterogeneous GPU Clusters for Large Language Model Inference. IEEE Transactions on Computational Social Systems 11(6), 7941-7951. DOI: 10.1109/TCSS.2024.3423749. Online publication date: Dec-2024.
• (2024) LLM-Based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness. IEEE Open Journal of the Communications Society 5, 5799-5856. DOI: 10.1109/OJCOMS.2024.3456549. Online publication date: 2024.
• (2024) CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices. 2024 IEEE/CIC International Conference on Communications in China (ICCC), 185-190. DOI: 10.1109/ICCC62479.2024.10681712. Online publication date: 7-Aug-2024.
• (2024) Advanced Deep Learning Models for 6G: Overview, Opportunities, and Challenges. IEEE Access 12, 133245-133314. DOI: 10.1109/ACCESS.2024.3418900. Online publication date: 2024.
• (2024) BC4LLM: A perspective of trusted artificial intelligence when blockchain meets large language models. Neurocomputing 599, 128089. DOI: 10.1016/j.neucom.2024.128089. Online publication date: Sep-2024.
• (2024) A General Purpose Device for Interaction with LLMs. Proceedings of the Future Technologies Conference (FTC) 2024, Volume 2, 613-626. DOI: 10.1007/978-3-031-73122-8_40. Online publication date: 5-Nov-2024.
