DOI: 10.1145/3603269.3610856

Poster: PipeLLM: Pipeline LLM Inference on Heterogeneous Devices with Sequence Slicing

Published: 01 September 2023

Abstract

The rise of Large Language Models (LLMs) has fostered innovative application requirements. Locally deployed LLMs allow micro-enterprises to mitigate issues such as privacy infringement and sluggish response times, but such deployments are hampered by the limited computing capability and memory of the devices at hand. We introduce PipeLLM, which partitions the model across devices commensurate with their computing capabilities and enables parallel execution of layers by slicing the input sequence along the token dimension. PipeLLM demonstrates the potential to accelerate LLM inference on heterogeneous devices, offering a practical solution for deploying LLMs on micro-enterprise hardware.
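To make the abstract's two mechanisms concrete, the sketch below illustrates proportional layer allocation and token-dimension sequence slicing. It is a minimal illustration under assumed interfaces, not the authors' implementation: the names allocate_layers, pipeline_infer, device_flops, and slice_len are hypothetical, and the per-device FLOPS figures are made up.

```python
# Minimal sketch (not the paper's code) of: (1) assigning contiguous
# layer blocks to devices in proportion to their compute capability,
# and (2) pipelining inference by slicing the input sequence along
# the token dimension.

def allocate_layers(num_layers, device_flops):
    """Split num_layers into contiguous blocks sized proportionally
    to each device's estimated throughput (hypothetical FLOPS)."""
    total = sum(device_flops)
    sizes = [round(num_layers * f / total) for f in device_flops]
    sizes[-1] = num_layers - sum(sizes[:-1])  # absorb rounding drift
    blocks, start = [], 0
    for size in sizes:
        blocks.append(range(start, start + size))
        start += size
    return blocks

def pipeline_infer(stages, tokens, slice_len):
    """Run token slices through the pipeline stages. Shown sequentially
    for clarity; on real hardware, slice i+1 enters stage 0 as soon as
    slice i moves on to stage 1, so stages execute in parallel. A real
    causal LLM must also forward the KV cache of earlier slices so
    later tokens can attend to them."""
    outputs = []
    for start in range(0, len(tokens), slice_len):
        hidden = tokens[start:start + slice_len]
        for stage in stages:  # each stage would live on its own device
            hidden = stage(hidden)
        outputs.extend(hidden)
    return outputs

if __name__ == "__main__":
    # Toy example: a fast device (30 TFLOPS) and a slow one (10 TFLOPS)
    # share a 32-layer model 24/8.
    print(allocate_layers(32, [30e12, 10e12]))  # [range(0, 24), range(24, 32)]
    # Two toy stages that just transform values, fed 4-token slices.
    stages = [lambda xs: [x * 2 for x in xs], lambda xs: [x + 1 for x in xs]]
    print(pipeline_infer(stages, list(range(8)), slice_len=4))
```

The point of slicing along the token dimension is utilization: every pipeline stage stays busy on a long prompt, whereas pipelining whole sequences would leave downstream devices idle until the first device finishes its layers.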




      Published In

      ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
      September 2023
      1217 pages
ISBN: 9798400702365
DOI: 10.1145/3603269
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. LLM inference acceleration
      2. pipeline inference
      3. model deployment

      Qualifiers

      • Poster


      Conference

      ACM SIGCOMM '23
      Sponsor:
      ACM SIGCOMM '23: ACM SIGCOMM 2023 Conference
      September 10, 2023
New York, NY, USA

      Acceptance Rates

      Overall Acceptance Rate 462 of 3,389 submissions, 14%


      Article Metrics

• Downloads (last 12 months): 693
• Downloads (last 6 weeks): 36
      Reflects downloads up to 01 Mar 2025


      Cited By

• (2024) Proposal of User Interface Based on Heavy User Usage Analysis in LLM Service. Archives of Design Research 37(4), 287-313. DOI: 10.15187/adr.2024.08.37.4.287. Online publication date: 31-Aug-2024.
• (2024) GestureGPT: Toward Zero-Shot Free-Form Hand Gesture Understanding with Large Language Model Agents. Proceedings of the ACM on Human-Computer Interaction 8(ISS), 462-499. DOI: 10.1145/3698145. Online publication date: 24-Oct-2024.
• (2024) Poster: Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks. Proceedings of the ACM SIGCOMM 2024 Conference: Posters and Demos, 60-62. DOI: 10.1145/3672202.3673744. Online publication date: 4-Aug-2024.
• (2024) Resource Allocation for Stable LLM Training in Mobile Edge Computing. Proceedings of the Twenty-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 81-90. DOI: 10.1145/3641512.3686358. Online publication date: 14-Oct-2024.
• (2024) SLoB: Suboptimal Load Balancing Scheduling in Local Heterogeneous GPU Clusters for Large Language Model Inference. IEEE Transactions on Computational Social Systems 11(6), 7941-7951. DOI: 10.1109/TCSS.2024.3423749. Online publication date: Dec-2024.
• (2024) LLM-Based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness. IEEE Open Journal of the Communications Society 5, 5799-5856. DOI: 10.1109/OJCOMS.2024.3456549. Online publication date: 2024.
• (2024) CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices. 2024 IEEE/CIC International Conference on Communications in China (ICCC), 185-190. DOI: 10.1109/ICCC62479.2024.10681712. Online publication date: 7-Aug-2024.
• (2024) Advanced Deep Learning Models for 6G: Overview, Opportunities, and Challenges. IEEE Access 12, 133245-133314. DOI: 10.1109/ACCESS.2024.3418900. Online publication date: 2024.
• (2024) BC4LLM: A perspective of trusted artificial intelligence when blockchain meets large language models. Neurocomputing 599, 128089. DOI: 10.1016/j.neucom.2024.128089. Online publication date: Sep-2024.
• (2024) A General Purpose Device for Interaction with LLMs. Proceedings of the Future Technologies Conference (FTC) 2024, Volume 2, 613-626. DOI: 10.1007/978-3-031-73122-8_40. Online publication date: 5-Nov-2024.
