DOI: 10.1145/3489517.3530510

PARIS and ELSA: an elastic scheduling algorithm for reconfigurable multi-GPU inference servers

Published: 23 August 2022

Abstract

Providing low latency to end users while maximizing server utilization and system throughput is crucial for cloud ML servers. NVIDIA's recently announced Ampere GPU architecture provides the ability to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions". This feature gives cloud ML service providers the flexibility to use a reconfigurable GPU not only for large-batch training but also for small-batch inference, with the potential to achieve high resource utilization. We study this emerging reconfigurable GPU architecture to develop a high-performance multi-GPU ML inference server, presenting a sophisticated partitioning algorithm for reconfigurable GPUs (PARIS) combined with an elastic scheduling algorithm (ELSA) tailored to our heterogeneously partitioned GPU server.
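
To make the scheduling idea concrete, here is a minimal Python sketch of an elastic, SLO-aware dispatch policy over a heterogeneously partitioned (MIG-style) GPU. The partition sizes, the latency table, and the greedy "smallest partition that still meets the SLO" policy are illustrative assumptions of ours, not the paper's actual PARIS/ELSA algorithms.

from dataclasses import dataclass

# Profiled per-request latency (ms) by partition size. A real server would
# measure these per model and batch size; the numbers here are made up.
LATENCY_MS = {1: 12.0, 2: 7.0, 4: 4.0}

@dataclass
class Partition:
    slices: int            # partition size in GPU slices (MIG-style 1g/2g/4g)
    free_at: float = 0.0   # time (ms) at which this partition becomes idle

def dispatch(arrival_ms, partitions, slo_ms):
    """Send a request to the smallest partition that can still finish within
    the SLO; if none can, fall back to whichever finishes earliest."""
    def finish(p):
        return max(p.free_at, arrival_ms) + LATENCY_MS[p.slices]
    within_slo = [p for p in partitions if finish(p) <= arrival_ms + slo_ms]
    best = min(within_slo, key=lambda p: p.slices) if within_slo \
        else min(partitions, key=finish)
    best.free_at = finish(best)
    return best

# Hypothetical configuration: a 7-slice GPU reconfigured into one 4-slice
# partition and three 1-slice partitions.
parts = [Partition(4), Partition(1), Partition(1), Partition(1)]
for t in (0.0, 1.0, 2.0, 3.0):
    p = dispatch(t, parts, slo_ms=15.0)
    print(f"request@{t:.0f}ms -> {p.slices}-slice partition, done at {p.free_at:.0f}ms")

Under this toy policy, light load is absorbed by the small partitions and the large partition is used only when the small ones would violate the SLO; a production scheduler would also re-partition the GPU as the load mix shifts, which is the "elastic" behavior the paper targets.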



Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022, 1462 pages
ISBN: 9781450391429
DOI: 10.1145/3489517

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Research-article

Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10-14, 2022
San Francisco, California, USA

Acceptance Rates

Overall acceptance rate: 1,770 of 5,499 submissions (32%)


Bibliometrics

Article Metrics

  • Downloads (last 12 months): 173
  • Downloads (last 6 weeks): 8

Reflects downloads up to 05 Mar 2025


Cited By

  • (2024) USHER. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, 947-964. DOI: 10.5555/3691938.3691989
  • (2024) Carat: Unlocking Value-Level Parallelism for Multiplier-Free GEMMs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 167-184. DOI: 10.1145/3620665.3640364
  • (2024) ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG. IEEE Transactions on Parallel and Distributed Systems, 35(10), 1708-1720. DOI: 10.1109/TPDS.2024.3431189
  • (2024) InSS: An Intelligent Scheduling Orchestrator for Multi-GPU Inference With Spatio-Temporal Sharing. IEEE Transactions on Parallel and Distributed Systems, 35(10), 1735-1748. DOI: 10.1109/TPDS.2024.3430063
  • (2024) ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 1-14. DOI: 10.1109/SC41406.2024.00048
  • (2024) STAR: Sub-Entry Sharing-Aware TLB for Multi-Instance GPU. 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 309-323. DOI: 10.1109/MICRO61859.2024.00031
  • (2024) ESEN: Efficient GPU sharing of Ensemble Neural Networks. Neurocomputing, 599, 128030. DOI: 10.1016/j.neucom.2024.128030
  • (2023) WattWiser: Power & Resource-Efficient Scheduling for Multi-Model Multi-GPU Inference Servers. Proceedings of the 14th International Green and Sustainable Computing Conference, 39-44. DOI: 10.1145/3634769.3634807
  • (2023) AdaInf: Data Drift Adaptive Scheduling for Accurate and SLO-guaranteed Multiple-Model Inference Serving at Edge Servers. Proceedings of the ACM SIGCOMM 2023 Conference, 473-485. DOI: 10.1145/3603269.3604830
  • (2023) KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 624-637. DOI: 10.1109/HPCA56546.2023.10071121
