DOI: 10.1145/3489517.3530510

PARIS and ELSA: an elastic scheduling algorithm for reconfigurable multi-GPU inference servers

Published: 23 August 2022

Abstract

Providing low latency to end users while maximizing server utilization and system throughput is crucial for cloud ML servers. NVIDIA's recently announced Ampere GPU architecture provides the ability to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions". This feature gives cloud ML service providers the flexibility to use a reconfigurable GPU not only for large-batch training but also for small-batch inference, with the potential to achieve high resource utilization. We study this emerging reconfigurable GPU architecture to develop a high-performance multi-GPU ML inference server, presenting a sophisticated partitioning algorithm for reconfigurable GPUs (PARIS) combined with an elastic scheduling algorithm (ELSA) tailored to our heterogeneously partitioned GPU server.
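
To make the scheduling idea concrete, here is a minimal Python sketch of an elastic, SLO-aware dispatch policy over a heterogeneously partitioned (MIG-style) GPU. The partition sizes, the latency table, and the greedy "smallest partition that still meets the SLO" policy are illustrative assumptions of ours, not the paper's actual PARIS/ELSA algorithms.

from dataclasses import dataclass

# Profiled per-request latency (ms) by partition size. A real server would
# measure these per model and batch size; the numbers here are made up.
LATENCY_MS = {1: 12.0, 2: 7.0, 4: 4.0}

@dataclass
class Partition:
    slices: int            # partition size in GPU slices (MIG-style 1g/2g/4g)
    free_at: float = 0.0   # time (ms) at which this partition becomes idle

def dispatch(arrival_ms, partitions, slo_ms):
    """Send a request to the smallest partition that can still finish within
    the SLO; if none can, fall back to whichever finishes earliest."""
    def finish(p):
        return max(p.free_at, arrival_ms) + LATENCY_MS[p.slices]
    within_slo = [p for p in partitions if finish(p) <= arrival_ms + slo_ms]
    best = min(within_slo, key=lambda p: p.slices) if within_slo \
        else min(partitions, key=finish)
    best.free_at = finish(best)
    return best

# Hypothetical configuration: a 7-slice GPU reconfigured into one 4-slice
# partition and three 1-slice partitions.
parts = [Partition(4), Partition(1), Partition(1), Partition(1)]
for t in (0.0, 1.0, 2.0, 3.0):
    p = dispatch(t, parts, slo_ms=15.0)
    print(f"request@{t:.0f}ms -> {p.slices}-slice partition, done at {p.free_at:.0f}ms")

Under this toy policy, light load is absorbed by the small partitions and the large partition is used only when the small ones would violate the SLO; a production scheduler would also re-partition the GPU as the load mix shifts, which is the "elastic" behavior the paper targets.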



Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022, 1462 pages
ISBN: 9781450391429
DOI: 10.1145/3489517

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Research-article

Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10-14, 2022
San Francisco, California, USA

Acceptance Rates

Overall acceptance rate: 1,770 of 5,499 submissions (32%)


Bibliometrics

Article Metrics

  • Downloads (last 12 months): 173
  • Downloads (last 6 weeks): 8

Reflects downloads up to 05 Mar 2025


Cited By

  • (2024) USHER. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, 947-964. DOI: 10.5555/3691938.3691989
  • (2024) Carat: Unlocking Value-Level Parallelism for Multiplier-Free GEMMs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 167-184. DOI: 10.1145/3620665.3640364
  • (2024) ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG. IEEE Transactions on Parallel and Distributed Systems, 35(10), 1708-1720. DOI: 10.1109/TPDS.2024.3431189
  • (2024) InSS: An Intelligent Scheduling Orchestrator for Multi-GPU Inference With Spatio-Temporal Sharing. IEEE Transactions on Parallel and Distributed Systems, 35(10), 1735-1748. DOI: 10.1109/TPDS.2024.3430063
  • (2024) ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 1-14. DOI: 10.1109/SC41406.2024.00048
  • (2024) STAR: Sub-Entry Sharing-Aware TLB for Multi-Instance GPU. 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 309-323. DOI: 10.1109/MICRO61859.2024.00031
  • (2024) ESEN: Efficient GPU sharing of Ensemble Neural Networks. Neurocomputing, 599, 128030. DOI: 10.1016/j.neucom.2024.128030
  • (2023) WattWiser: Power & Resource-Efficient Scheduling for Multi-Model Multi-GPU Inference Servers. Proceedings of the 14th International Green and Sustainable Computing Conference, 39-44. DOI: 10.1145/3634769.3634807
  • (2023) AdaInf: Data Drift Adaptive Scheduling for Accurate and SLO-guaranteed Multiple-Model Inference Serving at Edge Servers. Proceedings of the ACM SIGCOMM 2023 Conference, 473-485. DOI: 10.1145/3603269.3604830
  • (2023) KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 624-637. DOI: 10.1109/HPCA56546.2023.10071121
