ABSTRACT
Model serving systems play a critical role in multiplexing machine learning inference jobs across shared GPU infrastructure. Traditionally, these systems have sat at a high level of abstraction: they receive jobs from clients through a narrow API and rely on black-box GPU scheduling mechanisms when dispatching them. Fundamental limitations of the built-in GPU hardware scheduler, in particular, can lead to inefficiency when executing concurrent jobs, and the current level of abstraction incurs system overheads that are likewise most significant when the GPU is heavily shared.
In this paper, we argue for co-designing the model compiler, local clients, and the scheduler to bypass the built-in GPU scheduler and enable software control of kernel execution order. Doing so permits the use of arbitrary scheduling algorithms and reduces system overheads throughout the critical path of inference.
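To illustrate the idea of software-controlled kernel execution order, the following is a minimal, hypothetical sketch (not the paper's implementation): a CPU-side dispatcher holds per-job kernel queues and selects the next kernel to launch using a pluggable policy, here least-remaining-work-first, rather than leaving interleaving to the GPU's FIFO hardware scheduler. The `Job` class, durations, and policy are illustrative assumptions.

```python
from collections import deque

class Job:
    """A hypothetical inference job: a sequence of kernels,
    each represented only by its estimated duration (ms)."""
    def __init__(self, name, kernel_costs):
        self.name = name
        self.kernels = deque(kernel_costs)

    def remaining(self):
        # Total estimated work left for this job.
        return sum(self.kernels)

def dispatch_order(jobs):
    """Simulate a software scheduler that, at each step, launches the
    next kernel of the job with the least remaining work (an SRPT-like
    policy). Any policy could be substituted here, which is the point
    of bypassing the black-box hardware scheduler."""
    order = []
    while any(j.kernels for j in jobs):
        job = min((j for j in jobs if j.kernels),
                  key=lambda j: j.remaining())
        order.append((job.name, job.kernels.popleft()))
    return order

# A short job finishes ahead of a long one instead of being
# interleaved at the hardware scheduler's discretion.
jobs = [Job("resnet", [5, 5, 5]), Job("mobilenet", [1, 1])]
print(dispatch_order(jobs))
```

Swapping the `min(...)` selection for, say, deficit round-robin or an earliest-deadline policy changes scheduling behavior without touching the jobs themselves, which is the flexibility the abstract argues for.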
Paella: Low-latency Model Serving with Software-defined GPU Scheduling