DOI: 10.1145/3600006.3613163

Paella: Low-latency Model Serving with Software-defined GPU Scheduling

Published: 23 October 2023

ABSTRACT

Model serving systems play a critical role in multiplexing machine learning inference jobs across shared GPU infrastructure. These systems have traditionally sat at a high level of abstraction---receiving jobs from clients through a narrow API and relying on black-box GPU scheduling mechanisms when dispatching them. Fundamental limitations in the built-in GPU hardware scheduler, in particular, can lead to inefficiency when executing concurrent jobs. The current abstraction level also incurs system overheads that are similarly most significant when the GPU is heavily shared.

In this paper, we argue for co-designing the model compiler, local clients, and the scheduler to bypass the built-in GPU scheduler and enable software control of kernel execution order. Doing so enables the use of arbitrary scheduling algorithms and reduces system overheads throughout the critical path of inference.
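To make the idea concrete, the following is a minimal, hypothetical sketch (a Python simulation, not Paella's actual implementation) of what software control of kernel execution order enables. Instead of handing kernels to the GPU's black-box hardware scheduler in submission order, a user-space dispatcher holds them in its own queue and releases them one at a time under a pluggable policy; the `SoftwareScheduler` class, job names, and latency estimates below are all illustrative.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Kernel:
    # Priority key computed by the software scheduler; a hardware
    # scheduler would instead dispatch roughly in submission order.
    priority: float
    job_id: str = field(compare=False)
    est_ms: float = field(compare=False)

class SoftwareScheduler:
    """Hold kernels in software and dispatch them under a pluggable policy."""

    def __init__(self, policy):
        self.policy = policy      # (job_id, est_ms, remaining_ms) -> priority
        self.remaining = {}       # estimated remaining work per job
        self.heap = []

    def submit(self, job_id, est_ms):
        self.remaining[job_id] = self.remaining.get(job_id, 0.0) + est_ms
        prio = self.policy(job_id, est_ms, self.remaining[job_id])
        heapq.heappush(self.heap, Kernel(prio, job_id, est_ms))

    def dispatch(self):
        # In a real system this step would launch the kernel on the GPU;
        # here we just pop the next kernel chosen by the policy.
        k = heapq.heappop(self.heap)
        self.remaining[k.job_id] -= k.est_ms
        return k.job_id

# Example policy: shortest-remaining-processing-time, which favors
# short inference jobs and can cut tail latency under heavy sharing.
srpt = lambda job_id, est_ms, remaining_ms: remaining_ms

sched = SoftwareScheduler(srpt)
sched.submit("bert-large", 40.0)   # long job enqueued first
sched.submit("resnet-18", 2.0)     # short job arrives later
print(sched.dispatch())            # -> resnet-18 (jumps the queue)
```

Because the policy is an ordinary function, any scheduling algorithm (SRPT, fair queueing, deadlines) can be swapped in, which is exactly the flexibility a fixed hardware scheduler cannot offer.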


Published in

SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles
October 2023, 802 pages
ISBN: 9798400702297
DOI: 10.1145/3600006

                Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



Qualifiers: research-article

Acceptance Rates

SOSP '23 paper acceptance rate: 43 of 232 submissions (19%). Overall acceptance rate: 131 of 716 submissions (18%).

