ABSTRACT
Model serving systems play a critical role in multiplexing machine learning inference jobs across shared GPU infrastructure. Traditionally, these systems have sat at a high level of abstraction: they receive jobs from clients through a narrow API and rely on black-box GPU scheduling mechanisms when dispatching them. Fundamental limitations of the built-in GPU hardware scheduler, in particular, can lead to inefficiency when executing concurrent jobs, and the current level of abstraction incurs system overheads that are likewise most significant when the GPU is heavily shared.
In this paper, we argue for co-designing the model compiler, local clients, and the scheduler to bypass the built-in GPU scheduler and enable software control of kernel execution order. Doing so permits the use of arbitrary scheduling algorithms and reduces system overheads throughout the critical path of inference.
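To illustrate the idea of software-controlled kernel execution order, the following is a minimal, hypothetical sketch (not the paper's implementation): a CPU-side dispatcher holds per-job kernel queues and selects the next kernel to launch using a pluggable policy, here least-remaining-work-first, rather than leaving interleaving to the GPU's FIFO hardware scheduler. The `Job` class, durations, and policy are illustrative assumptions.

```python
from collections import deque

class Job:
    """A hypothetical inference job: a sequence of kernels,
    each represented only by its estimated duration (ms)."""
    def __init__(self, name, kernel_costs):
        self.name = name
        self.kernels = deque(kernel_costs)

    def remaining(self):
        # Total estimated work left for this job.
        return sum(self.kernels)

def dispatch_order(jobs):
    """Simulate a software scheduler that, at each step, launches the
    next kernel of the job with the least remaining work (an SRPT-like
    policy). Any policy could be substituted here, which is the point
    of bypassing the black-box hardware scheduler."""
    order = []
    while any(j.kernels for j in jobs):
        job = min((j for j in jobs if j.kernels),
                  key=lambda j: j.remaining())
        order.append((job.name, job.kernels.popleft()))
    return order

# A short job finishes ahead of a long one instead of being
# interleaved at the hardware scheduler's discretion.
jobs = [Job("resnet", [5, 5, 5]), Job("mobilenet", [1, 1])]
print(dispatch_order(jobs))
```

Swapping the `min(...)` selection for, say, deficit round-robin or an earliest-deadline policy changes scheduling behavior without touching the jobs themselves, which is the flexibility the abstract argues for.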
Paella: Low-latency Model Serving with Software-defined GPU Scheduling