DOI: 10.1145/3437984.3458837
research-article

Interference-Aware Scheduling for Inference Serving

Published: 26 April 2021

ABSTRACT

Machine learning inference applications have proliferated through diverse domains such as healthcare, security, and analytics. Recent work has proposed inference serving systems for improving the deployment and scalability of models. To improve resource utilization, multiple models can be co-located on the same backend machine. However, co-location can cause latency degradation due to interference and can subsequently violate latency requirements. Although interference-aware schedulers for general workloads have been introduced, they do not scale appropriately to heterogeneous inference serving systems where the number of co-location configurations grows exponentially with the number of models and machine types.

This paper proposes an interference-aware scheduler for heterogeneous inference serving systems, reducing the latency degradation from co-location interference. We characterize the challenges in predicting the impact of co-location interference on inference latency (e.g., varying latency degradation across machine types), and identify properties of models and hardware that should be considered during scheduling. We then propose a unified prediction model that estimates an inference model's latency degradation during co-location, and develop an interference-aware scheduler that leverages this predictor. Our preliminary results show that our interference-aware scheduler achieves 2× lower latency degradation than a commonly used least-loaded scheduler. We also discuss future research directions for interference-aware schedulers for inference serving systems.
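The abstract describes a scheduler that, instead of placing a model on the least-loaded machine, uses a learned predictor to estimate the latency degradation the model would suffer from co-location on each candidate machine. The idea can be sketched as follows; this is an illustrative toy, not the paper's implementation, and the predictor here is a stand-in heuristic (the names `Machine`, `predict_degradation`, and `schedule` are assumptions for illustration).

```python
from dataclasses import dataclass, field

@dataclass
class Machine:
    name: str
    hw_type: str  # e.g., "cpu" or "gpu"; degradation varies across machine types
    colocated: list = field(default_factory=list)  # models already placed here

def predict_degradation(model: str, machine: Machine) -> float:
    """Stand-in for the paper's unified prediction model: estimates the
    multiplicative latency slowdown `model` would suffer if co-located with
    the models already on `machine`. Here we fake it with a simple
    load-times-hardware heuristic; the real predictor is learned from
    model and hardware properties."""
    hw_factor = {"cpu": 1.5, "gpu": 1.2}.get(machine.hw_type, 2.0)
    return 1.0 + 0.1 * hw_factor * len(machine.colocated)

def schedule(model: str, machines: list) -> Machine:
    """Interference-aware policy: place `model` on the machine with the
    lowest predicted latency degradation (ties broken by list order),
    rather than simply on the least-loaded machine."""
    best = min(machines, key=lambda m: predict_degradation(model, m))
    best.colocated.append(model)
    return best

machines = [Machine("m0", "gpu", ["resnet50"]),
            Machine("m1", "cpu", []),
            Machine("m2", "gpu", [])]
chosen = schedule("mobilenet", machines)
print(chosen.name)  # an empty machine wins: no co-location interference
```

A least-loaded scheduler would make the same choice here, but the two diverge once every machine is occupied: the interference-aware policy can prefer a more-loaded machine whose resident models interfere less with the incoming one.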


Published in

EuroMLSys '21: Proceedings of the 1st Workshop on Machine Learning and Systems
April 2021, 130 pages
ISBN: 9781450382984
DOI: 10.1145/3437984

Copyright © 2021 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

EuroMLSys '21 paper acceptance rate: 18 of 26 submissions, 69%.

