DOI: 10.1145/3330345.3330351

Laius: Towards latency awareness and improved utilization of spatial multitasking accelerators in datacenters

Published: 26 June 2019

Abstract

Datacenters use accelerators to provide the significant compute throughput required by emerging user-facing services. The diurnal access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization, and prior work has focused on enabling co-location on multicore processors and traditional non-preemptive accelerators. However, current accelerators are evolving towards spatial multitasking, which introduces a new set of challenges in eliminating QoS violations. To address this open problem, we explore the underlying causes of QoS violations on spatial multitasking accelerators. In response to these causes, we propose Laius, a runtime system that carefully allocates computational resources to co-located applications to maximize the throughput of batch applications while guaranteeing the required QoS of user-facing services. Our evaluation on an NVIDIA RTX 2080 Ti GPU shows that Laius improves the utilization of spatial multitasking accelerators by 20.8% while achieving the 99%-ile latency target for user-facing services.
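Spatial partitioning of the kind Laius relies on is exposed on current NVIDIA GPUs through mechanisms such as the Multi-Process Service (MPS), which lets each client process cap its share of GPU threads. The sketch below is a rough illustration only, not the paper's implementation: it launches a latency-critical service and a batch job side by side, bounding each one's compute share via the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. The binary names "./lc_server" and "./batch_job" and the 60/40 split are hypothetical.

```python
# Minimal sketch (assumption, not Laius itself): co-locating a user-facing
# service and a batch job on one GPU under NVIDIA MPS, capping each client's
# share of SM threads via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE.
# An MPS control daemon is assumed to be running already.
import os
import subprocess

def launch(cmd, thread_percentage):
    """Start a CUDA client limited to roughly `thread_percentage`% of GPU threads."""
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_percentage)
    return subprocess.Popen(cmd, env=env)

# Reserve enough compute for the latency-critical service to meet its
# 99%-ile latency target; hand the remainder to the batch application.
lc_proc = launch(["./lc_server"], thread_percentage=60)
batch_proc = launch(["./batch_job"], thread_percentage=40)

lc_proc.wait()
batch_proc.wait()
```

A fixed split like this is only a starting point; as the abstract describes, a Laius-like runtime adjusts the allocation so that the user-facing service retains just enough resources to meet its QoS target while the batch application absorbs the rest.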



Information & Contributors

Information

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019
533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. QoS
  2. improved utilization
  3. spatial multitasking

Qualifiers

  • Research-article


Conference

ICS '19

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Cited By

  • (2024) MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters. Proceedings of the 53rd International Conference on Parallel Processing, 504-513. DOI: 10.1145/3673038.3673089. Online publication date: 12-Aug-2024.
  • (2024) Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications. Proceedings of the Nineteenth European Conference on Computer Systems, 1075-1092. DOI: 10.1145/3627703.3629578. Online publication date: 22-Apr-2024.
  • (2024) D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs. IEEE Transactions on Cloud Computing 12(4), 1344-1358. DOI: 10.1109/TCC.2024.3476210. Online publication date: Oct-2024.
  • (2024) MuxFlow: efficient GPU sharing in production-level clusters with more than 10000 GPUs. Science China Information Sciences 67(12). DOI: 10.1007/s11432-024-4227-2. Online publication date: 13-Dec-2024.
  • (2024) An artificial intelligence strategy for the deployment of future microservice-based applications in 6G networks. Neural Computing and Applications 36(18), 10971-10997. DOI: 10.1007/s00521-024-09643-9. Online publication date: 28-Mar-2024.
  • (2023) Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. Proceedings of the 2023 ACM Symposium on Cloud Computing, 265-280. DOI: 10.1145/3620678.3624660. Online publication date: 30-Oct-2023.
  • (2023) OLPart: Online Learning based Resource Partitioning for Colocating Multiple Latency-Critical Jobs on Commodity Computers. Proceedings of the Eighteenth European Conference on Computer Systems, 347-364. DOI: 10.1145/3552326.3567490. Online publication date: 8-May-2023.
  • (2023) ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management. IEEE Transactions on Computers 72(5), 1473-1487. DOI: 10.1109/TC.2022.3214088. Online publication date: 1-May-2023.
  • (2022) GPUPool. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 317-332. DOI: 10.1145/3559009.3569650. Online publication date: 8-Oct-2022.
  • (2022) Slice-Tune. Proceedings of the 23rd ACM/IFIP International Middleware Conference, 228-240. DOI: 10.1145/3528535.3565247. Online publication date: 7-Nov-2022.
  • Show More Cited By
