Abstract
A modern GPU integrates tens of streaming multiprocessors (SMs) on a single chip. When deployed in data centers, GPUs often suffer from under-utilization because they are reserved for exclusive access, which motivates multitasking (i.e., co-running applications) to reduce the total cost of ownership. However, latency-critical applications may then experience too much interference to meet their Quality-of-Service (QoS) targets. In this paper, we propose a software system, FLARE, that spatially shares commodity GPUs between latency-critical applications and best-effort applications to enforce QoS while maximizing overall throughput. By transforming the kernels of best-effort applications, FLARE enables both SM partitioning and thread-block partitioning within an SM for co-running applications. It combines a microbenchmark-guided static configuration search with an online dynamic search to locate an optimal (or near-optimal) resource-partitioning strategy. Evaluated on 11 benchmarks and 2 real-world applications, FLARE improves hardware utilization by 1.39X on average compared to a preemption-based approach.
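FLARE's exact kernel transformation is detailed in the paper body; purely as a hedged sketch of the general SM-partitioning technique the abstract alludes to (an SM-id check that retires thread blocks landing outside the assigned SM set, combined with a global work queue so no work is lost), a transformed best-effort kernel might look as follows. All names (`bestEffortKernel`, `sm_limit`, `next_chunk`) are hypothetical illustrations, not FLARE's API.

```cuda
#include <cstdint>

// Read the hardware SM id of the calling thread block (NVIDIA-specific PTX).
__device__ uint32_t smid() {
    uint32_t id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Sketch of a transformed best-effort kernel: blocks scheduled onto SMs
// outside the partition [0, sm_limit) retire immediately, leaving those SMs
// to the latency-critical co-runner. Because work is claimed from a global
// queue (next_chunk) rather than tied to blockIdx, retired blocks drop no work.
__global__ void bestEffortKernel(const float* in, float* out, int n,
                                 int chunk_size, int* next_chunk,
                                 uint32_t sm_limit) {
    if (smid() >= sm_limit) return;   // outside our SM partition: yield the SM
    __shared__ int chunk;
    for (;;) {
        if (threadIdx.x == 0) chunk = atomicAdd(next_chunk, 1);
        __syncthreads();              // all threads see the claimed chunk
        int base = chunk * chunk_size;
        if (base >= n) break;         // queue drained: every thread exits
        for (int i = base + threadIdx.x; i < min(base + chunk_size, n);
             i += blockDim.x)
            out[i] = in[i] * 2.0f;    // placeholder best-effort work
        __syncthreads();              // finish the chunk before reusing `chunk`
    }
}
```

Thread-block partitioning within an SM would additionally cap how many resident blocks the best-effort kernel may occupy per SM; the sketch above shows only the coarser SM-level split.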
W. Han and D. Mawhirter—Equal contribution.
Acknowledgement
We would like to thank Akihiro Hayashi (our shepherd) and the anonymous reviewers for their constructive comments. This project was supported in part by NSF grant CCF-1823005 and an NSF CAREER Award (CNS-1750760).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Han, W., Mawhirter, D., Wu, B., Ma, L., Tian, C. (2021). FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization. In: Pande, S., Sarkar, V. (eds.) Languages and Compilers for Parallel Computing. LCPC 2019. Lecture Notes in Computer Science, vol. 11998. Springer, Cham. https://doi.org/10.1007/978-3-030-72789-5_3
Print ISBN: 978-3-030-72788-8
Online ISBN: 978-3-030-72789-5