Abstract
A modern GPU integrates tens of streaming multiprocessors (SMs) on a single chip. When deployed in data centers, GPUs often suffer from under-utilization because they are reserved for exclusive access, which motivates multitasking (i.e., co-running applications) to reduce the total cost of ownership. However, latency-critical applications may then experience too much interference to meet their Quality-of-Service (QoS) targets. In this paper, we propose a software system, FLARE, that spatially shares commodity GPUs between latency-critical applications and best-effort applications to enforce QoS while maximizing overall throughput. By transforming the kernels of best-effort applications, FLARE enables both SM partitioning and thread-block partitioning within an SM for co-running applications. It combines a microbenchmark-guided static configuration search with an online dynamic search to locate an optimal (or near-optimal) resource-partitioning strategy. Evaluated on 11 benchmarks and 2 real-world applications, FLARE improves hardware utilization by 1.39X on average compared to a preemption-based approach.
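FLARE's exact kernel transformation is detailed in the paper body; purely as a hedged sketch of the general SM-partitioning technique the abstract alludes to (an SM-id check that retires thread blocks landing outside the assigned SM set, combined with a global work queue so no work is lost), a transformed best-effort kernel might look as follows. All names (`bestEffortKernel`, `sm_limit`, `next_chunk`) are hypothetical illustrations, not FLARE's API.

```cuda
#include <cstdint>

// Read the hardware SM id of the calling thread block (NVIDIA-specific PTX).
__device__ uint32_t smid() {
    uint32_t id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Sketch of a transformed best-effort kernel: blocks scheduled onto SMs
// outside the partition [0, sm_limit) retire immediately, leaving those SMs
// to the latency-critical co-runner. Because work is claimed from a global
// queue (next_chunk) rather than tied to blockIdx, retired blocks drop no work.
__global__ void bestEffortKernel(const float* in, float* out, int n,
                                 int chunk_size, int* next_chunk,
                                 uint32_t sm_limit) {
    if (smid() >= sm_limit) return;   // outside our SM partition: yield the SM
    __shared__ int chunk;
    for (;;) {
        if (threadIdx.x == 0) chunk = atomicAdd(next_chunk, 1);
        __syncthreads();              // all threads see the claimed chunk
        int base = chunk * chunk_size;
        if (base >= n) break;         // queue drained: every thread exits
        for (int i = base + threadIdx.x; i < min(base + chunk_size, n);
             i += blockDim.x)
            out[i] = in[i] * 2.0f;    // placeholder best-effort work
        __syncthreads();              // finish the chunk before reusing `chunk`
    }
}
```

Thread-block partitioning within an SM would additionally cap how many resident blocks the best-effort kernel may occupy per SM; the sketch above shows only the coarser SM-level split.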
W. Han and D. Mawhirter—Equal contribution.
Acknowledgement
We would like to thank Akihiro Hayashi (our shepherd) and the anonymous reviewers for their constructive comments. This project was supported in part by NSF grant CCF-1823005 and an NSF CAREER Award (CNS-1750760).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Han, W., Mawhirter, D., Wu, B., Ma, L., Tian, C. (2021). FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization. In: Pande, S., Sarkar, V. (eds.) Languages and Compilers for Parallel Computing. LCPC 2019. Lecture Notes in Computer Science, vol. 11998. Springer, Cham. https://doi.org/10.1007/978-3-030-72789-5_3
Print ISBN: 978-3-030-72788-8
Online ISBN: 978-3-030-72789-5