
FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11998)

Abstract

A modern GPU integrates tens of streaming multiprocessors (SMs) on a single chip. When deployed in data centers, GPUs often suffer from under-utilization because they are reserved for exclusive access, which motivates multitasking (i.e., co-running applications) to reduce the total cost of ownership. However, latency-critical applications may then experience too much interference to meet their Quality-of-Service (QoS) targets. In this paper, we propose a software system, FLARE, that spatially shares commodity GPUs between latency-critical and best-effort applications to enforce QoS while maximizing overall throughput. By transforming the kernels of best-effort applications, FLARE enables both SM partitioning and thread-block partitioning within an SM for co-running applications. It uses a microbenchmark-guided static configuration search combined with an online dynamic search to locate an optimal (or near-optimal) resource-partitioning strategy. Evaluated on 11 benchmarks and 2 real-world applications, FLARE improves hardware utilization by an average of 1.39X compared to a preemption-based approach.
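The combined static-plus-dynamic search described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical: the toy latency model, the SM counts, the QoS target, and the hill-climbing policy are invented for illustration and are not FLARE's actual algorithm or interface.

```python
# Hypothetical sketch of a FLARE-style online partition search:
# start from a statically chosen SM split, then adjust the share
# given to the best-effort (BE) co-runner while the latency-critical
# (LC) application still meets its QoS target.
# The latency model below is invented for illustration only.

TOTAL_SMS = 80          # assumed SM count of the GPU (illustrative)
QOS_TARGET_MS = 10.0    # assumed latency target for the LC application

def lc_latency_ms(best_effort_sms):
    """Toy interference model: LC latency grows linearly with the
    number of SMs handed to the best-effort co-runner."""
    return 6.0 + 0.25 * best_effort_sms

def online_search(start_sms):
    """Hill-climb on the BE SM count: shrink on a QoS violation,
    grow while there is QoS slack. Returns the largest BE share
    that keeps the LC application within its target."""
    sms = start_sms
    while sms > 0 and lc_latency_ms(sms) > QOS_TARGET_MS:
        sms -= 1                      # back off: QoS violated
    while sms < TOTAL_SMS and lc_latency_ms(sms + 1) <= QOS_TARGET_MS:
        sms += 1                      # grow: QoS slack remains
    return sms

best = online_search(start_sms=40)
print(best, lc_latency_ms(best))      # prints: 16 10.0
```

Under this toy model the search converges to 16 SMs for the best-effort application from either direction, which is the point where LC latency exactly meets the 10 ms target; FLARE's real system additionally partitions thread blocks within an SM, which this sketch omits.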

W. Han and D. Mawhirter—Equal contribution.



Acknowledgement

We would like to thank Akihiro Hayashi (our shepherd) and the anonymous reviewers for their constructive comments. This project was supported in part by NSF grant CCF-1823005 and an NSF CAREER Award (CNS-1750760).

Author information

Corresponding author

Correspondence to Wei Han.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Han, W., Mawhirter, D., Wu, B., Ma, L., Tian, C. (2021). FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization. In: Pande, S., Sarkar, V. (eds) Languages and Compilers for Parallel Computing. LCPC 2019. Lecture Notes in Computer Science(), vol 11998. Springer, Cham. https://doi.org/10.1007/978-3-030-72789-5_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-72789-5_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72788-8

  • Online ISBN: 978-3-030-72789-5

  • eBook Packages: Computer Science, Computer Science (R0)
