DOI: 10.1145/3205289.3205311

Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs

Published: 12 June 2018

Abstract

Graphics processing units (GPUs) feature an increasing number of streaming multiprocessors (SMs) with each successive generation. At the same time, GPUs are increasingly adopted in cloud services and data centers to accelerate general-purpose workloads. Running multiple applications on a GPU in such environments requires effective multitasking support. Spatial multitasking, in which independent applications co-execute on different sets of SMs, is a promising way to share GPU resources. Unfortunately, how to partition SMs effectively remains an open problem.
In this paper, we observe that, compared to widely used even partitioning, dynamic SM partitioning based on the characteristics of the co-executing applications can significantly improve performance and power efficiency. Unfortunately, finding an effective SM partition is challenging because the number of possible combinations increases exponentially with the number of SMs and co-executing applications. Through offline analysis, we find that first classifying workloads and then searching for an effective SM partition based on the workload characteristics can significantly reduce the search space, making dynamic SM partitioning tractable.
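To give a feel for how quickly the space of SM partitions grows, the sketch below counts and enumerates allocations under two simplifying assumptions that are not taken from the paper: SMs are interchangeable, and every co-executing application receives at least one SM. The function names and the 80-SM example are illustrative only, and the paper's own way of counting combinations may differ.

```python
from itertools import combinations
from math import comb
from typing import Iterator, Tuple


def count_partitions(num_sms: int, num_apps: int) -> int:
    """Number of ways to give each app at least one SM (stars and bars)."""
    if num_apps < 1 or num_sms < num_apps:
        return 0
    return comb(num_sms - 1, num_apps - 1)


def enumerate_partitions(num_sms: int, num_apps: int) -> Iterator[Tuple[int, ...]]:
    """Yield every allocation (sms_app0, sms_app1, ...) that sums to num_sms."""
    if num_apps < 1 or num_sms < num_apps:
        return
    # Choose num_apps - 1 cut points among the num_sms - 1 gaps between SMs.
    for cuts in combinations(range(1, num_sms), num_apps - 1):
        bounds = (0, *cuts, num_sms)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


if __name__ == "__main__":
    # Hypothetical GPU with 80 SMs shared by 2, 3, or 4 applications.
    for apps in (2, 3, 4):
        print(apps, "apps on 80 SMs:", count_partitions(80, apps), "partitions")
```

Even under these assumptions, evaluating every candidate at run time becomes impractical as applications are added, which is the motivation for pruning the search by workload class.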
Based on these insights, we propose Classification-Driven search (CD-search) for low-overhead dynamic SM partitioning in multitasking GPUs. CD-search first classifies workloads using a novel off-SM bandwidth model, after which it enters the performance mode or power mode depending on the workload's characteristics. Both modes follow a specific search strategy to quickly determine the optimum SM partition. Our evaluation shows that CD-search improves system throughput by 10.4% on average (and up to 62.9%) over even partitioning for workloads that are classified for the performance mode. For workloads classified for the power mode, CD-search reduces power consumption by 25% on average (and up to 41.2%). CD-search incurs limited runtime overhead.
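The comparison against even partitioning can be illustrated with a toy example. The sketch below is not CD-search: it exhaustively tries every split of SMs between two applications and scores each split by system throughput, assumed here to be the sum of each application's performance normalized to its run alone on the whole GPU (a common multiprogram metric; whether the paper uses exactly this definition is an assumption). The performance models `compute_bound` and `bandwidth_bound` and the 16-SM GPU are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def system_throughput(shared: Sequence[float], alone: Sequence[float]) -> float:
    """Sum of per-application performance normalized to isolated execution."""
    return sum(s / a for s, a in zip(shared, alone))


def best_two_app_partition(
    total_sms: int,
    perf_a: Callable[[int], float],
    perf_b: Callable[[int], float],
) -> Tuple[int, float]:
    """Try every split; return (SMs given to app A, best system throughput)."""
    alone = [perf_a(total_sms), perf_b(total_sms)]
    best_sms_a, best_stp = 0, float("-inf")
    for sms_a in range(1, total_sms):
        sms_b = total_sms - sms_a
        stp = system_throughput([perf_a(sms_a), perf_b(sms_b)], alone)
        if stp > best_stp:
            best_sms_a, best_stp = sms_a, stp
    return best_sms_a, best_stp


def compute_bound(sms: int) -> float:
    """Toy model: performance scales linearly with the number of SMs."""
    return float(sms)


def bandwidth_bound(sms: int) -> float:
    """Toy model: performance saturates once 4 SMs exhaust memory bandwidth."""
    return float(min(sms, 4))


if __name__ == "__main__":
    total = 16
    alone = [compute_bound(total), bandwidth_bound(total)]
    even = system_throughput([compute_bound(8), bandwidth_bound(8)], alone)
    print("even split STP:", even)
    print("best split:", best_two_app_partition(total, compute_bound, bandwidth_bound))
```

In this toy setting the search gives 12 SMs to the compute-bound application and scores 1.75, against 1.5 for the even 8/8 split. Real workloads would need measured or modeled performance in place of the two toy functions, and a pruned search in place of the exhaustive loop.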

    Published In

    ICS '18: Proceedings of the 2018 International Conference on Supercomputing
    June 2018
    407 pages
    ISBN:9781450357838
    DOI:10.1145/3205289
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. SM partitioning
    3. multitasking

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICS '18

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%
