
Efficient Kernel Management on GPUs

Published: 26 May 2017

Abstract

Graphics Processing Units (GPUs) have been widely adopted as accelerators for compute-intensive applications due to their tremendous computational power and high memory bandwidth. As application complexity continues to grow, each new generation of GPUs has been equipped with advanced architectural features and more resources to sustain its acceleration capability. Recent GPUs support concurrent kernel execution, which is designed to improve resource utilization by executing multiple kernels simultaneously. However, managing GPU resources for concurrent kernel execution remains a challenge. Prior works achieve only limited performance improvement because they neither optimize the thread-level parallelism (TLP) of the concurrently executing kernels nor model their resource contention.
In this article, we design an efficient kernel management framework that optimizes the performance of concurrent kernel execution on GPUs. The framework contains two key components: TLP modulation and cache bypassing. TLP modulation adjusts the TLP of the concurrently executing kernels and consists of three parts: kernel categorization, static TLP modulation, and dynamic TLP modulation. Cache bypassing mitigates cache contention by allowing only a subset of a kernel's thread blocks to access the L1 data cache. Experiments show that, compared with the default concurrent kernel execution framework, our framework improves performance by 1.51× on average and energy efficiency by 1.39× on average.
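The division of labor between the two components can be sketched in a few lines. The Python sketch below is purely illustrative — the categorization thresholds, function names, and block-fraction policy are hypothetical assumptions, not the paper's actual algorithm — but it shows the shape of the two per-kernel decisions the abstract describes: how many thread blocks to run concurrently (TLP modulation) and which blocks may use the L1 data cache (cache bypassing).

```python
# Illustrative sketch of the framework's two tuning knobs. All thresholds
# and category names are hypothetical, chosen only to make the logic concrete.

def categorize(l1_hit_rate, mem_intensity):
    """Hypothetical kernel categorization: profile-driven labels that
    decide which TLP policy a kernel receives."""
    if mem_intensity < 0.2:
        return "compute-bound"
    return "cache-sensitive" if l1_hit_rate > 0.5 else "memory-bound"

def tlp_limit(category, max_blocks_per_sm):
    """Static TLP modulation: cap the concurrent thread blocks per SM
    for kernels whose category suggests resource contention."""
    if category == "compute-bound":
        return max_blocks_per_sm               # full parallelism is safe
    if category == "cache-sensitive":
        return max(1, max_blocks_per_sm // 2)  # shrink TLP to protect the L1
    return max(1, max_blocks_per_sm // 4)      # memory-bound: throttle hardest

def uses_l1(block_id, cached_fraction, total_blocks):
    """Cache bypassing: only the first `cached_fraction` of a kernel's
    blocks access the L1 data cache; the rest bypass it."""
    return block_id < int(cached_fraction * total_blocks)
```

A dynamic TLP modulator would then adjust `cached_fraction` and the block cap at runtime as the co-running kernel mix changes; the sketch shows only the static decision points.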



Published In

ACM Transactions on Embedded Computing Systems, Volume 16, Issue 4
Special Issue on Secure and Fault-Tolerant Embedded Computing and Regular Papers
November 2017
614 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3092956

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 May 2017
Accepted: 01 March 2017
Revised: 01 January 2017
Received: 01 September 2016
Published in TECS Volume 16, Issue 4


Author Tags

  1. General purpose graphics processing unit (GPGPU)
  2. energy-efficiency
  3. kernel management

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Science Foundation China


Cited By

  • (2024) Exploration of Fine Management Mode of College Logistics Based on Digital Twin Technology. Applied Mathematics and Nonlinear Sciences 9(1). DOI: 10.2478/amns-2024-1300. Online: 3 Jun 2024.
  • (2023) Detecting Atomicity Violations in Interrupt-Driven Programs via Interruption Points Selecting and Delayed ISR-Triggering. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1153–1164. DOI: 10.1145/3611643.3616276. Online: 30 Nov 2023.
  • (2022) A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems 33(6), 1451–1463. DOI: 10.1109/TPDS.2021.3115630. Online: 1 Jun 2022.
  • (2020) Fair and cache blocking aware warp scheduling for concurrent kernel execution on GPU. Future Generation Computer Systems. DOI: 10.1016/j.future.2020.05.023. Online: May 2020.
  • (2019) Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 570–575. DOI: 10.23919/DATE.2019.8714861. Online: Mar 2019.
  • (2019) Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution. ACM Transactions on Architecture and Code Optimization 16(3), 1–27. DOI: 10.1145/3326124. Online: 17 Jun 2019.
  • (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 229–241. DOI: 10.1145/3293883.3295734. Online: 16 Feb 2019.
  • (2019) RDMKE: Applying Reuse Distance Analysis to Multiple GPU Kernel Executions. Journal of Circuits, Systems and Computers 28(14), 1950245. DOI: 10.1142/S0218126619502451. Online: 15 Mar 2019.
  • (2017) SKERD: Reuse distance analysis for simultaneous multiple GPU kernel executions. In 2017 19th International Symposium on Computer Architecture and Digital Systems (CADS), 1–6. DOI: 10.1109/CADS.2017.8310677. Online: Dec 2017.
