
Efficient Kernel Management on GPUs

Published: 26 May 2017

Abstract

Graphics Processing Units (GPUs) have been widely adopted as accelerators for compute-intensive applications due to their tremendous computational power and high memory bandwidth. As application complexity continues to grow, each new generation of GPUs has been equipped with advanced architectural features and more resources to sustain its acceleration capability. Recent GPUs support concurrent kernel execution, which is designed to improve resource utilization by executing multiple kernels simultaneously. However, managing GPU resources for concurrent kernel execution remains a challenge. Prior works achieve only limited performance improvement because they neither optimize the thread-level parallelism (TLP) of the concurrently executing kernels nor model their resource contention.
In this article, we design an efficient kernel management framework that optimizes the performance of concurrent kernel execution on GPUs. The framework contains two key components: TLP modulation and cache bypassing. TLP modulation adjusts the TLP of the concurrently executing kernels and consists of three parts: kernel categorization, static TLP modulation, and dynamic TLP modulation. Cache bypassing mitigates cache contention by allowing only a subset of a kernel's thread blocks to access the L1 data cache. Experiments show that, compared with the default concurrent kernel execution framework, our framework improves performance by 1.51× on average and energy efficiency by 1.39× on average.
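The division of labor between the two components can be sketched in a few lines. The Python sketch below is purely illustrative — the categorization thresholds, function names, and block-fraction policy are hypothetical assumptions, not the paper's actual algorithm — but it shows the shape of the two per-kernel decisions the abstract describes: how many thread blocks to run concurrently (TLP modulation) and which blocks may use the L1 data cache (cache bypassing).

```python
# Illustrative sketch of the framework's two tuning knobs. All thresholds
# and category names are hypothetical, chosen only to make the logic concrete.

def categorize(l1_hit_rate, mem_intensity):
    """Hypothetical kernel categorization: profile-driven labels that
    decide which TLP policy a kernel receives."""
    if mem_intensity < 0.2:
        return "compute-bound"
    return "cache-sensitive" if l1_hit_rate > 0.5 else "memory-bound"

def tlp_limit(category, max_blocks_per_sm):
    """Static TLP modulation: cap the concurrent thread blocks per SM
    for kernels whose category suggests resource contention."""
    if category == "compute-bound":
        return max_blocks_per_sm               # full parallelism is safe
    if category == "cache-sensitive":
        return max(1, max_blocks_per_sm // 2)  # shrink TLP to protect the L1
    return max(1, max_blocks_per_sm // 4)      # memory-bound: throttle hardest

def uses_l1(block_id, cached_fraction, total_blocks):
    """Cache bypassing: only the first `cached_fraction` of a kernel's
    blocks access the L1 data cache; the rest bypass it."""
    return block_id < int(cached_fraction * total_blocks)
```

A dynamic TLP modulator would then adjust `cached_fraction` and the block cap at runtime as the co-running kernel mix changes; the sketch shows only the static decision points.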



Published In

ACM Transactions on Embedded Computing Systems, Volume 16, Issue 4
Special Issue on Secure and Fault-Tolerant Embedded Computing and Regular Papers
November 2017
614 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3092956

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 May 2017
Accepted: 01 March 2017
Revised: 01 January 2017
Received: 01 September 2016
Published in TECS Volume 16, Issue 4


Author Tags

  1. General purpose graphics processing unit (GPGPU)
  2. energy-efficiency
  3. kernel management

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Science Foundation China


Cited By

  • (2024) Exploration of Fine Management Mode of College Logistics Based on Digital Twin Technology. Applied Mathematics and Nonlinear Sciences 9(1). DOI: 10.2478/amns-2024-1300. Online: 3 Jun 2024.
  • (2023) Detecting Atomicity Violations in Interrupt-Driven Programs via Interruption Points Selecting and Delayed ISR-Triggering. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1153–1164. DOI: 10.1145/3611643.3616276. Online: 30 Nov 2023.
  • (2022) A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems 33(6), 1451–1463. DOI: 10.1109/TPDS.2021.3115630. Online: 1 Jun 2022.
  • (2020) Fair and cache blocking aware warp scheduling for concurrent kernel execution on GPU. Future Generation Computer Systems. DOI: 10.1016/j.future.2020.05.023. Online: May 2020.
  • (2019) Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 570–575. DOI: 10.23919/DATE.2019.8714861. Online: Mar 2019.
  • (2019) Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution. ACM Transactions on Architecture and Code Optimization 16(3), 1–27. DOI: 10.1145/3326124. Online: 17 Jun 2019.
  • (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 229–241. DOI: 10.1145/3293883.3295734. Online: 16 Feb 2019.
  • (2019) RDMKE: Applying Reuse Distance Analysis to Multiple GPU Kernel Executions. Journal of Circuits, Systems and Computers 28(14), 1950245. DOI: 10.1142/S0218126619502451. Online: 15 Mar 2019.
  • (2017) SKERD: Reuse distance analysis for simultaneous multiple GPU kernel executions. In 2017 19th International Symposium on Computer Architecture and Digital Systems (CADS), 1–6. DOI: 10.1109/CADS.2017.8310677. Online: Dec 2017.
