A compiler-based approach for GPGPU performance calibration using TLP modulation (WIP paper)

Published: 23 June 2019
DOI: 10.1145/3316482.3326343

Abstract

Modern GPUs are the most successful accelerators, providing outstanding performance gains through the CUDA and OpenCL programming models. For maximum performance, programmers typically try to maximize the number of thread blocks in their programs, and GPUs in turn attempt to allocate as many thread blocks as possible to their GPU cores. However, many recent studies have pointed out that simply allocating the maximum number of thread blocks to GPU cores does not always guarantee the best performance, and that identifying the proper number of thread blocks per GPU core is a major challenge. Despite these studies, most existing architectural techniques cannot be directly applied to current GPU hardware, and the optimal number of thread blocks can vary significantly depending on the target GPU and application characteristics. To solve these problems, this study proposes a just-in-time thread block number adjustment system, referred to as the CTA-Limiter, which uses CUDA binary modification on top of the LLVM compiler framework to dynamically maximize GPU performance on real GPUs without reprogramming. The framework gradually reduces the number of concurrent thread blocks of target CUDA workloads using extra shared memory allocation, and compares the execution time against the previous version to automatically identify the optimal number of co-running thread blocks per GPU core. The results show meaningful performance improvements, averaging 30%, 40%, and 44% on the GTX 960, GTX 1050, and GTX 1080 Ti, respectively.
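
The shared-memory-based TLP modulation idea can be illustrated with a minimal stand-alone CUDA sketch. This is not the authors' CTA-Limiter, which instruments CUDA binaries through LLVM; the kernel name vec_add, the padding sizes, and the timing loop below are illustrative assumptions. Reserving extra, unused dynamic shared memory per block reduces how many thread blocks can co-reside on each SM, and timing the kernel at each setting reveals the fastest TLP level.

// Sketch: limit co-resident thread blocks per SM via unused dynamic shared memory,
// then time each setting to find the best one. All names here are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    extern __shared__ char pad[];   // never accessed; its size alone lowers occupancy
    (void)pad;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    // Larger padding -> fewer resident thread blocks per SM (lower TLP).
    for (size_t padBytes = 0; padBytes <= 32 * 1024; padBytes += 8 * 1024) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, vec_add,
                                                      threads, padBytes);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        vec_add<<<blocks, threads, padBytes>>>(a, b, c, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("pad = %5zu B, blocks/SM = %d, time = %.3f ms\n",
               padBytes, blocksPerSM, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Per the abstract, the CTA-Limiter automates exactly this kind of search: it injects the extra shared memory allocation into existing CUDA binaries and compares execution times across settings, so the application does not have to be rewritten.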

Cited By

  • (2023) Tailoring CUTLASS GEMM using Supervised Learning. 2023 IEEE 41st International Conference on Computer Design (ICCD), pp. 465-474. DOI: 10.1109/ICCD58817.2023.00077. Online publication date: 6-Nov-2023.
  • (2020) Optimization of GPU-based Sparse Matrix Multiplication for Large Sparse Networks. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 925-936. DOI: 10.1109/ICDE48307.2020.00085. Online publication date: Apr-2020.

      Published In

      LCTES 2019: Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems
      June 2019
      218 pages
      ISBN: 9781450367240
      DOI: 10.1145/3316482

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. GPU
      2. LLVM
      3. Performance Calibration
      4. Code Instrumentation

      Qualifiers

      • Research-article

      Conference

      LCTES '19

      Acceptance Rates

      Overall Acceptance Rate 116 of 438 submissions, 26%
