A compiler-based approach for GPGPU performance calibration using TLP modulation (WIP paper)

Published: 23 June 2019
DOI: 10.1145/3316482.3326343

Abstract

Modern GPUs are the most successful accelerators, providing outstanding performance gains through the CUDA and OpenCL programming models. For maximum performance, programmers typically try to maximize the number of thread blocks in their programs, and GPUs in turn attempt to allocate as many thread blocks as possible to their GPU cores. However, many recent studies have pointed out that simply allocating the maximum number of thread blocks to GPU cores does not always guarantee the best performance, and that identifying the proper number of thread blocks per GPU core is a major challenge. Despite these studies, most existing architectural techniques cannot be directly applied to current GPU hardware, and the optimal number of thread blocks can vary significantly depending on the target GPU and application characteristics. To solve these problems, this study proposes a just-in-time thread block number adjustment system, referred to as the CTA-Limiter, which uses CUDA binary modification on top of the LLVM compiler framework to dynamically maximize GPU performance on real GPUs without reprogramming. The framework gradually reduces the number of concurrent thread blocks of target CUDA workloads using extra shared memory allocation, and compares the execution time against the previous version to automatically identify the optimal number of co-running thread blocks per GPU core. The results show meaningful performance improvements, averaging 30%, 40%, and 44% on the GTX 960, GTX 1050, and GTX 1080 Ti, respectively.
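
The shared-memory-based TLP modulation idea can be illustrated with a minimal stand-alone CUDA sketch. This is not the authors' CTA-Limiter, which instruments CUDA binaries through LLVM; the kernel name vec_add, the padding sizes, and the timing loop below are illustrative assumptions. Reserving extra, unused dynamic shared memory per block reduces how many thread blocks can co-reside on each SM, and timing the kernel at each setting reveals the fastest TLP level.

// Sketch: limit co-resident thread blocks per SM via unused dynamic shared memory,
// then time each setting to find the best one. All names here are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    extern __shared__ char pad[];   // never accessed; its size alone lowers occupancy
    (void)pad;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    // Larger padding -> fewer resident thread blocks per SM (lower TLP).
    for (size_t padBytes = 0; padBytes <= 32 * 1024; padBytes += 8 * 1024) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, vec_add,
                                                      threads, padBytes);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        vec_add<<<blocks, threads, padBytes>>>(a, b, c, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("pad = %5zu B, blocks/SM = %d, time = %.3f ms\n",
               padBytes, blocksPerSM, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Per the abstract, the CTA-Limiter automates exactly this kind of search: it injects the extra shared memory allocation into existing CUDA binaries and compares execution times across settings, so the application does not have to be rewritten.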

Cited By

  • (2023) Tailoring CUTLASS GEMM using Supervised Learning. 2023 IEEE 41st International Conference on Computer Design (ICCD), pp. 465-474. DOI: 10.1109/ICCD58817.2023.00077. Online publication date: 6-Nov-2023.
  • (2020) Optimization of GPU-based Sparse Matrix Multiplication for Large Sparse Networks. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 925-936. DOI: 10.1109/ICDE48307.2020.00085. Online publication date: Apr-2020.

      Published In

      LCTES 2019: Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems
      June 2019
      218 pages
      ISBN: 9781450367240
      DOI: 10.1145/3316482

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. GPU
      2. LLVM
      3. Performance Calibration
      4. Code Instrumentation

      Qualifiers

      • Research-article

      Conference

      LCTES '19

      Acceptance Rates

      Overall Acceptance Rate 116 of 438 submissions, 26%
