ABSTRACT
Directive-based programming models like OpenACC provide a higher-level abstraction and a low-overhead approach for porting existing applications to GPGPUs and other heterogeneous HPC hardware. Such programming models widen the design-space exploration possible at the compiler level to exploit features specific to different architectures. We observed that traditional applications designed for latency-optimized, out-of-order, pipelined CPUs do not efficiently exploit the throughput-optimized, in-order GPU architecture. In this paper we develop a model to estimate the memory throughput of a given application, and then apply the loop interleave transformation to improve the memory bandwidth utilization of a given kernel.
We developed a heuristic to estimate the optimal loop interleave factor and implemented it in the OpenARC compiler for OpenACC. Evaluated on over 216 kernels, our approach achieves a geometric-mean speedup of 1.32×.
Our compiler optimization aims to strike the right balance among performance, portability, and productivity.
Index Terms
- Cost-driven thread coarsening for GPU kernels