Research article, ICS Conference Proceedings
DOI: 10.1145/1375527.1375562

A compiler framework for optimization of affine loop nests for GPGPUs

Published: 07 June 2008

Abstract

GPUs are a class of specialized parallel architectures with tremendous computational power. NVIDIA's Compute Unified Device Architecture (CUDA) programming model facilitates programming of general-purpose applications on its GPUs. However, manual development of high-performance parallel code for GPUs remains very challenging. This paper addresses a number of issues toward the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) an approach to program transformation for efficient data access from GPU global memory, using a polyhedral compiler model of data dependence abstraction and program transformation; 2) determination of optimal padding factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. Experimental results on a number of kernels demonstrate the effectiveness of the compiler optimization approaches developed.
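The shared-memory padding idea in item 2 can be illustrated with a small sketch. The following is a minimal, hypothetical Python model (not the paper's actual algorithm), assuming the 16 shared-memory banks and 16-thread half-warp access pattern of G80-era GPUs such as the GeForce 8800: consecutive threads accessing words a fixed stride apart collide on a bank whenever the stride shares a factor with the bank count, so padding each row of a shared-memory array can make column-wise access conflict-free. The helpers `conflict_degree` and `best_padding` are illustrative names introduced here.

```python
# Hypothetical model of shared-memory bank conflicts (assumption: 16 banks
# and a 16-thread half-warp, as on G80-era GPUs like the GeForce 8800).
from math import gcd

NUM_BANKS = 16  # assumed number of shared-memory banks

def conflict_degree(stride, num_banks=NUM_BANKS):
    """Threads per bank when thread t accesses word t * stride.

    Addresses t*stride mod num_banks take num_banks // gcd(stride, num_banks)
    distinct values, so gcd(stride, num_banks) threads collide on each bank.
    A degree of 1 means the access is conflict-free.
    """
    return gcd(stride, num_banks)

def best_padding(row_len, num_banks=NUM_BANKS):
    """Smallest per-row padding p (in words) minimizing bank conflicts for
    column-wise access, whose effective stride becomes row_len + p."""
    return min(range(num_banks),
               key=lambda p: (conflict_degree(row_len + p, num_banks), p))

# A 16x16 tile accessed column-wise is fully serialized (16-way conflict);
# padding each row by one word makes the same access conflict-free.
assert conflict_degree(16) == 16
assert conflict_degree(17) == 1
assert best_padding(16) == 1
```

Under these assumptions, the classic result falls out: a power-of-two row length is the worst case, and a single word of padding per row restores conflict-free column access.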




    Published In

    ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing
    June 2008, 390 pages
    ISBN: 9781605581583
    DOI: 10.1145/1375527

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. GPU
    2. empirical tuning
    3. memory access optimization
    4. polyhedral model


    Conference

    ICS '08: International Conference on Supercomputing
    June 7-12, 2008
    Island of Kos, Greece

    Acceptance Rates

    Overall acceptance rate: 629 of 2,180 submissions (29%)

    Article Metrics

    • Downloads (last 12 months): 44
    • Downloads (last 6 weeks): 6

    Reflects downloads up to 05 Mar 2025

    Cited By

    • (2024) If-Convert as Early as You Must. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, pp. 26-38. DOI: 10.1145/3640537.3641562. Online publication date: 17-Feb-2024.
    • (2024) MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access, 12, pp. 34354-34377. DOI: 10.1109/ACCESS.2024.3372990. Online publication date: 2024.
    • (2022) VICO. Proceedings of the 36th ACM International Conference on Supercomputing, pp. 1-14. DOI: 10.1145/3524059.3532393. Online publication date: 28-Jun-2022.
    • (2022) Flexible Performant GEMM Kernels on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(9), pp. 2230-2248. DOI: 10.1109/TPDS.2021.3136457. Online publication date: 1-Sep-2022.
    • (2022) AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1109/SC41404.2022.00071. Online publication date: Nov-2022.
    • (2022) TEA-RC: Thread Context-Aware Register Cache for GPUs. IEEE Access, 10, pp. 82049-82062. DOI: 10.1109/ACCESS.2022.3196149. Online publication date: 2022.
    • (2022) A Survey of Performance Tuning Techniques and Tools for Parallel Applications. IEEE Access, 10, pp. 15036-15055. DOI: 10.1109/ACCESS.2022.3147846. Online publication date: 2022.
    • (2022) Criticality-aware priority to accelerate GPU memory access. The Journal of Supercomputing, 79(1), pp. 188-213. DOI: 10.1007/s11227-022-04657-3. Online publication date: 6-Jul-2022.
    • (2021) Algebra-Dynamic Models for CPU- and GPU-Parallel Program Design and the Model of Auto-Tuning. Formal and Adaptive Methods for Automation of Parallel Programs Construction, pp. 112-142. DOI: 10.4018/978-1-5225-9384-3.ch004. Online publication date: 2021.
    • (2021) On the Impact of Affine Loop Transformations in Qubit Allocation. ACM Transactions on Quantum Computing, 2(3), pp. 1-40. DOI: 10.1145/3465409. Online publication date: 30-Sep-2021.
