Research article, ICS Conference Proceedings
DOI: 10.1145/1375527.1375562

A compiler framework for optimization of affine loop nests for GPGPUs

Published: 07 June 2008

Abstract

GPUs are a class of specialized parallel architectures with tremendous computational power. NVIDIA's Compute Unified Device Architecture (CUDA) programming model facilitates programming of general-purpose applications on its GPUs. However, manual development of high-performance parallel code for GPUs remains very challenging. This paper addresses a number of issues toward the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) an approach to program transformation for efficient data access from GPU global memory, using a polyhedral compiler model of data dependence abstraction and program transformation; 2) determination of optimal padding factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. Experimental results on a number of kernels demonstrate the effectiveness of the compiler optimization approaches developed.
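The shared-memory padding idea in item 2 can be illustrated with a small sketch. The following is a minimal, hypothetical Python model (not the paper's actual algorithm), assuming the 16 shared-memory banks and 16-thread half-warp access pattern of G80-era GPUs such as the GeForce 8800: consecutive threads accessing words a fixed stride apart collide on a bank whenever the stride shares a factor with the bank count, so padding each row of a shared-memory array can make column-wise access conflict-free. The helpers `conflict_degree` and `best_padding` are illustrative names introduced here.

```python
# Hypothetical model of shared-memory bank conflicts (assumption: 16 banks
# and a 16-thread half-warp, as on G80-era GPUs like the GeForce 8800).
from math import gcd

NUM_BANKS = 16  # assumed number of shared-memory banks

def conflict_degree(stride, num_banks=NUM_BANKS):
    """Threads per bank when thread t accesses word t * stride.

    Addresses t*stride mod num_banks take num_banks // gcd(stride, num_banks)
    distinct values, so gcd(stride, num_banks) threads collide on each bank.
    A degree of 1 means the access is conflict-free.
    """
    return gcd(stride, num_banks)

def best_padding(row_len, num_banks=NUM_BANKS):
    """Smallest per-row padding p (in words) minimizing bank conflicts for
    column-wise access, whose effective stride becomes row_len + p."""
    return min(range(num_banks),
               key=lambda p: (conflict_degree(row_len + p, num_banks), p))

# A 16x16 tile accessed column-wise is fully serialized (16-way conflict);
# padding each row by one word makes the same access conflict-free.
assert conflict_degree(16) == 16
assert conflict_degree(17) == 1
assert best_padding(16) == 1
```

Under these assumptions, the classic result falls out: a power-of-two row length is the worst case, and a single word of padding per row restores conflict-free column access.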




    Published In

    ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing
    June 2008, 390 pages
    ISBN: 9781605581583
    DOI: 10.1145/1375527

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. GPU
    2. empirical tuning
    3. memory access optimization
    4. polyhedral model


    Conference

    ICS '08: International Conference on Supercomputing
    June 7-12, 2008
    Island of Kos, Greece

    Acceptance Rates

    Overall acceptance rate: 629 of 2,180 submissions (29%)

    Article Metrics

    • Downloads (last 12 months): 44
    • Downloads (last 6 weeks): 6

    Reflects downloads up to 05 Mar 2025

    Cited By

    • (2024) If-Convert as Early as You Must. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, pp. 26-38. DOI: 10.1145/3640537.3641562. Online publication date: 17-Feb-2024.
    • (2024) MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access, 12, pp. 34354-34377. DOI: 10.1109/ACCESS.2024.3372990. Online publication date: 2024.
    • (2022) VICO. Proceedings of the 36th ACM International Conference on Supercomputing, pp. 1-14. DOI: 10.1145/3524059.3532393. Online publication date: 28-Jun-2022.
    • (2022) Flexible Performant GEMM Kernels on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(9), pp. 2230-2248. DOI: 10.1109/TPDS.2021.3136457. Online publication date: 1-Sep-2022.
    • (2022) AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1109/SC41404.2022.00071. Online publication date: Nov-2022.
    • (2022) TEA-RC: Thread Context-Aware Register Cache for GPUs. IEEE Access, 10, pp. 82049-82062. DOI: 10.1109/ACCESS.2022.3196149. Online publication date: 2022.
    • (2022) A Survey of Performance Tuning Techniques and Tools for Parallel Applications. IEEE Access, 10, pp. 15036-15055. DOI: 10.1109/ACCESS.2022.3147846. Online publication date: 2022.
    • (2022) Criticality-aware priority to accelerate GPU memory access. The Journal of Supercomputing, 79(1), pp. 188-213. DOI: 10.1007/s11227-022-04657-3. Online publication date: 6-Jul-2022.
    • (2021) Algebra-Dynamic Models for CPU- and GPU-Parallel Program Design and the Model of Auto-Tuning. Formal and Adaptive Methods for Automation of Parallel Programs Construction, pp. 112-142. DOI: 10.4018/978-1-5225-9384-3.ch004. Online publication date: 2021.
    • (2021) On the Impact of Affine Loop Transformations in Qubit Allocation. ACM Transactions on Quantum Computing, 2(3), pp. 1-40. DOI: 10.1145/3465409. Online publication date: 30-Sep-2021.
