DOI: 10.1145/1345206.1345210
research-article

Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Published: 20 February 2008

Abstract

Several parallel architectures, such as GPUs and the Cell processor, have fast, explicitly managed on-chip memories in addition to slow off-chip memory. They also offer very high computational power through multiple levels of parallelism. A significant challenge in programming these architectures is to exploit the available parallelism effectively and to manage the fast memories so as to maximize performance.
In this paper, we develop an approach to effective automatic data management for on-chip memories. It includes creating buffers in on-chip (local) memories to hold the portions of data accessed in a computational block, automatically determining the array access functions of local buffer references, and generating code that moves data between slow off-chip memory and fast local memories. We also address the problem of mapping computation in regular programs to multi-level parallel architectures using a multi-level tiling approach, and we study the impact of on-chip memory availability on the selection of tile sizes at various levels. Experimental results on a GPU demonstrate the effectiveness of the proposed approach.
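
To make the ideas in the abstract concrete, the following is a minimal hand-written sketch in Python. It is not the paper's algorithm or its generated code. It only illustrates the pattern the abstract describes: copy the part of each array touched by a tile into a small local buffer, compute on the buffer with indices rewritten relative to the tile origin, and copy the result back. Plain Python lists stand in for off-chip and on-chip memory. The names `tiled_matmul` and `max_square_tile`, and the assumption that a tile's footprint is three square buffers, are hypothetical choices made for this illustration.

```python
import math


def max_square_tile(capacity_words):
    """Largest tile edge t such that three t x t buffers fit in local memory.

    Assumption for this sketch: the local footprint of one tile is the
    A, B and C buffers, i.e. 3 * t * t words.
    """
    return math.isqrt(capacity_words // 3)


def tiled_matmul(a, b, n, tile):
    """Compute C = A * B for n x n matrices stored as lists of rows."""
    if n % tile != 0:
        raise ValueError("this sketch requires tile to divide n")
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Local accumulator for the C tile with origin (i0, j0).
            c_loc = [[0.0] * tile for _ in range(tile)]
            for k0 in range(0, n, tile):
                # Copy-in: move the accessed portions of A and B from
                # "off-chip" storage into local buffers.
                a_loc = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_loc = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # Compute on local buffers. A global access a[i0 + i][k0 + k]
                # becomes the local access a_loc[i][k].
                for i in range(tile):
                    for j in range(tile):
                        acc = c_loc[i][j]
                        for k in range(tile):
                            acc += a_loc[i][k] * b_loc[k][j]
                        c_loc[i][j] = acc
            # Copy-out: write the finished tile back to "off-chip" storage.
            for i in range(tile):
                c[i0 + i][j0:j0 + tile] = c_loc[i]
    return c
```

In this sketch, a smaller local memory forces a smaller `tile`, which means more copy-in operations for the same amount of computation. That is the kind of trade-off between on-chip memory availability and tile size that the abstract refers to.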

    Published In

    PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    February 2008
    308 pages
    ISBN: 9781595937957
    DOI: 10.1145/1345206

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data movement
    2. graphics processor unit
    3. multi-level tiling
    4. scratchpad memory

    Qualifiers

    • Research-article

    Conference

    PPoPP08
    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Cited By

    • (2019) Easy Universal Translator as an Alternative Compiler-Compiler. Advances in Cyber-Physical Systems 4:2 (105-109). DOI: 10.23939/acps2019.02.105. Online publication date: 5-Oct-2019
    • (2019) Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards (1-20). DOI: 10.1007/978-3-662-58834-5_1. Online publication date: 23-Feb-2019
    • (2018) Using polyhedral techniques to tighten WCET estimates of optimized code: A case study with array contraction. 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE) (925-930). DOI: 10.23919/DATE.2018.8342142. Online publication date: Mar-2018
    • (2018) Controllers. International Journal of High Performance Computing Applications 32:6 (838-853). DOI: 10.1177/1094342017702962. Online publication date: 1-Nov-2018
    • (2018) An Access-Pattern-Aware On-Chip Vector Memory System with Automatic Loading for SIMD Architectures. 2018 IEEE High Performance extreme Computing Conference (HPEC) (1-7). DOI: 10.1109/HPEC.2018.8547551. Online publication date: Sep-2018
    • (2017) GPUDrano: Detecting Uncoalesced Accesses in GPU Programs. Computer Aided Verification (507-525). DOI: 10.1007/978-3-319-63387-9_25. Online publication date: 13-Jul-2017
    • (2016) Parallel Locality and Parallelization Quality. Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores (59-68). DOI: 10.1145/2883404.2883410. Online publication date: 12-Mar-2016
    • (2016) Pragma Directed Shared Memory Centric Optimizations on GPUs. Journal of Computer Science and Technology 31:2 (235-252). DOI: 10.1007/s11390-016-1624-8. Online publication date: 7-Mar-2016
    • (2015) Characterizing and enhancing global memory data coalescing on GPUs. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (12-22). DOI: 10.5555/2738600.2738603. Online publication date: 7-Feb-2015
    • (2015) A Profile Guided Approach to Optimize Branch Divergence While Transforming Applications for GPUs. Proceedings of the 8th India Software Engineering Conference (176-185). DOI: 10.1145/2723742.2723760. Online publication date: 18-Feb-2015
