research-article

Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Authors: Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, Atanas Rountev, P. Sadayappan

PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 1 - 10

https://doi.org/10.1145/1345206.1345210

Published: 20 February 2008

Abstract

Several parallel architectures, such as GPUs and the Cell processor, have fast, explicitly managed on-chip memories in addition to slow off-chip memory. They also offer very high computational power through multiple levels of parallelism. A significant challenge in programming these architectures is to exploit the available parallelism effectively and to manage the fast memories so as to maximize performance.

In this paper we develop an approach to effective automatic data management for on-chip memories, including creation of buffers in on-chip (local) memories for holding portions of data accessed in a computational block, automatic determination of array access functions of local buffer references, and generation of code that moves data between slow off-chip memory and fast local memories. We also address the problem of mapping computation in regular programs to multi-level parallel architectures using a multi-level tiling approach, and study the impact of on-chip memory availability on the selection of tile sizes at various levels. Experimental results on a GPU demonstrate the effectiveness of the proposed approach.
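
To make the data-management idea concrete, the following is a minimal Python sketch; it is not the paper's generated code. It tiles a matrix product, copies each tile from the full arrays (standing in for slow off-chip memory) into small per-tile buffers (standing in for on-chip memory), and rewrites each access as a buffer-relative index by subtracting the tile origin. The function names `tiled_matmul`, `local_footprint`, and `largest_tile_that_fits`, the single level of square tiles, and the three-buffer footprint model are all illustrative assumptions.

```python
def tiled_matmul(A, B, n, tile):
    """Multiply two n x n matrices (lists of lists) tile by tile.

    A, B and C play the role of off-chip memory; a_loc, b_loc and c_loc
    play the role of explicitly managed local buffers (assumed model).
    """
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        ti = min(tile, n - i0)              # rows in this tile (handles edges)
        for j0 in range(0, n, tile):
            tj = min(tile, n - j0)          # columns in this tile
            c_loc = [[0] * tj for _ in range(ti)]   # local accumulator buffer
            for k0 in range(0, n, tile):
                tk = min(tile, n - k0)
                # Move-in: copy the accessed portions of A and B into buffers.
                # Local index = global index minus the tile origin.
                a_loc = [[A[i0 + i][k0 + k] for k in range(tk)] for i in range(ti)]
                b_loc = [[B[k0 + k][j0 + j] for j in range(tj)] for k in range(tk)]
                # Compute using only buffer references.
                for i in range(ti):
                    for j in range(tj):
                        for k in range(tk):
                            c_loc[i][j] += a_loc[i][k] * b_loc[k][j]
            # Move-out: copy the finished tile back to the full array.
            for i in range(ti):
                for j in range(tj):
                    C[i0 + i][j0 + j] = c_loc[i][j]
    return C


def local_footprint(tile):
    """Elements held locally at once: one tile each of A, B and C (assumed model)."""
    return 3 * tile * tile


def largest_tile_that_fits(capacity_elems, n):
    """Largest square tile edge (at most n) whose buffers fit; 0 if none fits."""
    best = 0
    for t in range(1, n + 1):
        if local_footprint(t) <= capacity_elems:
            best = t
    return best
```

Under this model, a hypothetical local memory of 4096 elements admits a tile edge of 36, since 3 * 36 * 36 = 3888 fits and 3 * 37 * 37 = 4107 does not. A real selection would likely also weigh parallelism and data-movement cost at each tiling level, which this sketch ignores.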

Index Terms

Software and its engineering → Software notations and tools → Compilers → Source code generation

Published In

PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2008

308 pages

ISBN:9781595937957

DOI:10.1145/1345206

General Chair: Siddhartha Chatterjee (IBM Research, USA)
Program Chair: Michael L. Scott (University of Rochester, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PPoPP08: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 20-23, 2008

Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
