
Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications

International Journal of Parallel Programming

Abstract

We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we enable automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also show how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe up to a 10.94X speedup over the original layout and a 1.16X performance gain in the worst case.
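To make the transformation concrete, the following is a minimal C99 sketch, not the authors' compiler output: the tile sizes `TX`/`TY`, the index functions, and all other names are illustrative assumptions. It contrasts a conventional row-major index with a tiled index of the kind that spreads neighboring grid points' requests across parallel memory-system components, and shows the variable-length-array view that exposes runtime dimensions to the compiler.

```c
/* Minimal sketch of a layout transformation for a 2D structured grid.
 * Illustrative only, not the paper's implementation: TX, TY, and all
 * names here are assumptions chosen for exposition. */
#include <stdio.h>
#include <stdlib.h>

#define TX 16  /* assumed tile width, e.g. matched to a DRAM burst       */
#define TY 4   /* assumed tile height; real values come from a model     */

/* Conventional row-major index into an ny-by-nx grid. */
static inline size_t idx_row_major(size_t nx, size_t y, size_t x) {
    return y * nx + x;
}

/* Transformed index: the grid is stored as contiguous TY-by-TX tiles,
 * so requests from neighboring grid points land in different parts of
 * the address space, and thus in different memory channels and banks. */
static inline size_t idx_tiled(size_t nx, size_t y, size_t x) {
    size_t tiles_per_row = nx / TX;            /* assumes nx % TX == 0   */
    size_t tile = (y / TY) * tiles_per_row + (x / TX);
    return tile * (TY * TX) + (y % TY) * TX + (x % TX);
}

int main(void) {
    size_t nx = 64, ny = 16;                   /* dims known at run time */
    float *a = malloc(nx * ny * sizeof *a);    /* dynamically allocated  */
    if (!a) return 1;

    /* C99 variable-length-array syntax gives the compiler the dimension
     * information it needs to rewrite every access site consistently:   */
    float (*grid)[nx] = (float (*)[nx])a;      /* grid[y][x] view        */
    grid[3][5] = 1.0f;                         /* original-layout access */

    /* After the layout transformation, the same logical element would
     * instead be addressed through the tiled index function:            */
    a[idx_tiled(nx, 3, 5)] = 1.0f;

    printf("row-major: %zu, tiled: %zu\n",
           idx_row_major(nx, 3, 5), idx_tiled(nx, 3, 5));
    free(a);
    return 0;
}
```

In the paper's setting the tiling parameters would be chosen statically from a model of the GPU memory system rather than hard-coded as above; the role of the VLA view is that array dimensions unknown until run time are still visible to the compiler at every access site, so indexing can be rewritten uniformly for dynamically allocated arrays.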



Author information

Correspondence to I-Jui Sung.


Cite this article

Sung, IJ., Anssari, N., Stratton, J.A. et al. Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications. Int J Parallel Prog 40, 4–24 (2012). https://doi.org/10.1007/s10766-011-0182-5
