DOI: 10.1145/1810085.1810096

Cache oblivious parallelograms in iterative stencil computations

Published: 02 June 2010

Abstract

We present a new cache oblivious scheme for iterative stencil computations that performs beyond system bandwidth limitations, as though gigabytes of data could reside in an enormous on-chip cache. For constant and variable stencils on 2D and 3D spatial domains with up to 128 million double-precision elements, we compare execution times against hand-optimized naive code and against PluTo, the automatic polyhedral parallelizer and locality optimizer, and demonstrate the clear superiority of our scheme.
The performance benefits stem from a tiling structure that caters for data locality, parallelism and vectorization simultaneously. Rather than tiling the iteration space from inside, we take an exterior approach with a predefined hierarchy, simple regular parallelogram tiles and a locality-preserving parallelization. These advantages come at the cost of an irregular workload distribution, but a tightly integrated load balancer ensures high utilization of all resources.
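
To make the idea of parallelogram tiles concrete, the following is a minimal sequential sketch of classic time skewing for a 1D three-point averaging stencil. It is not the scheme from the paper, which is hierarchical, parallel and vectorized, and targets 2D and 3D domains. The stencil, the fixed boundary values, and the names `naive_stencil`, `skewed_stencil` and `width` are assumptions chosen only for illustration. Each tile is a parallelogram in space-time that shifts one cell to the left with every time step, so all time steps of a tile can run while its data is still in cache.

```python
def naive_stencil(u0, steps):
    """Reference version: sweep the whole domain once per time step."""
    a, b = list(u0), list(u0)  # both buffers carry the fixed boundary values
    for _ in range(steps):
        for x in range(1, len(a) - 1):
            b[x] = (a[x - 1] + a[x] + a[x + 1]) / 3.0
        a, b = b, a
    return a


def skewed_stencil(u0, steps, width):
    """Time-skewed version: parallelogram tiles of the given spatial width.

    Tile k covers the cells x with k*width <= x + t < (k+1)*width at time
    step t, so it leans one cell to the left per step. Processing tiles from
    left to right, and time steps in ascending order inside each tile,
    respects every dependency of the three-point stencil.
    """
    n = len(u0)
    buf = [list(u0), list(u0)]  # ping-pong buffers
    # The largest skewed coordinate x + t is (n - 2) + (steps - 1).
    num_tiles = (n + steps - 3) // width + 1
    for k in range(num_tiles):
        for t in range(steps):
            lo = max(1, k * width - t)            # clip to interior cells
            hi = min(n - 1, (k + 1) * width - t)
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            for x in range(lo, hi):
                dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
    return buf[steps % 2]


if __name__ == "__main__":
    u0 = [float((i * 7) % 5) for i in range(23)]
    # Both versions evaluate the same expression per point, so they agree exactly.
    print(skewed_stencil(u0, 9, 4) == naive_stencil(u0, 9))
```

In this sketch the tile width is a tuning parameter. A cache oblivious scheme, by contrast, relies on a hierarchy of tile sizes so that no cache-specific parameter has to be chosen.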

Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing
June 2010
365 pages
ISBN:9781450300186
DOI:10.1145/1810085

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cache oblivious
  2. memory bound
  3. memory wall
  4. parallelism and locality
  5. stencil
  6. temporal blocking
  7. time skewing

Qualifiers

  • Research-article

Conference

ICS'10: International Conference on Supercomputing
June 2 - 4, 2010
Tsukuba, Ibaraki, Japan

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

  • (2024) LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17. DOI: 10.1109/SC41406.2024.00059. Online publication date: 17-Nov-2024
  • (2023) Casper: Accelerating Stencil Computations Using Near-Cache Processing. IEEE Access 11, pp. 22136-22154. DOI: 10.1109/ACCESS.2023.3252002. Online publication date: 2023
  • (2021) Reducing redundancy in data organization and arithmetic calculation for stencil computations. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3458817.3476154. Online publication date: 14-Nov-2021
  • (2021) Using the Semi-Stencil Algorithm to Accelerate High-Order Stencils on GPUs. 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 63-68. DOI: 10.1109/PMBS54543.2021.00012. Online publication date: Nov-2021
  • (2021) Accelerating high-order stencils on GPUs. Concurrency and Computation: Practice and Experience 34:20. DOI: 10.1002/cpe.6467. Online publication date: 22-Aug-2021
  • (2020) Accelerating High-Order Stencils on GPUs. 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 86-108. DOI: 10.1109/PMBS51919.2020.00014. Online publication date: Nov-2020
  • (2020) NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling. 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), pp. 9-17. DOI: 10.1109/FPL50879.2020.00014. Online publication date: Aug-2020
  • (2019) Tessellating Star Stencils. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3337821.3337835. Online publication date: 5-Aug-2019
  • (2018) A Code Generator for Energy-Efficient Wavefront Parallelization of Uniform Dependence Computations. IEEE Transactions on Parallel and Distributed Systems 29:9, pp. 1923-1936. DOI: 10.1109/TPDS.2017.2709748. Online publication date: 1-Sep-2018
  • (2018) Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time–space decomposition. Journal of Computational Physics 357:C, pp. 338-352. DOI: 10.1016/j.jcp.2017.12.028. Online publication date: 15-Mar-2018