research-article

Cache oblivious parallelograms in iterative stencil computations

Authors: Robert Strzodka, Mohammed Shaheen, Hans-Peter Seidel

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

Pages 49 - 59

https://doi.org/10.1145/1810085.1810096

Published: 02 June 2010

Abstract

We present a new cache oblivious scheme for iterative stencil computations that performs beyond system bandwidth limitations, as though gigabytes of data could reside in an enormous on-chip cache. We compare execution times for 2D and 3D spatial domains with up to 128 million double-precision elements, for constant and variable stencils, against hand-optimized naive code and against PluTo, the automatic polyhedral parallelizer and locality optimizer, and demonstrate the clear superiority of our results.

The performance benefits stem from a tiling structure that caters for data locality, parallelism and vectorization simultaneously. Rather than tiling the iteration space from the inside, we take an exterior approach with a predefined hierarchy, simple regular parallelogram tiles and a locality-preserving parallelization. These advantages come at the cost of an irregular workload distribution, but a tightly integrated load balancer ensures high utilization of all resources.
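
To illustrate the basic idea of parallelogram tiling, the following minimal Python sketch applies it to a 1D three-point averaging stencil with fixed boundary values. It is not the authors' scheme: the stencil, the function names `stencil_naive` and `stencil_parallelogram`, and the fixed tile `width` are assumptions made for this example, and the sketch covers neither the cache oblivious hierarchy nor parallelization, vectorization or load balancing. It only shows how skewing a tile by one cell per time step lets several time steps run over a small block of data before the computation moves on.

```python
def stencil_naive(u0, steps):
    """Reference version: every time step streams over the whole array."""
    n = len(u0)
    # Two ping-pong buffers; the boundary cells 0 and n-1 are never written.
    bufs = [list(u0), list(u0)]
    for t in range(steps):
        src, dst = bufs[t % 2], bufs[(t + 1) % 2]
        for i in range(1, n - 1):
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return bufs[steps % 2]


def stencil_parallelogram(u0, steps, width=8):
    """Same result, computed tile by tile over skewed space-time tiles.

    Tile k covers cells [k*width - t, (k+1)*width - t) at time step t,
    so it leans left by one cell per step (a parallelogram in space-time).
    Tiles are processed left to right; each tile runs all time steps.
    """
    n = len(u0)
    bufs = [list(u0), list(u0)]
    # Enough tiles so that the skewed tiles still reach the right edge
    # of the domain at the last time step.
    num_tiles = (n + steps) // width + 1
    for k in range(num_tiles):
        for t in range(steps):
            src, dst = bufs[t % 2], bufs[(t + 1) % 2]
            lo = max(1, k * width - t)            # clip to interior cells
            hi = min(n - 1, (k + 1) * width - t)
            for i in range(lo, hi):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return bufs[steps % 2]


if __name__ == "__main__":
    u0 = [float((7 * i) % 11) for i in range(50)]
    ref = stencil_naive(u0, 13)
    out = stencil_parallelogram(u0, 13, width=8)
    print(max(abs(a - b) for a, b in zip(ref, out)))
```

Both functions perform the same arithmetic on the same inputs; only the traversal order differs. The skew ensures that every value a tile reads at step t was already written either by a tile further left or by the previous step of the same tile, which is why the two ping-pong buffers remain sufficient in this sketch.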

Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

June 2010

365 pages

ISBN: 9781450300186

DOI: 10.1145/1810085

General Chair: Taisuke Boku (University of Tsukuba)

Program Chairs: Hiroshi Nakashima (Kyoto University), Avi Mendelson (Microsoft)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS'10: International Conference on Supercomputing

June 2 - 4, 2010

Tsukuba, Ibaraki, Japan

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
