The pochoir stencil compiler

Authors:

Rezaul Alam Chowdhury,

Bradley C. Kuszmaul,

Charles E. Leiserson

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

Pages 117 - 128

https://doi.org/10.1145/1989493.1989508

Published: 04 June 2011

Abstract

A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++, which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.
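To make the definition of a stencil computation concrete, the following is a minimal sketch of a one-dimensional, three-point heat stencil with periodic boundary conditions, written as a plain serial loop nest in C++. It is not Pochoir's specification language and does not implement the trapezoidal or hyperspace-cut algorithm; it only illustrates the kind of straightforward loop code that such a compiler is meant to replace. The names `heat_step` and `heat_run` and the coefficient `alpha` are assumptions introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One time step of a 1D three-point heat stencil with periodic boundaries.
// Each point is updated from itself and its two nearest neighbors.
// `alpha` is a hypothetical diffusion coefficient chosen for illustration.
std::vector<double> heat_step(const std::vector<double>& u, double alpha) {
    assert(!u.empty());
    const std::size_t n = u.size();
    std::vector<double> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double left  = u[(i + n - 1) % n];  // wraps around at i == 0
        const double right = u[(i + 1) % n];      // wraps around at i == n - 1
        next[i] = u[i] + alpha * (left - 2.0 * u[i] + right);
    }
    return next;
}

// Apply the stencil repeatedly for `steps` time steps.
std::vector<double> heat_run(std::vector<double> u, double alpha, int steps) {
    for (int t = 0; t < steps; ++t) {
        u = heat_step(u, alpha);
    }
    return u;
}
```

Each time step sweeps the whole grid before the next begins, so a grid larger than cache is reloaded from memory on every step. That memory traffic is what cache-efficient decompositions of the space-time grid aim to reduce.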

Index Terms

The pochoir stencil compiler
1. Mathematics of computing
  1. Mathematical software

Published In

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

June 2011

404 pages

ISBN: 9781450307437

DOI: 10.1145/1989493

Co-chairs:
Friedhelm Meyer auf der Heide, University of Paderborn, Germany
Rajmohan Rajaraman, Northeastern University, USA

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

EATCS: European Association for Theoretical Computer Science

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SPAA '11: 23rd ACM Symposium on Parallelism in Algorithms and Architectures

June 4 - 6, 2011

San Jose, California, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%
