research-article

Hierarchical overlapped tiling

Authors:

Jean-Pierre Giacalone,

María Jesús Garzarán,

Robert H. Kuhn,

David Padua

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

Pages 207 - 218

https://doi.org/10.1145/2259016.2259044

Published: 31 March 2012

Abstract

This paper introduces hierarchical overlapped tiling, a transformation that applies loop tiling and fusion to conventional loops. Overlapped tiling is a useful transformation to reduce communication overhead, but it may also generate a significant amount of redundant computation. Hierarchical overlapped tiling performs overlapped tiling hierarchically to balance communication overhead and redundant computation, and thus has the potential to provide better performance.

In this paper, we describe the hierarchical overlapped tiling optimization and its implementation in an OpenCL compiler. We also evaluate the effectiveness of this optimization using 8 programs that implement different forms of stencil computation. Our results show that hierarchical overlapped tiling achieves an average 37% speedup over traditional tiling on a 32-core workstation.
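The trade-off described in the abstract can be illustrated with a minimal single-level sketch. This is not the paper's OpenCL implementation: the 3-point averaging stencil, the fixed end cells, and the names `stencil_step`, `reference`, and `overlapped_tiled` are all assumptions made for illustration. Each tile copies a halo of `steps` cells on each side, runs every time step locally without exchanging data with its neighbours, and keeps only its own cells. The halo work is the redundant computation the abstract refers to.

```python
def stencil_step(a):
    """One 3-point averaging step; the two end cells are held fixed."""
    out = a[:]
    for i in range(1, len(a) - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


def reference(a, steps):
    """Untiled baseline: apply the stencil to the whole array `steps` times."""
    for _ in range(steps):
        a = stencil_step(a)
    return a


def overlapped_tiled(a, steps, tile):
    """Overlapped tiling (illustrative): each tile works on a private copy
    extended by `steps` halo cells per side, so no communication is needed
    between time steps. Returns the result and the number of cell updates."""
    n = len(a)
    result = [0.0] * n
    updates = 0
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(start - steps, 0)   # halo clipped at the global boundary
        hi = min(end + steps, n)
        local = a[lo:hi]
        for _ in range(steps):
            local = stencil_step(local)
            updates += max(len(local) - 2, 0)
        # Halo cells become stale one cell per step from each cut edge;
        # after `steps` steps only the tile's own cells are still valid.
        result[start:end] = local[start - lo:end - lo]
    return result, updates


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(40)]
    tiled, work = overlapped_tiled(data, steps=3, tile=8)
    baseline_work = 3 * (len(data) - 2)
    print("matches reference:", tiled == reference(data, 3))
    print("cell updates: tiled", work, "vs untiled", baseline_work)
```

In this sketch, shrinking `tile` or increasing `steps` raises the share of redundant halo updates, while larger tiles reduce it but offer fewer independent units of work. A hierarchical scheme, as the abstract describes it, applies overlapping at more than one level to balance these two costs.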

Index Terms

Hierarchical overlapped tiling
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        Parallel programming languages

Published In

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

March 2012

285 pages

ISBN:9781450312066

DOI:10.1145/2259016

General Chairs: Carol Eidt (Microsoft), Anne Holler (VMware)

Program Chairs: Uma Srinivasan (Intel), Saman Amarasinghe (MIT)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CGO '12: Annual IEEE/ACM International Symposium on Code Generation and Optimization

March 31 - April 4, 2012

San Jose, California

Acceptance Rates

CGO '12 Paper Acceptance Rate 26 of 90 submissions, 29%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics

Total Citations: 80

Total Downloads: 499

Downloads (Last 12 months): 19

Downloads (Last 6 weeks): 3

Reflects downloads up to 28 Feb 2025

Cited By

Lakshminarasimhan M, Antepara O, Zhao T, Sepanski B, Basu P, Johansen H, Hall M, Williams S (2024). Bricks: A high-performance portability layer for computations on block-structured grids. The International Journal of High Performance Computing Applications 38:6 (549-567). Online publication date: 19-Aug-2024
https://doi.org/10.1177/10943420241268288
Hsu C, Zheng H, Liu Y, Yeh T (2024). StreamNet++: Memory-Efficient Streaming TinyML Model Compilation on Microcontrollers. ACM Transactions on Embedded Computing Systems 24:2 (1-26). Online publication date: 29-Nov-2024
https://dl.acm.org/doi/10.1145/3706107
Antepara O, Williams S, Johansen H, Hall M (2024). High-Performance, Scalable Geometric Multigrid via Fine-Grain Data Blocking for GPUs. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (1177-1191). Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SCW63240.2024.00159
Antepara O, Williams S, Johansen H, Zhao T, Hirsch S, Goyal P, Hall M (2023). Performance Portability Evaluation of Blocked Stencil Computations on GPUs. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (1007-1018). Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3624062.3624177
Liu S, Zhang Z, Wu W (2023). DHTS: A Dynamic Hybrid Tiling Strategy for Optimizing Stencil Computation on GPUs. IEEE Transactions on Computers 72:10 (2795-2807). Online publication date: Oct-2023
https://doi.org/10.1109/TC.2023.3271060
Kelefouras V, Djemame K, Keramidas G, Voros N (2022). A Methodology for Efficient Tile Size Selection for Affine Loop Kernels. International Journal of Parallel Programming 50:3-4 (405-432). Online publication date: 23-May-2022
https://doi.org/10.1007/s10766-022-00734-5
Kelefouras V, Djemame K, Keramidas G, Voros N (2022). An Analytical Model for Loop Tiling Transformation. Embedded Computer Systems: Architectures, Modeling, and Simulation (95-107). Online publication date: 27-Apr-2022
https://doi.org/10.1007/978-3-031-04580-6_7
Abdelaal K, Kong M, Zhou H, Moreira J, Mueller F, Etsion Y (2021). Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation. Proceedings of the 35th ACM International Conference on Supercomputing (13-26). Online publication date: 3-Jun-2021
https://dl.acm.org/doi/10.1145/3447818.3460369
Szustak L, Wyrzykowski R, Kuczynski L, Olas T (2021). Architectural Adaptation and Performance-Energy Optimization for CFD Application on AMD EPYC Rome. IEEE Transactions on Parallel and Distributed Systems 32:12 (2852-2866). Online publication date: 1-Dec-2021
https://doi.org/10.1109/TPDS.2021.3078153
Kœhler T, Steuwer M, Lee J (2021). Towards a domain-extensible compiler. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (27-38). Online publication date: 27-Feb-2021
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370337
