skip to main content
10.1145/2259016.2259044acmconferencesArticle/Chapter ViewAbstractPublication PagescgoConference Proceedingsconference-collections
research-article

Hierarchical overlapped tiling

Published: 31 March 2012 Publication History

Abstract

This paper introduces hierarchical overlapped tiling, a transformation that applies loop tiling and fusion to conventional loops. Overlapped tiling is a useful transformation to reduce communication overhead, but it may also generate a significant amount of redundant computation. Hierarchical overlapped tiling performs overlapped tiling hierarchically to balance communication overhead and redundant computation, and thus has the potential to provide better performance.
In this paper, we describe the hierarchical overlapped tiling optimization and its implementation in an OpenCL compiler. We also evaluate the effectiveness of this optimization using 8 programs that implement different forms of stencil computation. Our results show that hierarchical overlapped tiling achieves an average 37% speedup over traditional tiling on a 32-core workstation.

References

[1]
W. Abu-Sufah, D. Kuck, and D. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. Computers, IEEE Transactions on, C-30(5):341--356, may 1981.
[2]
N. Ahmed, N. Mateev, and K. Pingali. Tiling imperfectly-nested loop nests. In Supercomputing, ACM/IEEE 2000 Conference, page 31, nov. 2000.
[3]
R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, 2001.
[4]
M. Alpert. Not just Fun and Games. Scientific American, 4, 1999.
[5]
W. F. Ames. Numerical Methods for Partial Differential Equations. Academic, San Diego, CA, sencond edition, 1977.
[6]
R. Andonov, S. Balev, S. Rajopadhye, and N. Yanev. Optimal semi-oblique tiling. In Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '01, pages 153--162, New York, NY, USA, 2001. ACM.
[7]
W. Blume and R. Eigenmann. Symbolic range propagation. In Proceedings. of 9th International Parallel Processing Symposium, 1995., pages 357--363, apr 1995.
[8]
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, PLDI '08, pages 101--113, New York, NY, USA, 2008. ACM.
[9]
S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation, PLDI '95, pages 279--290, New York, NY, USA, 1995. ACM.
[10]
K. Högstedt, L. Carter, and J. Ferrante. Selecting tile shape for minimal execution time. In Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, SPAA '99, pages 201--211, New York, NY, USA, 1999. ACM.
[11]
W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusam. Compact thermal modeling for temperature-aware design. In Proceedings of the 41st annual Design Automation Conference, DAC '04, pages 878--883, New York, NY, USA, 2004. ACM.
[12]
T. Johnson, S.-I. Lee, L. Fei, A. Basumallik, G. Upadhyaya, R. Eigenmann, and S. Midkiff. Experiences in using cetus for source-to-source transformations. In Languages and Compilers for High Performance Computing, volume 3602 of Lecture Notes in Computer Science, pages 922--922. 2005.
[13]
W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The omega library interface guide. Technical report, 1995.
[14]
Khronos OpenCL Working Group. The OpenCL Specification, version 1.0.29, 8 December 2008.
[15]
G. Kreisel and J.-L. Krivine. Elements of mathematical logic. North-Holland Pub. Co., 1967.
[16]
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI '07, pages 235--244, New York, NY, USA, 2007. ACM.
[17]
D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '81, pages 207--218, New York, NY, USA, 1981. ACM.
[18]
A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul, and R. Lethin. A mapping path for multi-gpgpu accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pages 51--61, New York, NY, USA, 2010. ACM.
[19]
J. Liu, Y. Zhang, W. Ding, and M. Kandemir. On-chip cache hierarchy-aware tile scheduling for multicore machines. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on, pages 161--170, april 2011.
[20]
D. B. Loveman. Program improvement by source-to-source transformation. Journal of the ACM, 24:121--145, 1977.
[21]
J. Meng and K. Skadron. Performance modeling and automatic ghost zone optimization for iterative stencil loops on gpus. In Proceedings of the 23rd international conference on Supercomputing, ICS '09, pages 256--265, New York, NY, USA, 2009. ACM.
[22]
J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for nonshared memory machines. In Proceedings of the 1991 ACM/IEEE conference on Supercomputing, Supercomputing '91, pages 111--120, New York, NY, USA, 1991. ACM.
[23]
M. Ripeanu, A. Iamnitchi, and I. Foster. Cactus application: Performance predictions in grid environments. In Euro-Par 2001 Parallel Processing, volume 2150 of Lecture Notes in Computer Science, pages 807--816, 2001.
[24]
Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, PLDI '99, pages 215--228, New York, NY, USA, 1999. ACM.
[25]
Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pages 117--128, New York, NY, USA, 2011. ACM.
[26]
M. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452--471, oct 1991.
[27]
M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, PLDI '91, pages 30--44, New York, NY, USA, 1991. ACM.
[28]
M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, pages 357--361, Philadelphia, PA, USA, 1989. Society for Industrial and Applied Mathematics.
[29]
M. Wolfe. More iteration space tiling. In Proceedings of the 1989 ACM/IEEE conference on Supercomputing, Supercomputing '89, pages 655--664, New York, NY, USA, 1989. ACM.
[30]
D. Wonnacott. Time skewing for parallel computers. In Languages and Compilers for Parallel Computing, volume 1863, pages 477--480, 2000.

Cited By

View all
  • (2024)Bricks: A high-performance portability layer for computations on block-structured gridsThe International Journal of High Performance Computing Applications10.1177/1094342024126828838:6(549-567)Online publication date: 19-Aug-2024
  • (2024)StreamNet++: Memory-Efficient Streaming TinyML Model Compilation on MicrocontrollersACM Transactions on Embedded Computing Systems10.1145/370610724:2(1-26)Online publication date: 29-Nov-2024
  • (2024)High-Performance, Scalable Geometric Multigrid via Fine-Grain Data Blocking for GPUsProceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis10.1109/SCW63240.2024.00159(1177-1191)Online publication date: 17-Nov-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization
March 2012
285 pages
ISBN:9781450312066
DOI:10.1145/2259016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 March 2012

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. compiler optimization
  2. loop tiling and fusion
  3. stencil computation

Qualifiers

  • Research-article

Funding Sources

Conference

CGO '12

Acceptance Rates

CGO '12 Paper Acceptance Rate 26 of 90 submissions, 29%;
Overall Acceptance Rate 312 of 1,061 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)19
  • Downloads (Last 6 weeks)3
Reflects downloads up to 28 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Bricks: A high-performance portability layer for computations on block-structured gridsThe International Journal of High Performance Computing Applications10.1177/1094342024126828838:6(549-567)Online publication date: 19-Aug-2024
  • (2024)StreamNet++: Memory-Efficient Streaming TinyML Model Compilation on MicrocontrollersACM Transactions on Embedded Computing Systems10.1145/370610724:2(1-26)Online publication date: 29-Nov-2024
  • (2024)High-Performance, Scalable Geometric Multigrid via Fine-Grain Data Blocking for GPUsProceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis10.1109/SCW63240.2024.00159(1177-1191)Online publication date: 17-Nov-2024
  • (2023)Performance Portability Evaluation of Blocked Stencil Computations on GPUsProceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624177(1007-1018)Online publication date: 12-Nov-2023
  • (2023)DHTS: A Dynamic Hybrid Tiling Strategy for Optimizing Stencil Computation on GPUsIEEE Transactions on Computers10.1109/TC.2023.327106072:10(2795-2807)Online publication date: Oct-2023
  • (2022)A Methodology for Efficient Tile Size Selection for Affine Loop KernelsInternational Journal of Parallel Programming10.1007/s10766-022-00734-550:3-4(405-432)Online publication date: 23-May-2022
  • (2022)An Analytical Model for Loop Tiling TransformationEmbedded Computer Systems: Architectures, Modeling, and Simulation10.1007/978-3-031-04580-6_7(95-107)Online publication date: 27-Apr-2022
  • (2021)Tile size selection of affine programs for GPGPUs using polyhedral cross-compilationProceedings of the 35th ACM International Conference on Supercomputing10.1145/3447818.3460369(13-26)Online publication date: 3-Jun-2021
  • (2021)Architectural Adaptation and Performance-Energy Optimization for CFD Application on AMD EPYC RomeIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.307815332:12(2852-2866)Online publication date: 1-Dec-2021
  • (2021)Towards a domain-extensible compilerProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370337(27-38)Online publication date: 27-Feb-2021
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media