Abstract
Complex tile shapes maximize parallelism and locality of stencil computations by enabling tile-wise concurrent start, i.e., all tiles along a particular tiling direction of the iteration space can be started concurrently. We study split tiling, a tiling technique that exploits tile-wise concurrent start at the expense of additional synchronizations, in the context of polyhedral compilation. Derived from classical parallelogram tiling, our approach first splits a parallelogram tile into multiple phases that can be executed simultaneously with those of the neighboring tiles. The technique then minimizes the number of synchronizations by merging the boundary phases of consecutive tiles along the time-tiled direction. We implement our approach on top of a well-defined polyhedral representation, generating code for both CPUs and GPUs. Experimental results on a 16-core Intel Xeon Silver show that our approach can achieve an average improvement of 2
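To make the idea of phases with tile-wise concurrent start concrete, the following is a minimal sketch of textbook two-phase split tiling on a 1D three-point Jacobi stencil with fixed boundary values. It is an illustration under stated assumptions, not the implementation described in the abstract: it does not derive the phases from parallelogram tiles, it does not merge boundary phases across time tiles, and the names `jacobi_reference`, `jacobi_split_tiled`, `width` and `height` are hypothetical. Tiles are visited one after another here; the point is that each tile in a phase runs all of its time steps without waiting for its neighbors, so the loops over tiles could run in parallel with one synchronization between the phases.

```python
def jacobi_reference(init, steps):
    """Plain time loop: all points of time t are computed before time t + 1."""
    cur = list(init)
    for _ in range(steps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_split_tiled(init, steps, width, height):
    """Two-phase split tiling (illustrative sketch, hypothetical parameters).

    width  : number of interior points per space tile
    height : number of time steps per time tile
    """
    n = len(init)
    # Simplifying assumptions: full tiles only, and tiles wide enough that
    # the phase-2 triangles of neighboring tile edges never overlap.
    assert (n - 2) % width == 0 and width >= 2 * height
    buf = [list(init), list(init)]  # buf[t % 2] holds the values of time t

    def update(t, lo, hi):
        # Compute time t on points lo..hi, clamped to the interior 1..n-2.
        src, dst = buf[(t - 1) % 2], buf[t % 2]
        for i in range(max(lo, 1), min(hi, n - 2) + 1):
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0

    for t0 in range(0, steps, height):
        h = min(height, steps - t0)  # the last time tile may be shorter
        # Phase 1: one shrinking trapezoid per tile. Each tile only reads
        # time-t0 values of its neighbors, so the tiles are independent.
        for lo in range(1, n - 1, width):
            hi = lo + width - 1
            for s in range(1, h + 1):
                update(t0 + s, lo + s - 1, hi - s + 1)
        # A parallel version would place a barrier here.
        # Phase 2: one growing triangle per tile edge p, filling the gaps
        # left between neighboring trapezoids. Edges are also independent.
        for p in range(1, n, width):
            for s in range(2, h + 1):
                update(t0 + s, p - s + 1, p + s - 2)
    return buf[steps % 2]
```

With two phases per time tile, this sketch needs two synchronizations for every `height` time steps; reducing that count is what merging boundary phases along the time dimension is meant to address.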
Notes
The code is available at https://github.com/yaozhujia/ppcg.
The diamond tiling paper [3] reported execution times for code compiled with the ICC compiler, but we observe that the OpenMP code achieves better performance when compiled with GCC on our platform.
https://sourceforge.net/projects/polybench
References
Bastoul C (2004) Code generation in the polyhedral model is easier than you think. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT ’04, pp. 7–16. IEEE Computer Society, Washington, DC, USA https://doi.org/10.1109/PACT.2004.11
Bondhugula U, Bandishti V, Cohen A, Potron G, Vasilache N (2014) Tiling and optimizing time-iterated computations on periodic domains. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, pp. 39–50. ACM, New York, NY, USA. https://doi.org/10.1145/2628071.2628106
Bondhugula U, Bandishti V, Pananilath I (2017) Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Trans Parall Distrib Syst 28(5):1285–1298. https://doi.org/10.1109/TPDS.2016.2615094
Bondhugula U, Hartono A, Ramanujam J, Sadayappan P (2008) A practical automatic polyhedral parallelizer and locality optimizer. In: Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’08, pp. 101–113. ACM, New York, NY, USA. https://doi.org/10.1145/1375581.1375595
Chen C (2012) Polyhedra scanning revisited. In: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’12, pp. 499–508. ACM, New York, NY, USA. https://doi.org/10.1145/2254064.2254123
Christen M, Schenk O, Burkhart H (2011) Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In: 2011 IEEE International Parallel Distributed Processing Symposium, pp. 676–687. https://doi.org/10.1109/IPDPS.2011.70
Datta K, Murphy M, Volkov V, Williams S, Carter J, Oliker L, Patterson D, Shalf J, Yelick K (2008) Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: SC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–12. https://doi.org/10.1109/SC.2008.5222004
Di P, Xue J, Hu C, Zhou J (2009) A cache-efficient parallel gauss-seidel solver with alternating tiling. In: 2009 15th International Conference on Parallel and Distributed Systems, pp. 244–251. https://doi.org/10.1109/ICPADS.2009.126
Feautrier P (1991) Dataflow analysis of array and scalar references. Int J Parall Prog 20(1):23–53. https://doi.org/10.1007/BF01407931
Feautrier P (1992) Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. Int J Parall Program 21(6):389–420
Feautrier P, Lengauer C (2011) Polyhedron model. Encyclopedia of parallel computing 1:1581–1592
Gardner M (1970) Mathematical games. Sci Am 222(6):132–140
Grosser T (2014) A decoupled approach to high-level loop optimization: tile shapes, polyhedral building blocks and low-level compilers. Theses, Université Pierre et Marie Curie–Paris VI. https://tel.archives-ouvertes.fr/tel-01144563
Grosser T, Cohen A, Holewinski J, Sadayappan P, Verdoolaege S (2014) Hybrid hexagonal/classical tiling for gpus. In: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’14, pp. 66:66–66:75. ACM, New York, NY, USA. https://doi.org/10.1145/2544137.2544160
Grosser T, Cohen A, Kelly PHJ, Ramanujam J, Sadayappan P, Verdoolaege S (2013) Split tiling for gpus: Automatic parallelization using trapezoidal tiles. In: Proc. of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pp. 24–31. ACM, New York, NY, USA https://doi.org/10.1145/2458523.2458526
Grosser T, Groesslinger A, Lengauer C (2012) Polly-performing polyhedral optimizations on a low-level intermediate representation. Parall Process Lett 22(04):1250010
Grosser T, Verdoolaege S, Cohen A (2015) Polyhedral ast generation is more than scanning polyhedra. ACM Trans. Program. Lang. Syst. 37(4), 12:1–12:50 https://doi.org/10.1145/2743016
Grosser T, Verdoolaege S, Cohen A, Sadayappan P (2014) The relation between diamond tiling and hexagonal tiling. Parall Process Lett 24(03):1441002
Hagedorn B, Stoltzfus L, Steuwer M, Gorlatch S, Dubach C (2018) High performance stencil code generation with lift. In: Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, pp. 100–112. ACM, New York, NY, USA. https://doi.org/10.1145/3168824
Holewinski J, Pouchet LN, Sadayappan P (2012) High-performance code generation for stencil computations on gpu architectures. In: Proceedings of the 26th ACM International Conference on Supercomputing, ICS’12, pp. 311–320. ACM, New York, NY, USA. https://doi.org/10.1145/2304576.2304619
Hull JC (2003) Options futures and other derivatives. Pearson Education India
Irigoin F, Triolet R (1988) Supernode partitioning. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, POPL ’88, pp. 319–329. ACM, New York, NY, USA https://doi.org/10.1145/73560.73588
Kim D, Renganarayanan L, Rostron D, Rajopadhye S, Strout MM (2007) Multi-level tiling: M for the price of one. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC ’07, pp. 51:1–51:12. ACM, New York, NY, USA https://doi.org/10.1145/1362622.1362691
Krishnamoorthy S, Baskaran M, Bondhugula U, Ramanujam J, Rountev A, Sadayappan P (2007) Effective automatic parallelization of stencil computations. In: Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pp. 235–244. ACM, New York, NY, USA. https://doi.org/10.1145/1250734.1250761
Lam MS, Wolf ME (2004) A data locality optimizing algorithm. SIGPLAN Not 39(4):442–459. https://doi.org/10.1145/989393.989437
Mullapudi RT, Vasista V, Bondhugula U (2015) Polymage: Automatic optimization for image processing pipelines. In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, pp. 429–443. ACM, New York, NY, USA https://doi.org/10.1145/2694344.2694364
Pananilath I, Acharya A, Vasista V, Bondhugula U (2015) An optimizing code generator for a class of lattice-boltzmann computations. ACM Trans. Archit. Code Optim. 12(2), 14:1–14:23 https://doi.org/10.1145/2739047
Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F, Amarasinghe S (2013) Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pp. 519–530. ACM, New York, NY, USA. https://doi.org/10.1145/2491956.2462176
Rawat PS, Hong C, Ravishankar M, Grover V, Pouchet LN, Rountev A, Sadayappan P (2016) Resource conscious reuse-driven tiling for gpus. In: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT’16, pp. 99–111. ACM, New York, NY, USA. https://doi.org/10.1145/2967938.2967967
Rawat PS, Rastello F, Sukumaran-Rajam A, Pouchet LN, Rountev A, Sadayappan P (2018) Register optimizations for stencils on gpus. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’18, pp. 168–182. ACM, New York, NY, USA. https://doi.org/10.1145/3178487.3178500
Rawat PS, Vaidya M, Sukumaran-Rajam A, Ravishankar M, Grover V, Rountev A, Pouchet L, Sadayappan P (2018) Domain-specific optimization and generation of high-performance gpu code for stencil computations. Proc IEEE 106(11):1902–1920. https://doi.org/10.1109/JPROC.2018.2862896
Renganarayana L, Harthikote-Matha M, Dewri R, Rajopadhye S (2007) Towards optimal multi-level tiling for stencil computations. In: 2007 IEEE International Parallel and Distributed Processing Symposium, pp. 1–10. https://doi.org/10.1109/IPDPS.2007.370291
Roth G, Mellor-Crummey J, Kennedy K, Brickner RG (1997) Compiling stencils in high performance fortran. In: Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, SC’97, pp. 1–20. ACM, New York, NY, USA https://doi.org/10.1145/509593.509605
Shrestha S, Gao GR, Manzano J, Marquez A, Feo J (2015) Locality aware concurrent start for stencil applications. In: Proc. of the 13th Annual IEEE/ACM Intl. Symp. on Code Generation and Optimization, CGO ’15, pp. 157–166. IEEE CS, Washington, DC, USA
Strzodka R, Shaheen M, Pajak D, Seidel H (2011) Cache accurate time skewing in iterative stencil computations. In: 2011 International Conference on Parallel Processing, pp. 571–581 https://doi.org/10.1109/ICPP.2011.47
Tang Y, Chowdhury RA, Kuszmaul BC, Luk CK, Leiserson CE (2011) The pochoir stencil compiler. In: Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’11, pp. 117–128. ACM, New York, NY, USA https://doi.org/10.1145/1989493.1989508
Vasilache N, Zinenko O, Theodoridis T, Goyal P, Devito Z, Moses WS, Verdoolaege S, Adams A, Cohen A (2019) The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated gpu kernels, automatically. ACM Trans. Archit. Code Optim. 16(4). https://doi.org/10.1145/3355606
Vasista V, Narasimhan K, Bhat S, Bondhugula U (2017) Optimizing geometric multigrid method computation using a dsl approach. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC’17. ACM, New York, NY, USA. https://doi.org/10.1145/3126908.3126968
Verdoolaege S (2010) Isl: An integer set library for the polyhedral model. In: Proceedings of the Third International Congress Conference on Mathematical Software, ICMS’10, pp. 299–302. Springer-Verlag, Berlin, Heidelberg
Verdoolaege S, Carlos Juega J, Cohen A, Ignacio Gómez J, Tenllado C, Catthoor F (2013) Polyhedral parallel code generation for cuda. ACM Trans Archit Code Optim 9(4), 54:1–54:23 https://doi.org/10.1145/2400682.2400713
Verdoolaege S, Cohen A, Beletska A (2011) Transitive closures of affine integer tuple relations and their overapproximations. In: Proceedings of the 18th International Conference on Static Analysis, SAS’11, pp. 216–232. Springer, Berlin
Wonnacott D (2000) Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In: Proceedings of the 14th International Symposium on Parallel and Distributed Processing, IPDPS’00, pp. 171–180. IEEE Computer Society, USA
Wonnacott DG, Strout MM (2013) On the scalability of loop tiling techniques. IMPACT 2013:3
Zhao J, Cohen A (2019) Flextended tiles: a flexible extension of overlapped tiles for polyhedral compilation. ACM Trans Archit Code Optim 16(4) https://doi.org/10.1145/3369382
Zhao T, Basu P, Williams S, Hall M, Johansen H (2019) Exploiting reuse and vectorization in blocked stencil computations on cpus and gpus. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC’19. ACM, New York, NY, USA https://doi.org/10.1145/3295500.3356210
Zhou X, Giacalone JP, Garzarán MJ, Kuhn RH, Ni Y, Padua D (2012) Hierarchical overlapped tiling. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pp. 207–218. ACM, New York, NY, USA https://doi.org/10.1145/2259016.2259044
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant Nos. 61702546, 61802434 and U20A20226.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, Y., Sun, H. & Pang, J. Revisiting split tiling for stencil computations in polyhedral compilation. J Supercomput 78, 440–470 (2022). https://doi.org/10.1007/s11227-021-03835-z
DOI: https://doi.org/10.1007/s11227-021-03835-z