Revisiting split tiling for stencil computations in polyhedral compilation

The Journal of Supercomputing

Abstract

Complex tile shapes maximize parallelism and locality of stencil computations by enabling tile-wise concurrent start, i.e., all tiles along a particular tiling direction of the iteration space can be started concurrently. We study split tiling, a tiling technique that exploits tile-wise concurrent start at the expense of additional synchronizations, in the context of polyhedral compilation. Derived from classical parallelogram tiling, our approach first splits a parallelogram tile into multiple phases that can be executed simultaneously with those of the neighboring tiles. The technique then minimizes the number of synchronizations by merging boundary phases of consecutive tiles along the time-tiled direction. We implement our approach on top of a well-defined polyhedral representation, generating code for both CPUs and GPUs. The experimental results on a 16-core Intel Xeon Silver show that our approach can achieve an average improvement of 2
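To make the idea of phases with tile-wise concurrent start concrete, the following is a minimal sketch of classical two-phase split tiling for a 1D three-point Jacobi stencil. It is an illustration only, not the authors' algorithm or their generated code: it does not merge boundary phases across time tiles, and the function names, the averaging stencil, the fixed boundary values, and the `width`/`height` tile parameters are all assumptions made for this example. Within each time band, phase 1 runs shrinking (upright) trapezoids and phase 2 runs growing (inverted) trapezoids; the tiles inside one phase do not depend on each other, so they could run in parallel with one synchronization between phases.

```python
def jacobi_reference(a0, steps):
    """Untiled 3-point Jacobi with fixed boundary values (baseline)."""
    cur = list(a0)
    for _ in range(steps):
        nxt = list(cur)
        for x in range(1, len(cur) - 1):
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_split_tiled(a0, steps, width, height):
    """Two-phase split tiling sketch (hypothetical helper, not from the paper).

    Assumes tiles of `width` points exactly cover the interior and that
    width >= 2 * height, so trapezoids of neighboring tiles never overlap.
    """
    n = len(a0)
    assert width >= 2 * height and (n - 2) % width == 0
    buf = [list(a0), list(a0)]          # double buffer, indexed by t % 2
    bounds = list(range(1, n, width))   # tile boundaries: 1, 1+W, ..., n-1

    def update(t, x_lo, x_hi):
        # Compute time step t+1 for x in [x_lo, x_hi) from time step t.
        src, dst = buf[t % 2], buf[(t + 1) % 2]
        for x in range(x_lo, x_hi):
            dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0

    for t0 in range(0, steps, height):
        h = min(height, steps - t0)     # last band may be shorter
        # Phase 1: upright trapezoids shrink by one point per side per step.
        # Tiles are mutually independent (concurrent start).
        for lo, hi in zip(bounds, bounds[1:]):
            for s in range(h):
                update(t0 + s, lo + s, hi - s)
        # Synchronization point, then phase 2: inverted trapezoids grow
        # around each tile boundary and fill the gaps left by phase 1.
        for b in bounds:
            for s in range(h):
                update(t0 + s, max(1, b - s), min(n - 1, b + s))
    return buf[steps % 2]
```

Calling `jacobi_split_tiled(a0, steps, width, height)` on any input that satisfies the two assertions should give the same values as `jacobi_reference(a0, steps)`, since each point is computed by the same expression from the same inputs; only the order of the updates differs.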

Notes

  1. The code is available at https://github.com/yaozhujia/ppcg.

  2. The diamond tiling paper [3] reported execution times for code compiled with the ICC compiler, but we observe that the OpenMP code achieves better performance when compiled with GCC on our platform.

  3. https://sourceforge.net/projects/polybench

References

  1. Bastoul C (2004) Code generation in the polyhedral model is easier than you think. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT ’04, pp. 7–16. IEEE Computer Society, Washington, DC, USA. https://doi.org/10.1109/PACT.2004.11

  2. Bondhugula U, Bandishti V, Cohen A, Potron G, Vasilache N (2014) Tiling and optimizing time-iterated computations on periodic domains. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, pp. 39–50. ACM, New York, NY, USA. https://doi.org/10.1145/2628071.2628106

  3. Bondhugula U, Bandishti V, Pananilath I (2017) Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Trans Parall Distrib Syst 28(5):1285–1298. https://doi.org/10.1109/TPDS.2016.2615094

  4. Bondhugula U, Hartono A, Ramanujam J, Sadayappan P (2008) A practical automatic polyhedral parallelizer and locality optimizer. In: Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’08, pp. 101–113. ACM, New York, NY, USA. https://doi.org/10.1145/1375581.1375595

  5. Chen C (2012) Polyhedra scanning revisited. In: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’12, pp. 499–508. ACM, New York, NY, USA. https://doi.org/10.1145/2254064.2254123

  6. Christen M, Schenk O, Burkhart H (2011) Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In: 2011 IEEE International Parallel Distributed Processing Symposium, pp. 676–687. https://doi.org/10.1109/IPDPS.2011.70

  7. Datta K, Murphy M, Volkov V, Williams S, Carter J, Oliker L, Patterson D, Shalf J, Yelick K (2008) Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: SC’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pp. 1–12. IEEE

  8. Datta K, Murphy M, Volkov V, Williams S, Carter J, Oliker L, Patterson D, Shalf J, Yelick K (2008) Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: SC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–12. https://doi.org/10.1109/SC.2008.5222004

  9. Di P, Xue J, Hu C, Zhou J (2009) A cache-efficient parallel gauss-seidel solver with alternating tiling. In: 2009 15th International Conference on Parallel and Distributed Systems, pp. 244–251. https://doi.org/10.1109/ICPADS.2009.126

  10. Feautrier P (1991) Dataflow analysis of array and scalar references. Int J Parall Prog 20(1):23–53. https://doi.org/10.1007/BF01407931

  11. Feautrier P (1992) Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. Int J Parall Program 21(6):389–420

  12. Feautrier P, Lengauer C (2011) Polyhedron model. Encyclopedia of parallel computing 1:1581–1592

  13. Gardner M (1970) Mathematical games. Sci Am 222(6):132–140

  14. Grosser T (2014) A decoupled approach to high-level loop optimization: tile shapes, polyhedral building blocks and low-level compilers. Theses, Université Pierre et Marie Curie–Paris VI. https://tel.archives-ouvertes.fr/tel-01144563

  15. Grosser T, Cohen A, Holewinski J, Sadayappan P, Verdoolaege S (2014) Hybrid hexagonal/classical tiling for gpus. In: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’14, pp. 66:66–66:75. ACM, New York, NY, USA. https://doi.org/10.1145/2544137.2544160

  16. Grosser T, Cohen A, Kelly PHJ, Ramanujam J, Sadayappan P, Verdoolaege S (2013) Split tiling for gpus: Automatic parallelization using trapezoidal tiles. In: Proc. of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pp. 24–31. ACM, New York, NY, USA. https://doi.org/10.1145/2458523.2458526

  17. Grosser T, Groesslinger A, Lengauer C (2012) Polly-performing polyhedral optimizations on a low-level intermediate representation. Parall Process Lett 22(04):1250010

  18. Grosser T, Verdoolaege S, Cohen A (2015) Polyhedral ast generation is more than scanning polyhedra. ACM Trans. Program. Lang. Syst. 37(4), 12:1–12:50. https://doi.org/10.1145/2743016

  19. Grosser T, Verdoolaege S, Cohen A, Sadayappan P (2014) The relation between diamond tiling and hexagonal tiling. Parall Process Lett 24(03):1441002

  20. Hagedorn B, Stoltzfus L, Steuwer M, Gorlatch S, Dubach C (2018) High performance stencil code generation with lift. In: Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, pp. 100–112. ACM, New York, NY, USA. https://doi.org/10.1145/3168824

  21. Holewinski J, Pouchet LN, Sadayappan P (2012) High-performance code generation for stencil computations on gpu architectures. In: Proceedings of the 26th ACM International Conference on Supercomputing, ICS’12, pp. 311–320. ACM, New York, NY, USA. https://doi.org/10.1145/2304576.2304619

  22. Hull JC (2003) Options futures and other derivatives. Pearson Education India

  23. Irigoin F, Triolet R (1988) Supernode partitioning. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, POPL ’88, pp. 319–329. ACM, New York, NY, USA. https://doi.org/10.1145/73560.73588

  24. Kim D, Renganarayanan L, Rostron D, Rajopadhye S, Strout MM (2007) Multi-level tiling: M for the price of one. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC ’07, pp. 51:1–51:12. ACM, New York, NY, USA. https://doi.org/10.1145/1362622.1362691

  25. Krishnamoorthy S, Baskaran M, Bondhugula U, Ramanujam J, Rountev A, Sadayappan P (2007) Effective automatic parallelization of stencil computations. In: Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pp. 235–244. ACM, New York, NY, USA. https://doi.org/10.1145/1250734.1250761

  26. Lam MS, Wolf ME (2004) A data locality optimizing algorithm. SIGPLAN Not 39(4):442–459. https://doi.org/10.1145/989393.989437

  27. Mullapudi RT, Vasista V, Bondhugula U (2015) Polymage: Automatic optimization for image processing pipelines. In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, pp. 429–443. ACM, New York, NY, USA. https://doi.org/10.1145/2694344.2694364

  28. Pananilath I, Acharya A, Vasista V, Bondhugula U (2015) An optimizing code generator for a class of lattice-boltzmann computations. ACM Trans. Archit. Code Optim. 12(2), 14:1–14:23. https://doi.org/10.1145/2739047

  29. Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F, Amarasinghe S (2013) Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pp. 519–530. ACM, New York, NY, USA. https://doi.org/10.1145/2491956.2462176

  30. Rawat PS, Hong C, Ravishankar M, Grover V, Pouchet LN, Rountev A, Sadayappan P (2016) Resource conscious reuse-driven tiling for gpus. In: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT’16, pp. 99–111. ACM, New York, NY, USA. https://doi.org/10.1145/2967938.2967967

  31. Rawat PS, Rastello F, Sukumaran-Rajam A, Pouchet LN, Rountev A, Sadayappan P (2018) Register optimizations for stencils on gpus. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’18, pp. 168–182. ACM, New York, NY, USA. https://doi.org/10.1145/3178487.3178500

  32. Rawat PS, Vaidya M, Sukumaran-Rajam A, Ravishankar M, Grover V, Rountev A, Pouchet L, Sadayappan P (2018) Domain-specific optimization and generation of high-performance gpu code for stencil computations. Proc IEEE 106(11):1902–1920. https://doi.org/10.1109/JPROC.2018.2862896

  33. Renganarayana L, Harthikote-Matha M, Dewri R, Rajopadhye S (2007) Towards optimal multi-level tiling for stencil computations. In: 2007 IEEE International Parallel and Distributed Processing Symposium, pp. 1–10. https://doi.org/10.1109/IPDPS.2007.370291

  34. Roth G, Mellor-Crummey J, Kennedy K, Brickner RG (1997) Compiling stencils in high performance fortran. In: Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, SC’97, pp. 1–20. ACM, New York, NY, USA. https://doi.org/10.1145/509593.509605

  35. Shrestha S, Gao GR, Manzano J, Marquez A, Feo J (2015) Locality aware concurrent start for stencil applications. In: Proc. of the 13th Annual IEEE/ACM Intl. Symp. on Code Generation and Optimization, CGO ’15, pp. 157–166. IEEE CS, Washington, DC, USA

  36. Strzodka R, Shaheen M, Pajak D, Seidel H (2011) Cache accurate time skewing in iterative stencil computations. In: 2011 International Conference on Parallel Processing, pp. 571–581. https://doi.org/10.1109/ICPP.2011.47

  37. Tang Y, Chowdhury RA, Kuszmaul BC, Luk CK, Leiserson CE (2011) The pochoir stencil compiler. In: Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’11, pp. 117–128. ACM, New York, NY, USA. https://doi.org/10.1145/1989493.1989508

  38. Vasilache N, Zinenko O, Theodoridis T, Goyal P, Devito Z, Moses WS, Verdoolaege S, Adams A, Cohen A (2019) The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated gpu kernels, automatically. ACM Trans. Archit. Code Optim. 16(4). https://doi.org/10.1145/3355606

  39. Vasista V, Narasimhan K, Bhat S, Bondhugula U (2017) Optimizing geometric multigrid method computation using a dsl approach. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC’17. ACM, New York, NY, USA. https://doi.org/10.1145/3126908.3126968

  40. Verdoolaege S (2010) Isl: An integer set library for the polyhedral model. In: Proceedings of the Third International Congress Conference on Mathematical Software, ICMS’10, pp. 299–302. Springer-Verlag, Berlin, Heidelberg

  41. Verdoolaege S, Carlos Juega J, Cohen A, Ignacio Gómez J, Tenllado C, Catthoor F (2013) Polyhedral parallel code generation for cuda. ACM Trans Archit Code Optim 9(4), 54:1–54:23. https://doi.org/10.1145/2400682.2400713

  42. Verdoolaege S, Cohen A, Beletska A (2011) Transitive closures of affine integer tuple relations and their overapproximations. In: Proceedings of the 18th International Conference on Static Analysis, SAS’11, pp. 216–232. Springer, Berlin

  43. Wonnacott D (2000) Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In: Proceedings of the 14th International Symposium on Parallel and Distributed Processing, IPDPS’00, pp. 171–180. IEEE Computer Society, USA

  44. Wonnacott DG, Strout MM (2013) On the scalability of loop tiling techniques. IMPACT 2013:3

  45. Zhao J, Cohen A (2019) Flextended tiles: a flexible extension of overlapped tiles for polyhedral compilation. ACM Trans Archit Code Optim 16(4). https://doi.org/10.1145/3369382

  46. Zhao T, Basu P, Williams S, Hall M, Johansen H (2019) Exploiting reuse and vectorization in blocked stencil computations on cpus and gpus. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC’19. ACM, New York, NY, USA. https://doi.org/10.1145/3295500.3356210

  47. Zhou X, Giacalone JP, Garzarán MJ, Kuhn RH, Ni Y, Padua D (2012) Hierarchical overlapped tiling. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pp. 207–218. ACM, New York, NY, USA. https://doi.org/10.1145/2259016.2259044

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61702546, 61802434 and U20A20226.

Author information

Corresponding author

Correspondence to Yingying Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, Y., Sun, H. & Pang, J. Revisiting split tiling for stencil computations in polyhedral compilation. J Supercomput 78, 440–470 (2022). https://doi.org/10.1007/s11227-021-03835-z

  • DOI: https://doi.org/10.1007/s11227-021-03835-z
