
Loop Transformation Recipes for Code Generation and Auto-Tuning

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5898)

Abstract

In this paper, we describe transformation recipes, which provide a high-level interface to the code transformation and code generation capability of a compiler. These recipes can be generated by compiler decision algorithms or savvy software developers. This interface is part of an auto-tuning framework that explores a set of different implementations of the same computation and automatically selects the best-performing implementation. Along with the original computation, a transformation recipe specifies a range of implementations of the computation resulting from composing a set of high-level code transformations. In our system, an underlying polyhedral framework coupled with transformation algorithms takes this set of transformations, composes them and automatically generates correct code. We first describe an abstract interface for transformation recipes, which we propose to facilitate interoperability with other transformation frameworks. We then focus on the specific transformation recipe interface used in our compiler and present performance results on its application to kernel and library tuning and tuning of key computations in high-end applications. We also show how this framework can be used to generate and auto-tune parallel OpenMP or CUDA code from a high-level specification.
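To make the auto-tuning idea concrete, below is a minimal Python sketch of the search-and-select loop the abstract describes: a "recipe" here is reduced to a search space of transformation parameters (a tile size for one loop), each point in the space yields a different implementation of the same computation, and the tuner times the variants and keeps the fastest. The recipe representation, the matvec_tiled variant, and the parameter names are illustrative assumptions, not the CHiLL recipe interface or the polyhedral code generator described in the paper.

    import itertools
    import time

    N = 512
    A = [[float(i + j) for j in range(N)] for i in range(N)]
    x = [1.0] * N

    def matvec_tiled(A, x, tile):
        # Matrix-vector product with the i-loop tiled by `tile`; each tile
        # size is one implementation variant of the same computation.
        n = len(A)
        y = [0.0] * n
        for ii in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                row = A[i]
                s = 0.0
                for j in range(n):
                    s += row[j] * x[j]
                y[i] = s
        return y

    # The "recipe" here is just the parameter space the tuner explores
    # (hypothetical; the paper's recipes name composed loop transformations).
    recipe_space = {"tile": [16, 32, 64, 128, 256]}

    best = None
    for (tile,) in itertools.product(*recipe_space.values()):
        start = time.perf_counter()
        matvec_tiled(A, x, tile)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[1]:
            best = (tile, elapsed)

    print("best tile size: %d  (%.1f ms)" % (best[0], best[1] * 1000.0))

In the framework the paper describes, each point in the space would instead be handed to the underlying polyhedral code generator, compiled, and run on the target (for example as OpenMP or CUDA code), but the explore-measure-select structure is the same.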






Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hall, M., Chame, J., Chen, C., Shin, J., Rudy, G., Khan, M.M. (2010). Loop Transformation Recipes for Code Generation and Auto-Tuning. In: Gao, G.R., Pollock, L.L., Cavazos, J., Li, X. (eds) Languages and Compilers for Parallel Computing. LCPC 2009. Lecture Notes in Computer Science, vol 5898. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13374-9_4


  • DOI: https://doi.org/10.1007/978-3-642-13374-9_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13373-2

  • Online ISBN: 978-3-642-13374-9

  • eBook Packages: Computer Science, Computer Science (R0)
