ABSTRACT
Tiling is a widely used loop transformation for exposing and exploiting parallelism and data locality. Effective use of tiling requires selecting and tuning the tile sizes. This is usually achieved by hand-crafting tile size selection (TSS) models that characterize the performance of the tiled program as a function of tile sizes. The best tile sizes are then selected either directly from the TSS model or by combining the TSS model with an empirical search. Hand-crafting accurate TSS models is hard, and adapting them to a different architecture or compiler, or even keeping them up to date as a single compiler evolves, is often just as hard. Instead of hand-crafting TSS models, can we automatically learn or create them? In this paper, we show that for a specific class of programs, fairly accurate TSS models can be created automatically using a combination of simple program features, synthetic kernels, and standard machine learning techniques. The automatic TSS model generation scheme can also be used directly to adapt a model or keep it up to date. We evaluate our scheme on six architecture-compiler combinations (chosen from three architectures and four compilers). The models learned by our method consistently deliver near-optimal performance (within 5% of the optimal on average) across all architecture-compiler combinations.
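To make the terms above concrete, the following is a minimal sketch of loop tiling and of a purely empirical tile size search, using matrix multiplication as the example. It is an illustration only and is not the learned TSS model described in the paper. The function names (`matmul_naive`, `matmul_tiled`, `empirical_search`) and the use of a single square tile size are assumptions made for this sketch.

```python
import time


def matmul_naive(a, b):
    """Untiled reference: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c


def matmul_tiled(a, b, tile):
    """Same computation, with all three loops tiled by `tile`.

    The outer loops (ii, kk, jj) walk over tiles; the inner loops
    (i, k, j) walk over the points inside one tile. `min` handles
    partial tiles at the matrix edges.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


def empirical_search(a, b, candidates):
    """Hypothetical helper: time each candidate tile size, return the fastest.

    This is the brute-force alternative to a TSS model: it needs one
    full run per candidate instead of a prediction.
    """
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        start = time.perf_counter()
        matmul_tiled(a, b, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile
```

In pure Python, interpreter overhead dominates cache effects, so the timings from `empirical_search` say little about real tile size behavior; in a compiled language the same loop structure is where tile size has a measurable effect on performance.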
Index Terms
- Automatic creation of tile size selection models