DOI: 10.1145/3330345.3330377

Efficient hierarchical online-autotuning: a case study on polyhedral accelerator mapping

Published: 26 June 2019

Abstract

Identifying the (near) optimal program variants an optimizing and parallelizing compiler should generate is known to be difficult. Autotuning is the best solution to navigate the often high-dimensional space of possible options. However, to be practical, an autotuner should (a) have high convergence speed and (b) be robust in the face of varying inputs. Current techniques for offline tuning, where convergence speed is less important, provide solutions only for known inputs, whereas online tuning can be input sensitive but currently lacks convergence speed. In this paper, we present hierarchical online-autotuning, a novel technique to exploit structure in the search space and the underlying tuning problem to increase convergence speed during online tuning. By modeling symmetries and redundancies in configurations and by exploiting domain knowledge to predict performance, we reduce the search space size by orders of magnitude. Combining our tuner with a polyhedral parallelizing compiler for GPUs, we show that the performance of a GEMM GPU kernel generated with default parameters is increased by 6× and that the convergence speed of the tuning process is increased by a factor of up to 1.7 compared to OpenTuner. With hierarchical tuning we make the deployment of always-on online-autotuning practical.
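
The idea of shrinking a search space by removing redundant configurations and filtering with domain knowledge can be illustrated with a small sketch. This is not the tuner described in the paper. It is a toy model under stated assumptions: a two-dimensional GPU thread-block shape is the only tuning parameter, block extents larger than the problem extent are treated as equivalent to the problem extent, and the limits `MAX_THREADS_PER_BLOCK` and `WARP_SIZE` are assumed values typical of CUDA hardware. All function names are hypothetical.

```python
from itertools import product

# Assumed hardware limits (typical for CUDA GPUs, not taken from the paper).
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32


def canonical(config, problem_size):
    """Map a raw block shape to a canonical representative.

    In this toy model a block extent larger than the problem extent
    behaves like the problem extent itself, so it is clamped. Several
    raw configurations therefore collapse onto one representative.
    """
    bx, by = config
    n, m = problem_size
    return (min(bx, n), min(by, m))


def is_plausible(config):
    """Domain-knowledge filter: the block fits the thread limit and
    consists of whole warps."""
    bx, by = config
    threads = bx * by
    return threads <= MAX_THREADS_PER_BLOCK and threads % WARP_SIZE == 0


def reduced_space(candidates, problem_size):
    """Yield each plausible canonical configuration exactly once."""
    seen = set()
    for cfg in candidates:
        rep = canonical(cfg, problem_size)
        if rep in seen or not is_plausible(rep):
            continue
        seen.add(rep)
        yield rep


# Raw space: all pairs of power-of-two extents up to 1024.
sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
raw = list(product(sizes, repeat=2))
reduced = list(reduced_space(raw, problem_size=(64, 64)))
print(len(raw), "raw configurations ->", len(reduced), "after reduction")
```

Even in this small model, most raw configurations never need to be measured, which is the property an online tuner needs in order to converge quickly. A real tuner would use problem-specific equivalences and performance predictions in place of the two simple rules assumed here.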

    Published In

    ICS '19: Proceedings of the ACM International Conference on Supercomputing
    June 2019
    533 pages
    ISBN:9781450360791
    DOI:10.1145/3330345

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPGPU
    2. online-autotuning
    3. performance optimization
    4. polyhedral compilation

    Qualifiers

    • Research-article

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%
