Efficient hierarchical online-autotuning: a case study on polyhedral accelerator mapping

Authors:

Tobias Grosser,

Martin Tillmann

ICS '19: Proceedings of the ACM International Conference on Supercomputing

Pages 354 - 366

https://doi.org/10.1145/3330345.3330377

Published: 26 June 2019

Abstract

Identifying the (near) optimal program variants that an optimizing and parallelizing compiler should generate is known to be difficult. Autotuning is the best solution to navigate the often high-dimensional space of possible options. However, to be practical, an autotuner should (a) have high convergence speed and (b) be robust in the face of varying inputs. Current techniques for offline tuning, where convergence speed is less important, provide solutions only for known inputs, whereas online tuning can be input sensitive but currently lacks convergence speed. In this paper, we present hierarchical online-autotuning, a novel technique that exploits structure in the search space and the underlying tuning problem to increase convergence speed during online tuning. By modeling symmetries and redundancies in configurations and by exploiting domain knowledge to predict performance, we reduce the search space size by orders of magnitude. Combining our tuner with a polyhedral parallelizing compiler for GPUs, we show that the performance of a GEMM GPU kernel generated with default parameters is increased by 6× and that the convergence speed of the tuning process is increased by a factor of up to 1.7 compared to OpenTuner. With hierarchical tuning we make the deployment of always-on online-autotuning practical.
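The abstract does not spell out how redundant configurations are modeled, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical two-dimensional tile-size search space, assumes that any tile size at or above the problem extent behaves like a tile of exactly that extent (a redundancy), and assumes a limit of 1024 on the product of the two tile sizes as a stand-in for domain knowledge. All names and numbers (`TILE_CANDIDATES`, `PROBLEM_EXTENT`, `MAX_TILE_ELEMENTS`) are invented for illustration.

```python
from itertools import product

# Hypothetical candidate tile sizes for each of two loop dimensions.
TILE_CANDIDATES = [8, 16, 32, 64, 128, 256]
# Hypothetical problem extent per dimension (assumption, not from the paper).
PROBLEM_EXTENT = 64
# Hypothetical upper bound on tile_x * tile_y, standing in for domain knowledge.
MAX_TILE_ELEMENTS = 1024


def canonical(config, extent):
    """Clamp every tile size to the problem extent.

    Assumption: a tile at least as large as the extent covers the whole
    dimension, so all such sizes are treated as one configuration.
    """
    return tuple(min(tile, extent) for tile in config)


def reduced_space(candidates, extent, max_elements):
    """Return one representative per equivalence class that passes the filter."""
    seen = set()
    representatives = []
    for config in product(candidates, repeat=2):
        rep = canonical(config, extent)
        # Skip configurations ruled out up front, without measuring them.
        if rep[0] * rep[1] > max_elements:
            continue
        if rep not in seen:
            seen.add(rep)
            representatives.append(rep)
    return representatives


full = list(product(TILE_CANDIDATES, repeat=2))
reduced = reduced_space(TILE_CANDIDATES, PROBLEM_EXTENT, MAX_TILE_ELEMENTS)
print(f"full space: {len(full)} configurations, reduced: {len(reduced)}")
```

In this toy setting the tuner only has to measure the representatives in `reduced`; a real tuner would derive the equivalences and constraints from the compiler and the target hardware rather than from hard-coded constants.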

Index Terms
1. Software and its engineering
  1. Software notations and tools

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing

June 2019

533 pages

ISBN:9781450360791

DOI:10.1145/3330345

General Chair: Rudolf Eigenmann (University of Delaware)

Program Chairs: Chen Ding (University of Rochester), Sally A. McKee (Clemson University)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Deutsche Forschungsgemeinschaft

Conference

ICS '19

Sponsor:

SIGARCH

ICS '19: 2019 International Conference on Supercomputing

June 26 - 28, 2019

Phoenix, Arizona

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
