
An Autotuning Protocol to Rapidly Build Autotuners

Published: 04 January 2019

Abstract

Automatic performance tuning (autotuning) is an increasingly critical technique for achieving high, portable performance in exascale applications. However, constructing an autotuner from scratch remains challenging, even for domain experts. In this work, we propose a performance tuning and knowledge management suite (PAK) to help rapidly build autotuners. To accommodate existing autotuning techniques, we present an autotuning protocol composed of five components: an extractor, a producer, an optimizer, an evaluator, and a learner. To achieve modularity and reusability, we define programming interfaces for each protocol component as the fundamental infrastructure; these interfaces also provide a customizable mechanism for deploying knowledge mining on the performance database. PAK’s usability is demonstrated by studying two important computational kernels: stencil computation and sparse matrix-vector multiplication (SpMV). Our proposed autotuner built on PAK achieves performance comparable to traditional autotuners with higher productivity, requiring only a few tens of lines of code written against our autotuning protocol.
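To make the protocol’s division of labor concrete, the sketch below wires the five components together around a toy kernel. It is a minimal illustration only: the class names, method signatures, parameter names, and the exhaustive grid search are assumptions for this sketch, not PAK’s actual interfaces, which this page does not show.

```python
# Minimal sketch of the five-component autotuning protocol from the
# abstract (extractor, producer, optimizer, evaluator, learner).
# All names and signatures are illustrative assumptions, NOT PAK's
# real API; the optimizer is a plain exhaustive grid search.

import itertools
import time


class Extractor:
    """Extracts the tunable parameter space from a kernel."""
    def extract(self, kernel):
        # Assumed example: tile sizes for a stencil-like kernel.
        return {"block_x": [8, 16, 32], "block_y": [4, 8, 16]}


class Producer:
    """Produces a runnable code variant for one configuration."""
    def produce(self, kernel, config):
        return lambda: kernel(**config)  # stand-in for code generation


class Evaluator:
    """Runs a variant and measures its execution time."""
    def evaluate(self, variant):
        start = time.perf_counter()
        variant()
        return time.perf_counter() - start


class Learner:
    """Accumulates (configuration, runtime) pairs -- the performance
    database that knowledge mining would later operate on."""
    def __init__(self):
        self.database = []

    def record(self, config, runtime):
        self.database.append((config, runtime))


class Optimizer:
    """Searches the configuration space (exhaustive for brevity)."""
    def search(self, space, objective):
        keys = list(space)
        candidates = (dict(zip(keys, values))
                      for values in itertools.product(*space.values()))
        return min(candidates, key=objective)


def autotune(kernel):
    """Drives one tuning session by composing the five components."""
    extractor, producer = Extractor(), Producer()
    evaluator, learner, optimizer = Evaluator(), Learner(), Optimizer()
    space = extractor.extract(kernel)

    def objective(config):
        runtime = evaluator.evaluate(producer.produce(kernel, config))
        learner.record(config, runtime)  # feed the knowledge database
        return runtime

    return optimizer.search(space, objective)


if __name__ == "__main__":
    def toy_kernel(block_x, block_y):
        # Placeholder workload; a real producer would emit a tiled
        # stencil or SpMV variant instead.
        sum(i % block_y for i in range(block_x * 10000))

    print("best configuration:", autotune(toy_kernel))
```

Separating search (optimizer) from measurement (evaluator) and persistence (learner) is what allows individual components to be swapped, for example replacing the grid search with a model-driven optimizer, without touching the rest of the pipeline.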


Cited By

  • (2023) POSNet: A hybrid deep learning model for efficient person re-identification. The Journal of Supercomputing 79, 12, 13090-13118. https://doi.org/10.1007/s11227-023-05169-4
  • (2020) CodeSeer. In Proceedings of the 34th ACM International Conference on Supercomputing, 1-11. https://doi.org/10.1145/3392717.3392741
  • (2019) Register-Aware Optimizations for Parallel Sparse Matrix-Matrix Multiplication. International Journal of Parallel Programming 47, 3, 403-417. https://doi.org/10.1007/s10766-018-0604-8


Published In

ACM Transactions on Parallel Computing, Volume 5, Issue 2
June 2018
113 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3299751
Editor: David Bader

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 January 2019
Accepted: 01 June 2018
Revised: 01 January 2018
Received: 01 January 2017
Published in TOPC Volume 5, Issue 2


Author Tags

  1. Autotuner
  2. SpMV
  3. knowledge database
  4. protocol
  5. stencil

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • National Key Research and Development Program of China


Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1

Reflects downloads up to 03 Mar 2025

