
An Autotuning Protocol to Rapidly Build Autotuners

Published: 04 January 2019

Abstract

Automatic performance tuning (autotuning) is an increasingly critical technique for achieving high, portable performance in exascale applications. However, constructing an autotuner from scratch remains challenging, even for domain experts. In this work, we propose a performance tuning and knowledge management suite (PAK) to help rapidly build autotuners. To accommodate existing autotuning techniques, we present an autotuning protocol composed of five components: an extractor, a producer, an optimizer, an evaluator, and a learner. To achieve modularity and reusability, we define programming interfaces for each protocol component as the fundamental infrastructure; these interfaces also provide a customizable mechanism for deploying knowledge mining on the performance database. PAK’s usability is demonstrated by studying two important computational kernels: stencil computation and sparse matrix-vector multiplication (SpMV). Our proposed autotuner built on PAK achieves performance comparable to traditional autotuners with higher productivity, requiring only a few tens of lines of code written against our autotuning protocol.
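To make the protocol’s division of labor concrete, the sketch below wires the five components together around a toy kernel. It is a minimal illustration only: the class names, method signatures, parameter names, and the exhaustive grid search are assumptions for this sketch, not PAK’s actual interfaces, which this page does not show.

```python
# Minimal sketch of the five-component autotuning protocol from the
# abstract (extractor, producer, optimizer, evaluator, learner).
# All names and signatures are illustrative assumptions, NOT PAK's
# real API; the optimizer is a plain exhaustive grid search.

import itertools
import time


class Extractor:
    """Extracts the tunable parameter space from a kernel."""
    def extract(self, kernel):
        # Assumed example: tile sizes for a stencil-like kernel.
        return {"block_x": [8, 16, 32], "block_y": [4, 8, 16]}


class Producer:
    """Produces a runnable code variant for one configuration."""
    def produce(self, kernel, config):
        return lambda: kernel(**config)  # stand-in for code generation


class Evaluator:
    """Runs a variant and measures its execution time."""
    def evaluate(self, variant):
        start = time.perf_counter()
        variant()
        return time.perf_counter() - start


class Learner:
    """Accumulates (configuration, runtime) pairs -- the performance
    database that knowledge mining would later operate on."""
    def __init__(self):
        self.database = []

    def record(self, config, runtime):
        self.database.append((config, runtime))


class Optimizer:
    """Searches the configuration space (exhaustive for brevity)."""
    def search(self, space, objective):
        keys = list(space)
        candidates = (dict(zip(keys, values))
                      for values in itertools.product(*space.values()))
        return min(candidates, key=objective)


def autotune(kernel):
    """Drives one tuning session by composing the five components."""
    extractor, producer = Extractor(), Producer()
    evaluator, learner, optimizer = Evaluator(), Learner(), Optimizer()
    space = extractor.extract(kernel)

    def objective(config):
        runtime = evaluator.evaluate(producer.produce(kernel, config))
        learner.record(config, runtime)  # feed the knowledge database
        return runtime

    return optimizer.search(space, objective)


if __name__ == "__main__":
    def toy_kernel(block_x, block_y):
        # Placeholder workload; a real producer would emit a tiled
        # stencil or SpMV variant instead.
        sum(i % block_y for i in range(block_x * 10000))

    print("best configuration:", autotune(toy_kernel))
```

Separating search (optimizer) from measurement (evaluator) and persistence (learner) is what allows individual components to be swapped, for example replacing the grid search with a model-driven optimizer, without touching the rest of the pipeline.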


Cited By

  • (2023) POSNet: A hybrid deep learning model for efficient person re-identification. The Journal of Supercomputing 79, 12, 13090-13118. https://doi.org/10.1007/s11227-023-05169-4
  • (2020) CodeSeer. In Proceedings of the 34th ACM International Conference on Supercomputing, 1-11. https://doi.org/10.1145/3392717.3392741
  • (2019) Register-Aware Optimizations for Parallel Sparse Matrix-Matrix Multiplication. International Journal of Parallel Programming 47, 3, 403-417. https://doi.org/10.1007/s10766-018-0604-8


Published In

ACM Transactions on Parallel Computing, Volume 5, Issue 2
June 2018
113 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3299751
Editor: David Bader

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 January 2019
Accepted: 01 June 2018
Revised: 01 January 2018
Received: 01 January 2017
Published in TOPC Volume 5, Issue 2


Author Tags

  1. Autotuner
  2. SpMV
  3. knowledge database
  4. protocol
  5. stencil

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • National Key Research and Development Program of China


Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1

Reflects downloads up to 03 Mar 2025

