DOI: 10.1145/2792745.2792783
research-article

Autotuning OpenACC work distribution via direct search

Published: 26 July 2015

Abstract

OpenACC provides a high-productivity API for programming GPUs and similar accelerator devices. One of the last steps in tuning OpenACC programs is selecting values for the num_gangs and vector_length clauses, which control how a parallel workload is distributed to an accelerator's processing units. In this paper, we present OptACC, an autotuner that can assist the programmer in selecting high-quality values for these parameters, and we evaluate the effectiveness of two direct search methods in finding solutions. We assess the quality of the num_gangs and vector_length values found by our autotuner by comparing them to the values found by a bounded exhaustive search; we also compare the kernel execution times to those of the untuned kernel. On a suite of 36 OpenACC kernels, one or both of our autotuner's direct search methods identified values within the top 5% for 29 of the kernels, within the top 10% for five kernels, and within the top 25% for the remaining two. Eleven of the kernels achieved a speedup greater than 2x over the compiler's defaults, and the autotuner required only 7 to 11 runs of the target program, on average.
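As an illustration of the kind of search the abstract describes, the following is a minimal sketch of a generic compass (coordinate) search over power-of-two values of num_gangs and vector_length. It is not OptACC's code and not necessarily either of the two direct search methods the paper evaluates. The names `compass_search` and `synthetic_time` are hypothetical, and `synthetic_time` is an invented stand-in for a measured kernel time; a real tuner would build and run the target program for each point.

```python
from math import log2
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]  # (num_gangs, vector_length)


def compass_search(
    cost: Callable[[int, int], float],
    start: Point,
    lower: Point,
    upper: Point,
    max_evals: int = 20,
) -> Tuple[Point, float, int]:
    """Minimize cost over a bounded grid of powers of two.

    Each step doubles or halves one parameter. The search moves to the
    best improving neighbor and stops when no neighbor improves or the
    evaluation budget is spent.
    """
    cache: Dict[Point, float] = {}

    def evaluate(p: Point) -> float:
        # Each new point stands for one timed run of the target program.
        if p not in cache:
            cache[p] = cost(*p)
        return cache[p]

    def in_bounds(p: Point) -> bool:
        return all(lo <= v <= hi for v, lo, hi in zip(p, lower, upper))

    best = start
    best_cost = evaluate(best)
    while len(cache) < max_evals:
        g, v = best
        neighbors = [(g * 2, v), (g // 2, v), (g, v * 2), (g, v // 2)]
        improved = False
        for p in filter(in_bounds, neighbors):
            if len(cache) >= max_evals and p not in cache:
                break  # budget spent: do not run the program again
            c = evaluate(p)
            if c < best_cost:
                best, best_cost, improved = p, c, True
        if not improved:
            break  # local minimum on the grid
    return best, best_cost, len(cache)


def synthetic_time(num_gangs: int, vector_length: int) -> float:
    # Hypothetical cost surface (assumption), minimized at
    # num_gangs=256, vector_length=128.
    return 1.0 + (log2(num_gangs) - 8) ** 2 + (log2(vector_length) - 7) ** 2


if __name__ == "__main__":
    best, t, runs = compass_search(
        synthetic_time, start=(64, 32), lower=(1, 1), upper=(1024, 1024)
    )
    print(best, t, runs)
```

Caching evaluated points matters here because each evaluation corresponds to a full run of the target program, which is the cost the abstract counts when it reports the number of runs.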



Published In

XSEDE '15: Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure
July 2015
296 pages
ISBN:9781450337205
DOI:10.1145/2792745
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • San Diego Super Computing Ctr
  • HPCWire
  • Omnibond: Omnibond Systems, LLC
  • SGI
  • Internet2
  • Indiana University
  • CASC: The Coalition for Academic Scientific Computation
  • NICS: National Institute for Computational Sciences
  • Intel
  • DDN: DataDirect Networks, Inc
  • DELL
  • CORSA: CORSA Technology
  • ALLINEA: Allinea Software
  • Cray
  • RENCI: Renaissance Computing Institute

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPUs
  2. OpenACC
  3. XSEDE
  4. accelerators
  5. autotuning

Qualifiers

  • Research-article

Conference

XSEDE '15

Acceptance Rates

XSEDE '15 Paper Acceptance Rate 49 of 70 submissions, 70%;
Overall Acceptance Rate 129 of 190 submissions, 68%

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 0

Reflects downloads up to 07 Mar 2025

Cited By

  • (2024) ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1979-1990. DOI: 10.1109/SCW63240.2024.00247. Online publication date: 17-Nov-2024
  • (2021) JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 182-191. DOI: 10.1109/HiPC53243.2021.00032. Online publication date: Dec-2021
  • (2016) Utilization and Expansion of ppOpen-AT for OpenACC. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1496-1505. DOI: 10.1109/IPDPSW.2016.123. Online publication date: May-2016
  • (2016) An Analytical Model-Based Auto-tuning Framework for Locality-Aware Loop Scheduling. High Performance Computing, pp. 3-20. DOI: 10.1007/978-3-319-41321-1_1. Online publication date: 15-Jun-2016
