ABSTRACT
Autotuning is a popular technique for ensuring performance portability of important algorithms such as BLAS, FFT and DNN across the ever-evolving software and hardware stack. Unfortunately, when performed on a single machine, autotuning can explore only a tiny fraction of the ever-growing, non-linear optimization spaces and can thus easily miss optimal solutions. We propose to solve this problem in practice with the help of the community, using the open-source Collective Knowledge framework (CK). We have customized the universal multi-objective autotuning engine of CK to optimize the local work size and other parameters of OpenCL workloads across diverse inputs and devices. Optimal solutions (with speed-ups of up to 20x and energy savings of up to 30% over the default configurations) are preserved in the open repository of optimization knowledge at http://cknowledge.org/repo.
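The core idea of tuning the local work size can be illustrated with a minimal sketch (this is not CK's actual engine; the functions `candidate_local_sizes`, `autotune`, and the toy cost model are hypothetical stand-ins for a real OpenCL kernel timing loop):

```python
import itertools
import random

def candidate_local_sizes(global_size, max_wg=256):
    # Enumerate 2-D local work sizes (lx, ly) that evenly divide the
    # global work size and fit a device's work-group size limit.
    gx, gy = global_size
    divs = lambda n: [d for d in range(1, n + 1) if n % d == 0]
    return [(lx, ly)
            for lx, ly in itertools.product(divs(gx), divs(gy))
            if lx * ly <= max_wg]

def autotune(global_size, measure, trials=None):
    # Search the candidate space, keeping the configuration with the
    # lowest measured cost (e.g. run time or energy). A real tuner
    # would sample the space when exhaustive search is infeasible.
    cands = candidate_local_sizes(global_size)
    if trials is not None:
        cands = random.sample(cands, min(trials, len(cands)))
    return min(cands, key=measure)

# Toy cost model standing in for a real OpenCL kernel launch:
# it rewards full, square-ish work-groups.
cost = lambda ls: 1.0 / (ls[0] * ls[1]) + 0.01 * abs(ls[0] - ls[1])
best = autotune((1024, 1024), cost)
```

In a real setting, `measure` would compile and launch the kernel via an OpenCL runtime and return the observed time or energy, and the best result for each (workload, input, device) triple would be shared to the public repository.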