ABSTRACT
Dividing work between the processor and the accelerator is a common theme in modern heterogeneous computing. Recent efforts (such as LEO and OpenACC) provide directives that allow the developer to mark code regions in the original application from which the compiler generates offloadable tasks. Auto-tuners and runtime schedulers can only choose among the options (i.e., offloadable tasks) generated at compile time, which are limited by the directives the developer specified. There is no provision for restructuring the offload.
We propose a new directive that adds relaxed semantics to directive-based languages. The compiler identifies and generates one or more offloadable tasks in the neighbourhood of the code region marked by the directive. Central to our contribution are the ideas of sub-offload and super-offload. In sub-offload, only part of the code region marked by the developer is offloaded to the accelerator, while the remainder executes on the CPU in parallel. This is achieved by splitting the index range of the main parallel loop into two or more parts and declaring one of the subloops as the offloadable task; support is added to handle reduction variables and critical sections across subloops. Sub-offload thus enables concurrent execution of a task on the CPU and the accelerator. In super-offload, a code region larger than the one specified by the developer (e.g., a parent loop) is declared as the offloadable task. Super-offload reduces data transfers between CPU and accelerator memory.
We develop the Elastic Offload Compiler (EOC) for use alongside existing directive-based languages. The current implementation supports LEO for the Intel Xeon Phi (MIC) architecture. We evaluate EOC on the SPEC OMP and NAS Parallel Benchmarks. Speedups range from 1.3x to 4.4x with the CPU version as the baseline and from 1.2x to 24x with the offload (CPU-MIC) version as the baseline.