ABSTRACT
Dividing work between the processor and the accelerator is a common theme in modern heterogeneous computing. Recent efforts (such as LEO and OpenACC) provide directives that allow the developer to mark code regions in the original application from which the compiler generates offloadable tasks. Auto-tuners and runtime schedulers can only choose among the options (i.e., offloadable tasks) generated at compile time, which are limited by the directives the developer specified. There is no provision for restructuring the offload.
We propose a new directive that adds relaxed semantics to directive-based languages. The compiler identifies and generates one or more offloadable tasks in the neighbourhood of the code region marked by the directive. Central to our contribution are the ideas of sub-offload and super-offload. In sub-offload, only part of the code region marked by the developer is offloaded to the accelerator, while the remainder executes on the CPU in parallel. This is achieved by splitting the index range of the main parallel loop into two or more parts and declaring one of the subloops as the offloadable task; support is added to handle reduction variables and critical sections across subloops. Sub-offload thus enables concurrent execution of a task on the CPU and the accelerator. In super-offload, a code region larger than the one specified by the developer (e.g., a parent loop) is declared as the offloadable task. Super-offload reduces data transfers between CPU and accelerator memory.
We develop the Elastic Offload Compiler (EOC) for use alongside existing directive-based languages. The current implementation supports LEO for the Intel Xeon Phi (MIC) architecture. We evaluate EOC on the SPEC OMP and NAS Parallel Benchmarks. Speedups range from 1.3x to 4.4x with the CPU version as the baseline and from 1.2x to 24x with the offload (CPU-MIC) version as the baseline.