
Semi-automatic restructuring of offloadable tasks for many-core accelerators

Published: 17 November 2013
DOI: 10.1145/2503210.2503285

ABSTRACT

Work division between the processor and accelerator is a common theme in modern heterogeneous computing. Recent efforts (such as LEO and OpenACC) provide directives that allow the developer to mark code regions in the original application from which the compiler can generate offloadable tasks. Auto-tuners and runtime schedulers work with the options (i.e., offloadable tasks) generated at compile time, so the choices available at runtime are limited by the directives the developer specified. There is no provision for restructuring the offload.
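For context, a developer-marked offload region under Intel LEO looks roughly like the following minimal sketch (the function name, arrays, and computation are illustrative, not taken from the paper):

```c
/* The developer marks a region; the compiler generates one offloadable
   task plus the host/coprocessor data transfers named in the clauses. */
void scale(const double *a, double *b, int n) {
    #pragma offload target(mic) in(a:length(n)) out(b:length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
}
```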

We propose a new directive that adds relaxed semantics to directive-based languages. The compiler identifies and generates one or more offloadable tasks in the neighbourhood of the code region marked by the directive. Central to our contribution are the ideas of sub-offload and super-offload. In sub-offload, only part of the code region marked by the developer is offloaded to the accelerator, while the remainder executes on the CPU in parallel. This is done by splitting the index range of the main parallel loop into two or more parts and declaring one of the subloops as the offloadable task. Support is added to handle reduction variables and critical sections across subloops. Sub-offload thus enables concurrent execution of a task on the CPU and the accelerator. In super-offload, a code region larger than the one specified by the developer (e.g., a parent loop) is declared as the offloadable task. Super-offload reduces data transfers between CPU and accelerator memory.
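To make sub-offload concrete, here is a minimal sketch of the transformation on a reduction loop, under stated assumptions: the LEO constructs (`#pragma offload`, `signal`/`wait`, `#pragma offload_wait`) are real Intel directives, but the 3:1 split point, the data, and the variable names are hypothetical, and EOC derives the split automatically rather than hard-coding it as done here.

```c
#include <stdio.h>

#define N 1000000
/* Global data referenced inside an offload region must also be visible
   on the coprocessor, hence the target(mic) attribute. */
__attribute__((target(mic))) static double a[N];

int main(void) {
    for (int i = 0; i < N; i++) a[i] = (double)i / N;

    double sum_mic = 0.0, sum_cpu = 0.0;
    int split = (3 * N) / 4;   /* hypothetical MIC/CPU split point */

    /* Subloop 1: offloaded asynchronously; the signal() clause makes the
       pragma non-blocking, so the host falls through to the next loop. */
    #pragma offload target(mic:0) in(a[0:split]) inout(sum_mic) signal(&sum_mic)
    {
        #pragma omp parallel for reduction(+:sum_mic)
        for (int i = 0; i < split; i++)
            sum_mic += a[i] * a[i];
    }

    /* Subloop 2: runs on the host CPU concurrently with the offload. */
    #pragma omp parallel for reduction(+:sum_cpu)
    for (int i = split; i < N; i++)
        sum_cpu += a[i] * a[i];

    /* Block until the coprocessor task finishes, then merge the partial
       sums; this is the cross-subloop reduction handling the abstract
       refers to. */
    #pragma offload_wait target(mic:0) wait(&sum_mic)
    printf("sum = %f\n", sum_mic + sum_cpu);
    return 0;
}
```

Super-offload moves the task boundary in the opposite direction. In the hypothetical sketch below, the developer marked only the inner sweep of a timestep loop, so the array crosses the PCIe bus on every timestep; declaring the parent loop as the offloadable task transfers it once in each direction. The `relax` update function is invented for illustration.

```c
#define N 100000
#define T 1000

/* Hypothetical pointwise update, compiled for both host and coprocessor. */
__attribute__((target(mic))) double relax(double x) { return 0.5 * x + 0.25; }
__attribute__((target(mic))) static double u[N];

void before(void) {
    /* Developer-marked region: the inner sweep. u is transferred to and
       from the coprocessor on every one of the T timesteps. */
    for (int t = 0; t < T; t++) {
        #pragma offload target(mic:0) inout(u[0:N])
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                u[i] = relax(u[i]);
        }
    }
}

void after(void) {
    /* Super-offload: the parent loop becomes the offloadable task, so u
       crosses the bus once in and once out for all T timesteps. */
    #pragma offload target(mic:0) inout(u[0:N])
    for (int t = 0; t < T; t++) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            u[i] = relax(u[i]);
    }
}
```

Either restructuring simply enlarges the set of compile-time options; the auto-tuner or runtime scheduler still chooses among them, as described above.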

We develop the Elastic Offload Compiler (EOC) for use alongside existing directive-based languages. The current implementation supports LEO for the new Intel Xeon Phi (MIC) architecture. We evaluate EOC on the SPEC OMP and NAS Parallel Benchmarks. Speedups range from 1.3x to 4.4x with the CPU version as the baseline, and from 1.2x to 24x with the offload (CPU-MIC) version as the baseline.


Published in

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013, 1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
General Chair: William Gropp
Program Chair: Satoshi Matsuoka
Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SC '13 paper acceptance rate: 91 of 449 submissions (20%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)
