Abstract
In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required for a program's execution. Implementing host code efficiently is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising different devices, e.g., multi-core CPUs and Graphics Processing Units (GPUs): the programmer is responsible for explicitly managing node and device memory, for synchronizing computations with data transfers between devices of potentially different nodes, and for optimizing data transfers between devices' memories and nodes' main memories, e.g., by using pinned main memory to accelerate data transfers and by overlapping the transfers with computations. We develop the distributed OpenCL/CUDA abstraction layer (dOCAL), a novel high-level C++ library that simplifies the development of host code. dOCAL combines major advantages over the state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communication; (4) it simplifies implementing data transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, with a low runtime overhead.
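To illustrate the kind of low-level host code the abstract refers to, the following minimal sketch uses the plain CUDA runtime API (not dOCAL's own API, which is not shown here): it allocates pinned host memory, enqueues asynchronous host-to-device and device-to-host transfers on a stream, and runs a simple illustrative kernel between them, so that transfers can overlap with independent work. The kernel `scale` and all buffer sizes are hypothetical examples; the runtime calls (`cudaMallocHost`, `cudaMemcpyAsync`, `cudaStreamCreate`, etc.) are standard CUDA.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel: doubles each element in place.
__global__ void scale(float *d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20;
  float *h, *d;
  cudaMallocHost(&h, n * sizeof(float)); // pinned host memory: enables async DMA transfers
  cudaMalloc(&d, n * sizeof(float));     // device memory

  cudaStream_t s;
  cudaStreamCreate(&s);

  // Enqueue transfer -> kernel -> transfer on one stream; these run
  // asynchronously w.r.t. the host and can overlap with work on other streams.
  cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
  scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
  cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);

  cudaStreamSynchronize(s); // wait for the whole pipeline to drain
  cudaStreamDestroy(s);
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}
```

Multiplied across devices, nodes, and the two programming models, every application must repeat this boilerplate; this is the management burden that dOCAL's high-level API takes over.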
Cite this article
Rasch, A., Bigge, J., Wrodarczyk, M. et al. dOCAL: high-level distributed programming with OpenCL and CUDA. J Supercomput 76, 5117–5138 (2020). https://doi.org/10.1007/s11227-019-02829-2