dOCAL: high-level distributed programming with OpenCL and CUDA

Abstract

In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required for a program's execution. Efficiently implementing host code is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising different devices, e.g., multi-core CPUs and graphics processing units (GPUs): the programmer is responsible for explicitly managing the memory of each node and device, for synchronizing computations with data transfers between devices on potentially different nodes, and for optimizing data transfers between devices' memories and nodes' main memories, e.g., by using pinned main memory to accelerate data transfers and to overlap the transfers with computations. We develop the distributed OpenCL/CUDA abstraction layer (dOCAL), a novel high-level C++ library that simplifies the development of host code. dOCAL combines major advantages over state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communication; (4) it simplifies implementing data-transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, at a low runtime overhead.
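
To make the host-code burden described above concrete, the following is a minimal sketch in plain CUDA (illustrative only; it does not reproduce dOCAL's own API) of one of the optimizations the abstract mentions: allocating pinned main memory and overlapping host-device data transfers with computation via streams. The kernel, problem size, and chunking are hypothetical placeholders; this is the kind of explicit management that a high-level layer such as dOCAL aims to hide.

// Plain-CUDA sketch (illustrative only, not dOCAL): pinned host memory and
// overlapping of data transfers with computation via CUDA streams.
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel: doubles every element.
__global__ void process(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20, chunks = 4, chunk = n / chunks;
  float *h_data, *d_data;
  cudaMallocHost(&h_data, n * sizeof(float));  // pinned main memory: enables fast, asynchronous transfers
  cudaMalloc(&d_data, n * sizeof(float));
  for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

  cudaStream_t streams[chunks];
  for (int c = 0; c < chunks; ++c) cudaStreamCreate(&streams[c]);

  // Each chunk's copy-in, kernel, and copy-out are issued to its own stream,
  // so transfers of one chunk overlap with computation on another.
  for (int c = 0; c < chunks; ++c) {
    int off = c * chunk;
    cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[c]);
    process<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(d_data + off, chunk);
    cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[c]);
  }
  cudaDeviceSynchronize();  // wait for all streams before using results on the host

  printf("h_data[0] = %f\n", h_data[0]);  // expected: 2.0
  for (int c = 0; c < chunks; ++c) cudaStreamDestroy(streams[c]);
  cudaFreeHost(h_data);
  cudaFree(d_data);
  return 0;
}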

Corresponding author

Correspondence to Ari Rasch.

Cite this article

Rasch, A., Bigge, J., Wrodarczyk, M. et al. dOCAL: high-level distributed programming with OpenCL and CUDA. J Supercomput 76, 5117–5138 (2020). https://doi.org/10.1007/s11227-019-02829-2
