Abstract
In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required for a program's execution. Implementing host code efficiently is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising different devices, e.g., multi-core CPUs and Graphics Processing Units (GPUs): the programmer is responsible for explicitly managing node and device memory, for synchronizing computations with data transfers between devices of potentially different nodes, and for optimizing data transfers between devices' memories and nodes' main memories, e.g., by using pinned main memory to accelerate data transfers and by overlapping the transfers with computations. We develop the distributed OpenCL/CUDA abstraction layer (dOCAL), a novel high-level C++ library that simplifies the development of host code. dOCAL combines major advantages over the state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communication; (4) it simplifies implementing data transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, with a low runtime overhead.
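To illustrate the kind of low-level host code the abstract refers to, the following minimal sketch uses the plain CUDA runtime API (not dOCAL's own API, which is not shown here): it allocates pinned host memory, enqueues asynchronous host-to-device and device-to-host transfers on a stream, and runs a simple illustrative kernel between them, so that transfers can overlap with independent work. The kernel `scale` and all buffer sizes are hypothetical examples; the runtime calls (`cudaMallocHost`, `cudaMemcpyAsync`, `cudaStreamCreate`, etc.) are standard CUDA.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel: doubles each element in place.
__global__ void scale(float *d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20;
  float *h, *d;
  cudaMallocHost(&h, n * sizeof(float)); // pinned host memory: enables async DMA transfers
  cudaMalloc(&d, n * sizeof(float));     // device memory

  cudaStream_t s;
  cudaStreamCreate(&s);

  // Enqueue transfer -> kernel -> transfer on one stream; these run
  // asynchronously w.r.t. the host and can overlap with work on other streams.
  cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
  scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
  cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);

  cudaStreamSynchronize(s); // wait for the whole pipeline to drain
  cudaStreamDestroy(s);
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}
```

Multiplied across devices, nodes, and the two programming models, every application must repeat this boilerplate; this is the management burden that dOCAL's high-level API takes over.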
Cite this article
Rasch, A., Bigge, J., Wrodarczyk, M. et al. dOCAL: high-level distributed programming with OpenCL and CUDA. J Supercomput 76, 5117–5138 (2020). https://doi.org/10.1007/s11227-019-02829-2