clusterCL: comprehensive support for multi-kernel data-parallel applications in heterogeneous asymmetric clusters

Raca, Valon; Mehofer, Eduard

doi:10.1007/s11227-020-03234-w

clusterCL: comprehensive support for multi-kernel data-parallel applications in heterogeneous asymmetric clusters

Published: 09 March 2020

Volume 76, pages 9976–10008, (2020)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Valon Raca¹ &
Eduard Mehofer¹

187 Accesses
Explore all metrics

Abstract

Heterogeneous cluster systems consisting of CPUs and different kinds of accelerators have become mainstream in HPC. Programming such systems is a difficult task and requires addressing manifold challenges that stem from the intricate composition of such systems and peculiarities of scientific applications. A broad range of obstacles preventing efficient execution have to be considered and dealt with properly. In this paper, we propose a systematic approach and a framework that is capable of providing comprehensive support for running data-parallel applications in heterogeneous asymmetric clusters. Our implementation provides work partitioning and distribution by ensuring workload balance in the cluster while handling of partitioning-induced communication and synchronization in a transparent way. In our experimental section, we choose 11 representative scientific applications from different domains to evaluate our approach. Experimental results show a strong speedup and workload balance for different cluster configurations.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Source-to-Source Parallelization Compilers for Scientific Shared-Memory Multi-core and Accelerated Multiprocessing: Analysis, Pitfalls, Enhancement and Potential

Article 08 August 2019

An Open MPI Extension for Supporting Task Based Parallelism in Heterogeneous CPU-GPU Clusters

A Self-adaptive HPL-Based Benchmark with Dynamic Task Parallelism for Multicore Systems

References

Alves A, Rufino J, Pina A, Santos L (2013) clOpenCL—supporting distributed heterogeneous computing in HPC clusters. In: Euro-Par 2012: Parallel Processing Workshops, LNCS, vol 7640. Springer, Berlin, pp 112–122
AMD: AMD accelerated parallel processing SDK. https://developer.amd.com/tools-and-sdks/. Accessed May 2019
Aoki R, Oikawa S, Nakamura T, Miki S (2011) Hybrid OpenCL: enhancing OpenCL for distributed processing. In: International Symposium on Parallel and Distributed Processing with Applications (ISPA), pp 149–154
Augonnet Cedric, Thibault Samuel, Namyst Raymond, Wacrenier Pierre-Andre (2011) Starpu: a unified platform for task scheduling on heterogeneous multicore architectures. Concurr Comput Pract Exp 23(2):187–198
Article Google Scholar
Barak A, Ben-Nun T, Levy E, Shiloh A (2010) A package for OpenCL based heterogeneous computing on clusters with many GPU devices. In: 2010 IEEE Conference on Cluster Computing Workshops and Posters, pp 1–7
Beaumont O, Becker BA, DeFlumere A, Eyraud-Dubois L, Lambert T, Lastovetsky A (2019) Recent advances in matrix partitioning for parallel computing on heterogeneous platforms. IEEE Trans Parallel Distrib Syst 30(1):218–229
Article Google Scholar
Beaumont O, Boudet V, Rastello F, Robert Y (2002) Partitioning a square into rectangles: Np-completeness and approximation algorithms. Algorithmica 34(3):217–239
Article MathSciNet Google Scholar
Coti C, Herault T, Lemarinier P, Pilard L, Rezmerita A, Rodriguezb E, Cappello F (2006) Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In: SC’06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p 18
Daly JT (2006) A higher order estimate of the optimum checkpoint interval for restart dumps. Fut Gener Comput Syst 22(3):303–312
Article Google Scholar
Diop T, Gurfinkel S, Anderson J, Jerger NE (2013) Distcl: a framework for the distributed execution of opencl kernels. In: 2013 IEEE 21st International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems, pp 556–566
Duato J, Peña AJ, Silla F, Mayo R, Quintana-Ortí ES (2010) rCUDA: reducing the number of GPU-based accelerators in high performance clusters. In: HPCS, pp 224–231
Eskikaya B, Altilar DT (2012) Distributed OpenCL distributing OpenCL platform on network scale. Int J Comput Appl ACCTHPCA(2):25–30
Google Scholar
Gentzsch V (2001) Sun grid engine: towards creating a compute power grid. In: Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid, pp 35–36
Grasso I, Pellegrini S, Cosenza B, Fahringer T (2014) A uniform approach for programming distributed heterogeneous computing systems. J Parallel Distrib Comput 74(12):3228–3239 Domain-specific languages and high-level frameworks for high-performance computing
Article Google Scholar
Grauer-Gray S, Xu L, Searles R, Ayalasomayajula S, Cavazos J (2012) Auto-tuning a high-level language targeted to GPU codes. In: 2012 Innovative Parallel Computing (InPar), pp 1–10
Grewe D, O’Boyle MFP (2011) A static task partitioning approach for heterogeneous systems using opencl. In: Knoop J (ed) Compiler construction. Springer, Berlin, pp 286–305
Chapter Google Scholar
Gschwandtner P, Durillo JJ, Fahringer T (2014) Multi-objective auto-tuning with insieme: optimization and trade-off analysis for time, energy and resource usage. In: Euro-Par 2014 Parallel Processing, LNCS, vol 8632. Springer, Berlin, pp 87–98
Kegel P, Steuwer M, Gorlatch S (2012) dOpenCL: towards a uniform programming approach for distributed heterogeneous multi-/many-core systems. In: Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), pp 174–186
Khronos OpenCL Working Group: The OpenCL C Specification. Version 2.0. http://www.khronos.org/opencl. Accessed May 2019
Kim J, Seo S, Lee J, Nah J, Jo G, Lee J (2012) SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In: Proceedings of the 26th ACM International Conference on Supercomputing (ICS). San Servolo Island, Venice, Italy, pp 341–352
Lee J, Samadi M, Mahlke S (2015) Orchestrating multiple data-parallel kernels on multiple devices. In: 2015 International Conference on Parallel Architecture and Compilation (PACT), pp 355–366
Lee J, Samadi M, Park Y, Mahlke S (2013) Transparent CPU–GPU collaboration for data-parallel kernels on heterogeneous systems. In: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT’13. IEEE Press, Piscataway, NJ, USA, pp 245–256
Louis-Noel Pouchet: PolyBench/GPU. http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/ (2012). Accessed May 2019
Luk CK, Hong S, Kim H (2009) Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In: 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp 45–55
MPI Forum: MPI: a message-passing interface standard, Version 3.0. http://www.mpi-forum.org. Accessed May 2019
nVidia: CUDA zone. https://developer.nvidia.com/cuda-zone. Accessed May 2019
nVidia: OpenCL SDK. https://developer.nvidia.com/opencl. Accessed May 2019
OpenACC: OpenACC Standard. https://www.openacc.org/. Accessed May 2019
OpenMP: OpenMP specification. https://www.openmp.org/. Accessed May 2019
Planas J, Badia RM, Ayguadé E, Labarta J (2013) Self-adaptive OMPSS tasks in heterogeneous environments. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp 138–149
Raca V, Mehofer E (2015) Device-sensitive framework for handling heterogeneous asymmetric clusters efficiently. In: 26th IEEE International Symposium on Computer Architecture and High Performance Computing. Florianopolis, Brazil, pp 181–188
Raca V, Mehofer E, Hudec M (2016) Optimal time and energy efficient work distributions in heterogeneous systems. In: 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP). Heraklion, Greece, pp 440–447
Rasch A, Bigge J, Wrodarczyk M, Schulze R, Gorlatch S (2019) dOCAL: high-level distributed programming with OpenCL and CUDA. J Supercomput
Shen J, Varbanescu AL, Lu Y, Zou P, Sips H (2016) Workload partitioning for accelerating applications on heterogeneous platforms. IEEE Trans Parallel Distrib Syst 27(9):2766–2780
Article Google Scholar
Shen J, Varbanescu AL, Martorell X, Sips H (2015) Matchmaking applications and partitioning strategies for efficient execution on heterogeneous platforms. In: 2015 44th International Conference on Parallel Processing, pp 560–569
Steuwer M, Kegel P, Gorlatch S (2011) Skelcl—a portable skeleton library for high-level gpu programming. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp 1176–1182
Top500: Top 500 list. https://www.top500.org/lists/2019/06. Accessed July 2019
Wen Y, Wang Z, O’Boyle MFP (2014) Smart multi-task scheduling for openCL programs on CPU/GPU heterogeneous platforms. In: 2014 21st International Conference on High Performance Computing (HiPC), pp 1–10
Yoo AB, Jette MA, Grondona M (2003) Slurm: simple linux utility for resource management. In: Feitelson D, Rudolph L, Schwiegelshohn U (eds) Job scheduling strategies for parallel processing. Springer, Berlin, pp 44–60
Chapter Google Scholar
Young JW (1974) A first order approximation to the optimum checkpoint interval. Commun ACM 17(9):530–531
Article Google Scholar

Download references

Acknowledgements

This work was supported in part by the Austrian Science Fund (FWF) Project P 29783 Dynamic Runtime System for Future Parallel Architectures.

Author information

Authors and Affiliations

Faculty of Computer Science, University of Vienna, Vienna, Austria
Valon Raca & Eduard Mehofer

Authors

Valon Raca
View author publications
You can also search for this author inPubMed Google Scholar
Eduard Mehofer
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Valon Raca.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Raca, V., Mehofer, E. clusterCL: comprehensive support for multi-kernel data-parallel applications in heterogeneous asymmetric clusters. J Supercomput 76, 9976–10008 (2020). https://doi.org/10.1007/s11227-020-03234-w

Download citation

Published: 09 March 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11227-020-03234-w

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

clusterCL: comprehensive support for multi-kernel data-parallel applications in heterogeneous asymmetric clusters

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Source-to-Source Parallelization Compilers for Scientific Shared-Memory Multi-core and Accelerated Multiprocessing: Analysis, Pitfalls, Enhancement and Potential

An Open MPI Extension for Supporting Task Based Parallelism in Heterogeneous CPU-GPU Clusters

A Self-adaptive HPL-Based Benchmark with Dynamic Task Parallelism for Multicore Systems

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now