Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures

Navarro, Angeles; Vilches, Antonio; Corbera, Francisco; Asenjo, Rafael

doi:10.1007/s11227-014-1200-3

Published: 13 May 2014

Volume 70, pages 756–771, (2014)
The Journal of Supercomputing

Angeles Navarro¹,
Antonio Vilches¹,
Francisco Corbera¹ &
…
Rafael Asenjo²

555 Accesses
18 Citations

Abstract

This paper explores the possibility of efficiently executing a single application using multicores simultaneously with multiple GPU accelerators under a parallel task programming paradigm. In particular, we address the challenge of extending a parallel_for template to allow its exploitation on heterogeneous architectures. Due to the asymmetry of the computing resources, we propose a dynamic scheduling strategy coupled with an adaptive partitioning scheme that resizes chunks to prevent underutilization and load imbalance of CPUs and GPUs. We also address the underutilization of the CPU core where a host thread operates. To solve it, we propose two different approaches: (1) a collaborative host thread strategy, in which the host thread, instead of busy-waiting for the GPU to complete, carries out useful chunk processing; and (2) a host thread blocking strategy combined with oversubscription, which delegates to the OS the duty of scheduling threads to available CPU cores in order to guarantee that all cores are doing useful work. Using two benchmarks, we evaluate the overhead introduced by our scheduling and partitioning algorithms and find it to be negligible. We also evaluate the efficiency of the proposed strategies, finding that allowing oversubscription controlled by the OS can be beneficial under certain scenarios.
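
The idea of dynamic scheduling with chunks that are resized per device can be illustrated with a minimal sketch. It is not the authors' algorithm, which this page does not describe in detail. It assumes a guided-style rule in which each worker claims a share of the remaining iterations proportional to a relative speed weight, so chunks shrink as the iteration space drains. The names ChunkDispenser, run_sketch and weight, the factor 2 in the chunk rule, and the minimum chunk of 4 are all hypothetical, and the "GPU" worker is an ordinary CPU thread standing in for a host thread that would offload its chunk to an accelerator.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical shared iteration space [0, n). Workers claim chunks atomically,
// so every iteration is handed out exactly once.
class ChunkDispenser {
public:
    ChunkDispenser(std::size_t n, std::size_t total_weight, std::size_t min_chunk)
        : next_(0), end_(n), total_weight_(total_weight), min_chunk_(min_chunk) {
        assert(total_weight > 0 && min_chunk > 0);
    }

    // Claims a chunk sized by the caller's weight; returns false once the
    // space is exhausted. Assumed rule: half of the caller's proportional
    // share of what remains, but never fewer than min_chunk iterations.
    bool claim(std::size_t weight, std::size_t& begin, std::size_t& end) {
        std::size_t cur = next_.load();
        while (cur < end_) {
            const std::size_t remaining = end_ - cur;
            std::size_t size = remaining * weight / (2 * total_weight_);
            size = std::min(std::max(size, min_chunk_), remaining);
            if (next_.compare_exchange_weak(cur, cur + size)) {
                begin = cur;
                end = cur + size;
                return true;
            }
            // A failed exchange reloads cur; retry from the new position.
        }
        return false;
    }

private:
    std::atomic<std::size_t> next_;
    const std::size_t end_, total_weight_, min_chunk_;
};

// Runs cpu_workers threads of weight 1 plus one stand-in accelerator thread
// of weight gpu_weight, and returns how many times each index was processed.
std::vector<int> run_sketch(std::size_t n, unsigned cpu_workers, std::size_t gpu_weight) {
    std::vector<int> hits(n, 0);
    ChunkDispenser space(n, cpu_workers + gpu_weight, 4);
    auto worker = [&](std::size_t weight) {
        std::size_t b = 0, e = 0;
        while (space.claim(weight, b, e))
            for (std::size_t i = b; i < e; ++i) ++hits[i];  // the loop body
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < cpu_workers; ++t) pool.emplace_back(worker, std::size_t{1});
    pool.emplace_back(worker, gpu_weight);  // would launch a kernel in a real system
    for (auto& t : pool) t.join();
    return hits;
}
```

Calling run_sketch(1000, 3, 8), for example, should return a vector in which every entry is 1, since each index is claimed by exactly one worker. In this sketch the weights are fixed; a scheme that adapts them at run time from measured throughput would be closer in spirit to the adaptive partitioning the abstract mentions, but that goes beyond what can be inferred from this page.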

Acknowledgments

This material is based on work supported by Spanish projects: TIN2010-16144 from the Ministerio de Ciencia e Innovación, by P08-TIC-3500 and P11-TIC-8144 from the Junta de Andalucía, and by CAPAP-H4 network (TIN2011-15734-E).

Author information

Authors and Affiliations

Department of Computer Architecture, University of Malaga, Málaga, Spain
Angeles Navarro, Antonio Vilches & Francisco Corbera
Andalucía Tech, Department of Computer Architecture, Universidad de Málaga, Málaga, Spain
Rafael Asenjo

Corresponding author

Correspondence to Rafael Asenjo.

About this article

Cite this article

Navarro, A., Vilches, A., Corbera, F. et al. Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures. J Supercomput 70, 756–771 (2014). https://doi.org/10.1007/s11227-014-1200-3

Published: 13 May 2014
Issue Date: November 2014
DOI: https://doi.org/10.1007/s11227-014-1200-3
