Abstract
Heterogeneous processors, comprising CPU cores and a GPU, are the de facto standard in desktop and mobile platforms. In many cases it is worthwhile to exploit both the CPU and the GPU simultaneously. However, workload distribution poses a challenge when running irregular applications. In this paper, we present LogFit, a novel adaptive partitioning strategy for parallel loops, specially designed for applications with irregular data accesses running on heterogeneous CPU–GPU architectures. Our algorithm dynamically finds the optimal chunk size that must be assigned to the GPU. Likewise, the number of iterations assigned to the CPU cores is adaptively computed to avoid load imbalance. In addition, we strive to increase the programmer's productivity by providing a high-level template that eases the coding of heterogeneous parallel loops. We evaluate LogFit's performance and energy consumption using a set of irregular benchmarks running on a heterogeneous CPU–GPU processor, an Intel Haswell. Our experimental results show that we outperform Oracle-like static and other dynamic state-of-the-art approaches both in terms of performance, by up to 57%, and energy saving, by up to 31%.
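To illustrate the partitioning idea described in the abstract, the following is a minimal, hypothetical sketch (not the paper's actual implementation): the iteration space is split into one GPU chunk, whose size LogFit would adapt at runtime from throughput measurements, plus smaller fixed-size chunks for the CPU cores. The function name `partition` and both chunk-size parameters are illustrative assumptions.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical sketch of heterogeneous chunk partitioning.
// LogFit adapts gpu_chunk dynamically; here it is simply a parameter.
std::vector<std::pair<int, int>> partition(int begin, int end,
                                           int gpu_chunk, int cpu_chunk) {
    std::vector<std::pair<int, int>> chunks;
    int i = begin;
    // The first (larger) chunk is offloaded to the GPU.
    int g = std::min(i + gpu_chunk, end);
    chunks.emplace_back(i, g);
    i = g;
    // The remaining iterations are handed to CPU cores in smaller chunks,
    // so the CPU and GPU finish at roughly the same time.
    while (i < end) {
        int c = std::min(i + cpu_chunk, end);
        chunks.emplace_back(i, c);
        i = c;
    }
    return chunks;
}
```

In the real strategy, the GPU chunk size is re-estimated after each offload by fitting the observed throughput, and the CPU chunk size is derived from it to keep both devices balanced.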
Notes
For instance, the Threading Building Blocks library (TBB) [22] recommends using CPU chunk sizes that take at least 100,000 clock cycles.
RO = Read-Only; WO = Write-Only; RW = Read–Write
\(nEU =\,\)clGetDeviceInfo(deviceId, CL_DEVICE_MAX_COMPUTE_UNITS)
References
Augonnet, C., Clet-Ortega, J., Thibault, S., Namyst, R.: Data-aware task scheduling on multi-accelerator based platforms. In: Proceedings of ICPADS, pp. 291–298 (2010)
Belviranli, M., Bhuyan, L., Gupta, R.: A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim. 9(4), 57 (2013)
Bueno, J., Planas, J., Duran, A., Badia, R., Martorell, X., Ayguade, E., Labarta, J.: Productive programming of GPU clusters with OmpSs. In: Proceedings of IPDPS (2012)
Burtscher, M., Nasre, R., Pingali, K.: A quantitative study of irregular programs on GPUs. In: Proceedings of IISWC, pp. 141–151 (2012)
Chatterjee, S., Grossman, M., Sbirlea, A., Sarkar, V.: Dynamic task parallelism with a GPU work-stealing runtime system. In: LNCS Series, vol. 7146, pp. 203–217 (2011)
Che, S., et al.: A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In: IISWC, pp. 1–11 (2010)
Danalis, A., Marin, G., McCurdy, C., et al.: The scalable heterogeneous computing (SHOC) benchmark suite. In: GPGPU, pp. 63–74 (2010)
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1–25 (2011)
Dementiev, R., Willhalm, T., Bruggeman, O., et al.: Intel Performance Counter Monitor (2012). www.intel.com/software/pcm
Gibbon, P., Frings, W., Mohr, B.: Performance analysis and visualization of the n-body tree code PEPC on massively parallel computers. In: PARCO, pp. 367–374 (2005)
Hart, A.: The OpenACC programming model. Technical Report, Cray Exascale Research Initiative Europe (2012)
Intel: Intel OpenCL N-Body Sample (2014)
Intel VTune Amplifier 2015 (2014). https://software.intel.com/en-us/intel-vtune-amplifier-xe
Kaleem, R., et al.: Adaptive heterogeneous scheduling for integrated GPUs. In: International Conference on Parallel Architectures and Compilation, PACT ’14, pp. 151–162 (2014)
Kulkarni, M., Burtscher, M., Cascaval, C., Pingali, K.: Lonestar: a suite of parallel irregular programs. In: ISPASS, pp. 65–76 (2009)
Li, D., Rhu, M., et al.: Priority-based cache allocation in throughput processors. In: International Symposium on High Performance Computer Architecture (HPCA) (2015)
Lima, J., Gautier, T., Maillard, N., Danjean, V.: Exploiting concurrent GPU operations for efficient work stealing on multi-GPUs. In: SBAC-PAD’12, pp. 75–82 (2012)
Luk, C.K., Hong, S., Kim, H.: Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In: Proceedings of Microarchitecture, pp. 45–55 (2009)
Navarro, A., Vilches, A., Corbera, F., Asenjo, R.: Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures. J. Supercomput. 70, 756–771 (2014)
NVidia: CUDA Toolkit 5.0 Performance Report (2013)
Pandit, P., Govindarajan, R.: Fluidic kernels: cooperative execution of OpenCL programs on multiple heterogeneous devices. In: CGO (2014)
Reinders, J.: Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly Media, Inc. (2007)
Rogers, T.G., O’Connor, M., Aamodt, T.M.: Cache-conscious wavefront scheduling. In: IEEE/ACM International Symposium on Microarchitecture, MICRO-45 (2012)
Russel, S.: Leveraging GPGPU and OpenCL technologies for natural user interfaces. Technical Report, You i Labs Inc. (2012)
Sbirlea, A., Zou, Y., Budimlic, Z., Cong, J., Sarkar, V.: Mapping a data-flow programming model onto heterogeneous platforms. In: Proceedings of LCTES, pp. 61–70 (2012)
Wang, Z., Zheng, L., Chen, Q., Guo, M.: CPU + GPU scheduling with asymptotic profiling. Parallel Comput. 40(2), 107–115 (2014)
Acknowledgements
Funding was provided by Ministerio de Economía y Competitividad (Grant No. TIN2016-80920-R) and Consejería de Economía, Innovación, Ciencia y Empleo, Junta de Andalucía P11-(Grant No. TIC-08144).
Cite this article
Navarro, A., Corbera, F., Rodriguez, A. et al. Heterogeneous parallel_for Template for CPU–GPU Chips. Int J Parallel Prog 47, 213–233 (2019). https://doi.org/10.1007/s10766-018-0555-0