Abstract
Cloud environments today increasingly feature hybrid nodes that combine multicore CPUs with a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to facilitate the migration of HPC workloads. While virtualization of accelerators in clouds is a leading research challenge, in this paper we address the programming challenges that beset the execution of large instances of data-parallel applications on these accelerators. In a typical hybrid node in a cloud, accelerators are tightly integrated with the multicore CPUs via PCI-E communication links, an arrangement with inherent limitations: the limited main memory of the accelerators and the limited bandwidth of the PCI-E links. These limitations pose formidable programming challenges to the execution of large problem sizes on these accelerators. In this paper, we describe a library of interfaces (HCLOOC) that addresses these challenges. It employs optimal software pipelines to overlap data transfers between the host CPU and an accelerator with computations on the accelerator. It is built on fundamental building blocks: OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events, which allow concurrent use of the copy and execution engines provided in NVIDIA GPUs. We elucidate the key features of our library using an out-of-core implementation of the multiplication of large dense matrices on a hybrid node: an Intel Haswell multicore CPU server hosting three accelerators, an NVIDIA K40c GPU, an Intel Xeon Phi 3120P, and a Xilinx FPGA. Based on experiments with the GPU, we show that our out-of-core implementation achieves 82% of the peak double-precision floating-point performance of the GPU and a speedup of 2.7 times over NVIDIA's out-of-core matrix multiplication implementation (CUBLAS-XT). We also demonstrate that our implementation exhibits a 0% drop in performance when the problem size exceeds the main memory of the GPU. We observe the same 0% drop for our Intel Xeon Phi and Xilinx FPGA implementations.
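To make the pipelining idea concrete, the following is a minimal sketch, not the HCLOOC interface itself, of the CUDA-streams building block the abstract describes: a double-buffered out-of-core loop in which two streams and pinned host memory let the GPU's copy and execution engines overlap the transfer of one block with the computation on another. The kernel process and the constants NBLOCKS and BLOCK_LEN are placeholders introduced for illustration only.

// A minimal sketch, assuming a placeholder kernel `process` and
// illustrative block sizes; it is not the authors' HCLOOC API.
#include <cuda_runtime.h>
#include <stdio.h>

#define NBLOCKS   8          // number of out-of-core blocks (assumption)
#define BLOCK_LEN (1 << 20)  // elements per block (assumption)

__global__ void process(double *d, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0;  // placeholder computation
}

int main(void) {
    double *h;               // pinned host buffer holding all blocks
    double *d[2];            // two device buffers: double buffering
    cudaStream_t s[2];

    // Pinned (page-locked) host memory is required for cudaMemcpyAsync
    // to actually overlap with kernel execution.
    cudaMallocHost(&h, (size_t)NBLOCKS * BLOCK_LEN * sizeof(double));
    for (size_t i = 0; i < (size_t)NBLOCKS * BLOCK_LEN; ++i) h[i] = 1.0;
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d[b], BLOCK_LEN * sizeof(double));
        cudaStreamCreate(&s[b]);
    }

    for (int k = 0; k < NBLOCKS; ++k) {
        int b = k % 2;       // alternate streams/buffers between blocks
        double *hk = h + (size_t)k * BLOCK_LEN;
        // Host-to-device copy of block k; asynchronous w.r.t. the host.
        cudaMemcpyAsync(d[b], hk, BLOCK_LEN * sizeof(double),
                        cudaMemcpyHostToDevice, s[b]);
        // The kernel on block k is ordered after its copy within stream
        // s[b], but overlaps the copy of block k+1 in the other stream.
        process<<<(BLOCK_LEN + 255) / 256, 256, 0, s[b]>>>(d[b], BLOCK_LEN);
        // Device-to-host copy of the result uses the second copy engine.
        cudaMemcpyAsync(hk, d[b], BLOCK_LEN * sizeof(double),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize(); // drain both streams
    printf("h[0] = %f\n", h[0]);

    for (int b = 0; b < 2; ++b) { cudaFree(d[b]); cudaStreamDestroy(s[b]); }
    cudaFreeHost(h);
    return 0;
}

The same three-stage structure (copy in, compute, copy out) carries over to the OpenCL command queues used for the FPGA and the Intel offload streams used for the Xeon Phi; only the queuing primitive changes.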





Acknowledgements
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 14/IA/2474.
Cite this article
Khaleghzadeh, H., Zhong, Z., Reddy, R. et al. Out-of-core implementation for accelerator kernels on heterogeneous clouds. J Supercomput 74, 551–568 (2018). https://doi.org/10.1007/s11227-017-2141-4