Abstract
General-purpose computing on graphics processing units (GPGPU) with OpenCL can greatly reduce the execution time of data-parallel applications by exploiting the massive parallelism available. However, when an application with a small data size executes on a GPU, it cannot fully utilize the GPU's compute cores and resources are wasted. Because GPUs lack operating-system support, there is no mechanism to share a GPU between two kernels. In this paper, we propose a GPU sharing mechanism between two kernels that increases GPU occupancy and, as a result, reduces the execution time of a job pool. However, if a pair of kernels competes for the same set of resources (i.e., both are compute-intensive or both are memory-intensive), kernel fusion may significantly increase the execution time of the fused kernels. It is therefore important to select an optimal pair of kernels whose fusion yields a significant speedup over their serial execution. This research presents FusionCL, a machine learning-based GPU sharing mechanism for pairs of OpenCL kernels. FusionCL first identifies the kernel pairs (from the job pool) that are suitable candidates for fusion, using a machine learning-based fusion suitability classifier. Then, from all the candidates, it uses a fusion speedup predictor to select the pair expected to produce the maximum speedup after fusion over serial execution. The experimental evaluation shows that the proposed kernel fusion mechanism reduces execution time by 2.83× compared to a baseline scheduling scheme, and by up to 8% compared to the state of the art.
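The two-stage selection described above (a suitability classifier followed by a speedup predictor) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the job-pool representation, and the stand-in heuristics inside `is_fusion_candidate` and `predicted_speedup` are all assumptions; in FusionCL both stages are trained machine-learning models.

```python
# Hypothetical sketch of FusionCL's two-stage pair selection.
# Stage 1: filter kernel pairs suitable for fusion (ML classifier in the paper).
# Stage 2: predict fusion speedup for each candidate pair (ML regressor in the
# paper) and dispatch the pair with the highest predicted speedup.
from itertools import combinations

def is_fusion_candidate(a, b):
    # Stand-in for the fusion-suitability classifier: avoid pairing kernels
    # that contend for the same resource class (both compute- or both
    # memory-intensive pairs are poor fusion candidates).
    return a["kind"] != b["kind"]

def predicted_speedup(a, b):
    # Stand-in for the fusion-speedup predictor: serial time / fused time,
    # with an assumed 20% overhead on the longer kernel when fused.
    serial = a["time"] + b["time"]
    fused = max(a["time"], b["time"]) * 1.2
    return serial / fused

def select_pair(job_pool):
    """Return the candidate pair with the highest predicted fusion speedup."""
    candidates = [(a, b) for a, b in combinations(job_pool, 2)
                  if is_fusion_candidate(a, b)]
    if not candidates:
        return None  # no suitable pair: fall back to serial execution
    return max(candidates, key=lambda p: predicted_speedup(*p))

pool = [
    {"name": "matmul",  "kind": "compute", "time": 10.0},
    {"name": "stencil", "kind": "memory",  "time": 8.0},
    {"name": "fft",     "kind": "compute", "time": 12.0},
]
best = select_pair(pool)
```

Under these toy numbers, the compute/compute pair (matmul, fft) is filtered out in stage 1, and of the remaining candidates the (matmul, stencil) pair has the highest predicted speedup.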
Notes
Flops = floating-point operations.
Acknowledgements
This research was partially supported by the National University of Computer and Emerging Sciences, Islamabad, under FRSG Grant No. 11-71/NU-R/20. This work also received support from the ASPIDE project, funded by the European Commission under the Horizon 2020 Programme (Grant Number 801091).
Cite this article
Khalid, Y.N., Aleem, M., Ahmed, U. et al. FusionCL: a machine-learning based approach for OpenCL kernel fusion to increase system performance. Computing 103, 2171–2202 (2021). https://doi.org/10.1007/s00607-021-00958-2