Abstract
This paper proposes a deep Q network (DQN)-based method for the workload partitioning problem in OpenCL. The DQN, a reinforcement learning algorithm, optimizes the workload partition for each processing unit through self-training on performance data accumulated from the computing environment. Our experiments show that the DQN-based partition improves JPEG decoding performance by up to 62.2% and 6.9% over the LuxMark-based and target-based partitions, respectively. The DQN captures low-level contention in slave devices, such as caches and memory, as well as the communication bottleneck between devices, and reflects them in the workload partition ratio.
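The core idea of reward-driven partition selection can be illustrated with a minimal sketch. This is not the paper's DQN: it uses tabular Q-learning over a discretized set of CPU/GPU split ratios, and a hypothetical analytic cost model (GPU assumed 3x faster, devices running their shares in parallel) stands in for the measured runtimes the paper accumulates from the real computing environment.

```python
import random

# Candidate partition ratios: fraction of the workload assigned to the GPU.
RATIOS = [i / 10 for i in range(11)]

def simulated_runtime(gpu_share):
    """Hypothetical cost model (assumption, not from the paper):
    CPU and GPU process their shares concurrently, so the overall
    runtime is the slower of the two; the GPU is taken to be 3x faster."""
    cpu_time = (1.0 - gpu_share) * 1.0
    gpu_time = gpu_share / 3.0
    return max(cpu_time, gpu_time)

def train(episodes=2000, alpha=0.1, epsilon=0.2, seed=0):
    """Epsilon-greedy tabular Q-learning over a one-step decision:
    pick a ratio, observe runtime, use negative runtime as the reward."""
    rng = random.Random(seed)
    q = [0.0] * len(RATIOS)  # one Q-value per candidate ratio
    for _ in range(episodes):
        if rng.random() < epsilon:               # explore
            a = rng.randrange(len(RATIOS))
        else:                                    # exploit current estimate
            a = max(range(len(RATIOS)), key=lambda i: q[i])
        reward = -simulated_runtime(RATIOS[a])   # faster run => higher reward
        q[a] += alpha * (reward - q[a])          # incremental Q update
    return RATIOS[max(range(len(RATIOS)), key=lambda i: q[i])]

best = train()
print(best)  # settles near the analytic optimum of 0.75 for this cost model
```

In the paper's setting, the reward would come from actual kernel executions, so effects the cost model cannot express, such as cache and memory contention or inter-device communication overhead, are folded into the learned values automatically.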
Acknowledgements
This work was partially supported by the National Research Foundation of Korea under Grant NRF-2017R1D1A1B03028926.
Cite this article
Park, S., Suh, T. DQN-based OpenCL workload partition for performance optimization. J Supercomput 75, 4875–4893 (2019). https://doi.org/10.1007/s11227-019-02766-0