Efficient fine-grained shared buffer management for multiple OpenCL devices

Xun, Chang-qing; Chen, Dong; Lan, Qiang; Zhang, Chun-yuan

doi:10.1631/jzus.C1300078

Published: 09 November 2013

Volume 14, pages 859–872, (2013)
Journal of Zhejiang University SCIENCE C

Abstract

OpenCL provides full code portability across hardware platforms, which makes it a good programming candidate for heterogeneous systems that typically consist of a host processor and several accelerators. To make full use of the computing capacity of such a system, however, programmers must manage the diverse OpenCL-enabled devices explicitly, both distributing the workload among devices and managing data transfer between them. These tedious tasks pose a considerable challenge for programmers. In this paper, we present a distributed shared OpenCL memory (DSOM), which relieves users of explicit data transfer management by supporting shared buffers across devices. DSOM allocates shared buffers in system memory and treats on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified, shared, invalid (MSI) cache coherence protocol is implemented in DSOM to keep the cache buffers coherent. In addition, we propose a novel strategy that minimizes communication cost between devices by launching each necessary data transfer as early as possible, which enables data transfer to overlap with kernel execution. Our experimental results show that the buffer access range analysis method has good applicability and that DSOM is efficient.
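
To make the coherence idea in the abstract concrete, the following is a minimal Python sketch of modified/shared/invalid state tracking for one shared buffer cached on several devices. It is not the authors' implementation: the `SharedBuffer` class, the fixed-size block granularity, the device names, and the transfer log are all hypothetical, and the "host" entry simply stands for the copy in system memory. The sketch also fetches a block before every write, which a real runtime could skip when a kernel overwrites the whole block.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class SharedBuffer:
    """Toy MSI tracker: the host copy is the home location and each
    device caches the buffer in blocks (hypothetical granularity)."""

    def __init__(self, num_blocks, devices):
        # Every device starts with no valid cached data.
        self.state = {d: [State.INVALID] * num_blocks for d in devices}
        self.transfers = []  # log of (source, destination, block) copies

    def _owner(self, block):
        # At most one device can hold a block in MODIFIED state.
        for device, states in self.state.items():
            if states[block] is State.MODIFIED:
                return device
        return None

    def _fetch(self, device, block):
        owner = self._owner(block)
        if owner is not None and owner != device:
            # Write the dirty block back to the host and downgrade the owner.
            self.transfers.append((owner, "host", block))
            self.state[owner][block] = State.SHARED
        self.transfers.append(("host", device, block))

    def acquire_read(self, device, blocks):
        for b in blocks:
            if self.state[device][b] is State.INVALID:
                self._fetch(device, b)
                self.state[device][b] = State.SHARED

    def acquire_write(self, device, blocks):
        for b in blocks:
            if self.state[device][b] is State.INVALID:
                self._fetch(device, b)  # conservative: fetch before writing
            for other in self.state:
                if other != device:
                    self.state[other][b] = State.INVALID
            self.state[device][b] = State.MODIFIED


# Hypothetical scenario: gpu0 writes blocks 0-1, then gpu1 reads blocks 1-2.
buf = SharedBuffer(4, ["gpu0", "gpu1"])
buf.acquire_write("gpu0", range(0, 2))
buf.acquire_read("gpu1", range(1, 3))
```

In this scenario only block 1 is written back from gpu0, because it is the only block that gpu1 reads while gpu0 holds it in the modified state. Block-level (rather than whole-buffer) tracking is what keeps the transfer small, which is the motivation for fine-grained access range analysis.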

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, 410073, China
Chang-qing Xun, Dong Chen, Qiang Lan & Chun-yuan Zhang
State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, 410073, China
Chang-qing Xun, Dong Chen, Qiang Lan & Chun-yuan Zhang

Corresponding author

Correspondence to Chang-qing Xun.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 61033008, 61272145, 60903041, and 61103080), the Research Fund for the Doctoral Program of Higher Education of China (No. 20104307110002), the Hunan Provincial Innovation Foundation for Postgraduate (No. CX2010B028), and the Fund of Innovation in Graduate School of NUDT (Nos. B100603 and B120605), China.

About this article

Cite this article

Xun, Cq., Chen, D., Lan, Q. et al. Efficient fine-grained shared buffer management for multiple OpenCL devices. J. Zhejiang Univ. - Sci. C 14, 859–872 (2013). https://doi.org/10.1631/jzus.C1300078

Received: 02 April 2013
Accepted: 12 September 2013
Published: 09 November 2013
Issue Date: November 2013
DOI: https://doi.org/10.1631/jzus.C1300078

Key words

CLC number

TP393
