Efficient fine-grained shared buffer management for multiple OpenCL devices

Journal of Zhejiang University SCIENCE C

Abstract

OpenCL provides full code portability across hardware platforms and is a good candidate for programming heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are required to manage the diverse OpenCL-enabled devices explicitly, including distributing the workload among devices and managing data transfers between them. These tedious tasks pose a significant challenge for programmers. In this paper, we present a distributed shared OpenCL memory (DSOM), which supports shared buffers across devices and thereby relieves users of having to manage data transfers explicitly. DSOM allocates shared buffers in system memory and treats on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM that analyzes the buffer access range of each kernel. DSOM implements a basic modified-shared-invalid (MSI) cache coherence protocol to keep the cache buffers coherent. In addition, we propose a novel strategy that minimizes the communication cost between devices by launching each necessary data transfer as early as possible, which allows data transfer to overlap with kernel execution. Our experimental results show that the buffer access range analysis is widely applicable and that DSOM is efficient.
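The coherence scheme outlined in the abstract can be illustrated with a minimal sketch. It is not DSOM's implementation: the class `SharedRegion`, its methods, and the device names are hypothetical, and it assumes that each write overwrites the whole tracked region, so a writer never needs to fetch the old contents first. The system-memory copy acts as the backing store, each on-device copy is a cache buffer in the modified, shared, or invalid state, and transfers are only logged rather than performed.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


HOST = "host"  # hypothetical name for the system-memory copy of the region


class SharedRegion:
    """Coherence state of one buffer region across the host and all devices."""

    def __init__(self, devices):
        # The host copy starts valid; every on-device cache copy starts invalid.
        self.state = {HOST: State.SHARED}
        self.state.update({d: State.INVALID for d in devices})
        self.transfers = []  # log of (source, destination) copies

    def _owner(self):
        # At most one holder is MODIFIED at any time.
        for holder, st in self.state.items():
            if st is State.MODIFIED:
                return holder
        return None

    def _make_host_valid(self):
        # Write dirty data back to system memory if a device owns it.
        owner = self._owner()
        if owner is not None and owner != HOST:
            self.transfers.append((owner, HOST))
            self.state[owner] = State.SHARED
            self.state[HOST] = State.SHARED

    def read(self, device):
        # A kernel on `device` reads the region: fetch it only if stale.
        if self.state[device] is State.INVALID:
            self._make_host_valid()
            self.transfers.append((HOST, device))
            self.state[device] = State.SHARED

    def write(self, device):
        # Assumption: the write covers the whole region, so no fetch is
        # needed; every other copy simply becomes invalid.
        for holder in self.state:
            self.state[holder] = State.INVALID
        self.state[device] = State.MODIFIED


region = SharedRegion(["gpu0", "gpu1"])
region.write("gpu0")   # gpu0 now holds the only valid copy
region.read("gpu1")    # write back to host, then copy to gpu1
region.read("gpu1")    # already valid on gpu1: no transfer
print(region.transfers)
```

Tracking state per region rather than per buffer is what makes the management fine-grained: only the ranges a kernel actually touches need to be transferred or invalidated. In this sketch the region boundaries are taken as given, whereas the abstract attributes them to the kernel parser's access range analysis.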

Author information

Corresponding author

Correspondence to Chang-qing Xun.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 61033008, 61272145, 60903041, and 61103080), the Research Fund for the Doctoral Program of Higher Education of China (No. 20104307110002), the Hunan Provincial Innovation Foundation for Postgraduate (No. CX2010B028), and the Fund of Innovation in Graduate School of NUDT (Nos. B100603 and B120605), China.

About this article

Cite this article

Xun, Cq., Chen, D., Lan, Q. et al. Efficient fine-grained shared buffer management for multiple OpenCL devices. J. Zhejiang Univ. - Sci. C 14, 859–872 (2013). https://doi.org/10.1631/jzus.C1300078

  • DOI: https://doi.org/10.1631/jzus.C1300078
