Abstract
OpenCL provides full code portability across hardware platforms and is a good candidate programming model for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are required to manage the diverse OpenCL-enabled devices explicitly, including distributing the workload among devices and managing data transfer between them. These tedious tasks pose a significant challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly by supporting shared buffers across devices. DSOM allocates shared buffers in system memory and treats on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified-shared-invalid (MSI) cache coherence protocol is implemented in DSOM to keep the cache buffers coherent. In addition, we propose a novel strategy to minimize the communication cost between devices by launching each necessary data transfer as early as possible, which enables data transfer to overlap with kernel execution. Our experimental results show that the buffer access range analysis is widely applicable and that DSOM is efficient.
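As an illustration of the coherence scheme summarized above, the following is a minimal sketch of a modified-shared-invalid state machine for per-device cache buffers backed by a copy in system memory. It is not the DSOM implementation: the names `SharedBuffer`, `State`, `read`, `write`, and the device labels are hypothetical, Python lists stand in for host and device memory, and coherence is tracked for the whole buffer rather than for the fine-grained access ranges that the kernel parser would supply.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class SharedBuffer:
    """Host-resident buffer with one software-managed cache copy per device."""

    def __init__(self, data, devices):
        self.host = list(data)                    # copy held in system memory
        self.cache = {d: None for d in devices}   # stand-ins for on-device cache buffers
        self.state = {d: State.INVALID for d in devices}
        self.transfers = []                       # log of (source, destination) copies

    def _write_back(self):
        # If a device holds a MODIFIED copy, flush it to system memory first.
        for dev, st in self.state.items():
            if st is State.MODIFIED:
                self.host = list(self.cache[dev])
                self.transfers.append((dev, "host"))
                self.state[dev] = State.SHARED

    def _fetch(self, dev):
        # Bring an INVALID cache buffer up to date from system memory.
        if self.state[dev] is State.INVALID:
            self._write_back()
            self.cache[dev] = list(self.host)
            self.transfers.append(("host", dev))
            self.state[dev] = State.SHARED

    def read(self, dev, index):
        self._fetch(dev)
        return self.cache[dev][index]

    def write(self, dev, index, value):
        self._fetch(dev)
        # A write invalidates every other device's copy.
        for other in self.state:
            if other != dev:
                self.state[other] = State.INVALID
        self.cache[dev][index] = value
        self.state[dev] = State.MODIFIED
```

In this sketch, a write on one device followed by a read on another triggers a write-back to system memory and then a fetch to the reader, leaving both copies in the shared state. Reads that hit a valid cache buffer cause no transfer at all.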
Additional information
Project supported by the National Natural Science Foundation of China (Nos. 61033008, 61272145, 60903041, and 61103080), the Research Fund for the Doctoral Program of Higher Education of China (No. 20104307110002), the Hunan Provincial Innovation Foundation for Postgraduate (No. CX2010B028), and the Fund of Innovation in Graduate School of NUDT (Nos. B100603 and B120605), China
About this article
Cite this article
Xun, Cq., Chen, D., Lan, Q. et al. Efficient fine-grained shared buffer management for multiple OpenCL devices. J. Zhejiang Univ. - Sci. C 14, 859–872 (2013). https://doi.org/10.1631/jzus.C1300078
DOI: https://doi.org/10.1631/jzus.C1300078