DOI: 10.1145/2304576.2304623
research-article

SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters

Published: 25 June 2012

Abstract

In this paper, we propose SnuCL, an OpenCL framework for heterogeneous CPU/GPU clusters. We show that the original OpenCL semantics fit the heterogeneous cluster programming environment naturally, and that the framework achieves both high performance and ease of programming. The target cluster architecture consists of a single designated host node and many compute nodes, connected by an interconnection network such as Gigabit Ethernet or InfiniBand. Each compute node is equipped with multicore CPUs and multiple GPUs. Each GPU, or a set of CPU cores, becomes an OpenCL compute device. The host node executes the host program of an OpenCL application. SnuCL presents the heterogeneous CPU/GPU cluster to the user as a single system image, as if it were running a single operating system instance. It allows the application to use compute devices in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. SnuCL also provides collective communication extensions to OpenCL to facilitate manipulating memory objects. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.
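The single-system-image idea in the abstract can be illustrated with a small toy model. The sketch below is not SnuCL's API or implementation; the names `Device`, `node_devices`, and `single_system_image` are hypothetical, and it assumes one CPU device per compute node (the abstract only says "a set of CPU cores" becomes a device). It shows how devices spread across compute nodes might be flattened into the one device list a host program would see.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Device:
    node: str   # hypothetical compute-node name
    kind: str   # "CPU" or "GPU"
    index: int  # position among devices of this kind on its node


def node_devices(node: str, num_gpus: int) -> List[Device]:
    """Assumption: one CPU device (the node's cores) plus one device per GPU."""
    cpu = [Device(node, "CPU", 0)]
    gpus = [Device(node, "GPU", i) for i in range(num_gpus)]
    return cpu + gpus


def single_system_image(nodes: Dict[str, int]) -> List[Device]:
    """Flatten per-node devices into one list, as if all were on the host node."""
    devices: List[Device] = []
    for node, num_gpus in nodes.items():
        devices.extend(node_devices(node, num_gpus))
    return devices


# Hypothetical cluster: node name -> number of GPUs on that node.
cluster = {"node0": 2, "node1": 4}
devices = single_system_image(cluster)
gpus = [d for d in devices if d.kind == "GPU"]
```

In this model the host program iterates over `devices` without knowing which node owns each one, which mirrors the claim that no explicit communication API is needed in the application source.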


Published In

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
June 2012
400 pages
ISBN:9781450313162
DOI:10.1145/2304576

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clusters
  2. heterogeneous computing
  3. opencl
  4. programming models

Qualifiers

  • Research-article

Conference

ICS'12
ICS'12: International Conference on Supercomputing
June 25 - 29, 2012
San Servolo Island, Venice, Italy

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

  • (2025) Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures. The International Journal of High Performance Computing Applications. DOI: 10.1177/10943420241313385. Online publication date: 10-Jan-2025
  • (2024) Workload Scheduling on Heterogeneous Devices. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), pages 1-11. DOI: 10.23919/ISC.2024.10528933. Online publication date: May-2024
  • (2024) IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 35:10, pages 1796-1809. DOI: 10.1109/TPDS.2024.3429010. Online publication date: Oct-2024
  • (2024) A Runtime Manager Integrated Emulation Environment for Heterogeneous SoC Design with RISC-V Cores. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 23-30. DOI: 10.1109/IPDPSW63119.2024.00013. Online publication date: 27-May-2024
  • (2024) A Survey on Heterogeneous CPU–GPU Architectures and Simulators. Concurrency and Computation: Practice and Experience, 37:1. DOI: 10.1002/cpe.8318. Online publication date: 30-Oct-2024
  • (2023) CEDR: A Compiler-integrated, Extensible DSSoC Runtime. ACM Transactions on Embedded Computing Systems, 22:2, pages 1-34. DOI: 10.1145/3529257. Online publication date: 24-Jan-2023
  • (2023) Remote Execution of OpenCL and SYCL Applications via rOpenCL. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 51-60. DOI: 10.1109/IPDPSW59300.2023.00020. Online publication date: May-2023
  • (2023) Removing Host Interventions from GPU Accelerated Neural Network. 2023 IEEE International Conference on Consumer Electronics (ICCE), pages 1-2. DOI: 10.1109/ICCE56470.2023.10043523. Online publication date: 6-Jan-2023
  • (2023) An Asynchronous Dataflow-Driven Execution Model For Distributed Accelerator Computing. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pages 82-93. DOI: 10.1109/CCGrid57682.2023.00018. Online publication date: May-2023
  • (2023) SNCL: a supernode OpenCL implementation for hybrid computing arrays. The Journal of Supercomputing, 80:7, pages 9471-9493. DOI: 10.1007/s11227-023-05766-3. Online publication date: 8-Dec-2023
