Parallel Computing, Volume 58, October 2016, Pages 27-36

CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters

https://doi.org/10.1016/j.parco.2016.05.003

Abstract

GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs. It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions along with one-sided communication semantics. However, current approaches and designs for OpenSHMEM on GPU clusters do not take advantage of GDR features, leaving potential performance improvements untapped. In this paper, we introduce “CUDA-Aware” concepts for OpenSHMEM that enable operations to be performed directly from/on buffers residing in GPU memory. We propose novel and efficient designs that ensure “truly one-sided” communication for different intra-/inter-node configurations while working around hardware limitations. We achieve 2.5× and 7× improvement in point-to-point communication for intra-node and inter-node configurations, respectively. Our proposed framework achieves 2.2 μs for an intra-node 8-byte put operation from CPU to local GPU and 3.13 μs for an inter-node 8-byte put operation from GPU to remote GPU. The proposed designs lead to a 19% reduction in the execution time of the Stencil2D application kernel from the SHOC benchmark suite on the Wilkes system, which is composed of 64 dual-GPU nodes. Similarly, the evolution time of the GPULBM application is reduced by 45% on 64 GPUs. On a CS-Storm-based system with 8 GPUs per node, we show 50% and 23% improvement on 32 and 64 GPUs, respectively.

Introduction

The emergence of accelerators such as NVIDIA Graphics Processing Units (GPUs) is changing the landscape of supercomputing systems. GPUs, being PCI Express (PCIe) devices, have their own memory space and require data to be transferred to their memory through specific mechanisms prior to computation. The Compute Unified Device Architecture (CUDA) is a framework available for users to take advantage of GPUs. The latest CUDA drivers provide the “GPUDirect” set of features that enable efficient data movement among GPUs as well as between GPUs and peer PCIe devices. CUDA 5.0 introduced the GPUDirect RDMA (GDR) feature, which allows InfiniBand network adapters to directly read from or write to GPU device memory while completely bypassing the host. This has the potential to yield significant performance benefits, especially in the presence of the multiple communication configurations that GPU devices expose. In these heterogeneous systems, data can be transferred in four configurations: a) Host-to-Host (H-H), b) Device-to-Device (D-D), c) Host-to-Device (H-D), and d) Device-to-Host (D-H). Further, each of these configurations can be either intra-node or inter-node.
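For reference, the basic CUDA copy paths underlying these configurations, within a single process on a dual-GPU node, are sketched below; the buffer names and sizes are illustrative and error checking is omitted.

    /* Minimal CUDA sketch of host/device copy paths on a dual-GPU node.
     * Buffer names and sizes are illustrative only; error checks omitted. */
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        const size_t bytes = 1 << 20;
        float *h_buf = (float *)malloc(bytes);      /* host buffer */
        float *d_buf0, *d_buf1;                     /* device buffers */

        cudaSetDevice(0);
        cudaMalloc((void **)&d_buf0, bytes);
        cudaSetDevice(1);
        cudaMalloc((void **)&d_buf1, bytes);
        cudaSetDevice(0);

        /* Host-to-Device (H-D) and Device-to-Host (D-H) copies over PCIe */
        cudaMemcpy(d_buf0, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(h_buf, d_buf0, bytes, cudaMemcpyDeviceToHost);

        /* Device-to-Device (D-D) copy between the two GPUs in the node;
         * CUDA uses PCIe peer-to-peer when available, otherwise it stages
         * the data through host memory. */
        cudaMemcpyPeer(d_buf1, 1, d_buf0, 0, bytes);

        cudaFree(d_buf0);
        cudaSetDevice(1);
        cudaFree(d_buf1);
        free(h_buf);
        return 0;
    }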

Scientific applications use CUDA in conjunction with high-level programming models like the Message Passing Interface (MPI) and/or Partitioned Global Address Space (PGAS). Usually, CUDA is used for kernel computation and data movement between the CPU and GPU, while MPI and PGAS are used for inter-process communication. Several MPI implementations use CUDA under the hood to allow direct communication from GPU device memory and to transparently improve the performance of GPU-GPU communication using techniques like CUDA IPC, GDR, and pipelining, thus enabling applications to achieve better performance and productivity [1].
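To illustrate what this transparency looks like at the application level, the sketch below assumes a CUDA-aware MPI build (such as MVAPICH2) that accepts device pointers directly in communication calls; the buffer and message size are illustrative.

    /* Sketch of CUDA-aware MPI usage: with a CUDA-aware build, a device
     * pointer can be passed directly to MPI calls and the library handles
     * staging, IPC, or GDR internally. Run with at least two ranks. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024;
        double *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(double));

        if (rank == 0)          /* device buffer used directly as send source */
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)     /* device buffer used directly as receive target */
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }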

PGAS programming models, with their lightweight one-sided communication and low-overhead synchronization semantics, present an attractive alternative to MPI for developing data-intensive applications that may have irregular communication patterns [2], [3]. There are two categories of PGAS models: 1) language-based, such as Unified Parallel C (UPC) [4], and 2) library-based, such as OpenSHMEM [5]. The OpenSHMEM memory model uses symmetric memory allocations that can be accessed remotely. It provides better programmability by allowing a process to access a data variable at a remote process simply by specifying the corresponding local symmetric variable.
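A minimal OpenSHMEM sketch of this symmetric-addressing idea is shown below; the buffer name and the value written are illustrative, and error checking is omitted.

    /* Minimal OpenSHMEM sketch: a symmetric allocation is addressed remotely
     * by naming the local symmetric variable plus the target PE. */
    #include <shmem.h>

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Collective symmetric allocation: every PE gets a buffer at the same
         * symmetric offset, so 'buf' also names the corresponding buffer on
         * every remote PE. */
        long *buf = (long *)shmem_malloc(sizeof(long));
        *buf = me;

        /* One-sided put: PE 0 writes into PE 1's symmetric buffer without any
         * matching call on PE 1. */
        if (me == 0 && npes > 1)
            shmem_long_p(buf, 42L, 1);

        shmem_barrier_all();   /* completion and visibility */
        shmem_free(buf);
        shmem_finalize();
        return 0;
    }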

For current OpenSHMEM application programs that involve data movement between GPUs, the developer has to separately manage the data movement between GPU device memory and main memory at each process using CUDA, as well as the data movement between processes using OpenSHMEM. In other words, the current OpenSHMEM standard does not support symmetric allocation for heterogeneous memory systems like GPU-based clusters. Indeed, the current OpenSHMEM specification does not provide an allocation API that supports different memory kinds. Further, it does not provide data consistency semantics with regard to completion and visibility in different memories. These shortcomings severely limit the programmability of the OpenSHMEM model, especially considering the increasing use of such heterogeneous systems for scientific applications and the increased emphasis on their programmability (e.g., OpenACC). Furthermore, the current model nullifies the benefits of asynchronous one-sided communication by requiring the target process to perform a CUDA memory copy from host memory to device memory in order to complete the transfer, as shown in Fig. 1(a).
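The staged workflow of Fig. 1(a) described above can be sketched as follows; the function and buffer names are illustrative (not the paper's code), the routine is called collectively by all PEs, and a barrier stands in for whatever synchronization the application uses.

    /* Sketch of the current (non-CUDA-aware) workflow of Fig. 1(a): the source
     * PE stages GPU data into host memory, puts it into the target's host-side
     * symmetric buffer, and the target PE must copy it into its own GPU,
     * breaking the one-sided nature of the transfer. Called by all PEs. */
    #include <shmem.h>
    #include <cuda_runtime.h>

    #define N 4096

    void exchange(const float *d_src, float *d_dst, int target_pe) {
        /* host-side symmetric staging buffer (collective allocation) */
        float *h_sym = (float *)shmem_malloc(N * sizeof(float));
        int me = shmem_my_pe();

        if (me == 0) {
            /* 1. GPU -> host copy at the source */
            cudaMemcpy(h_sym, d_src, N * sizeof(float), cudaMemcpyDeviceToHost);
            /* 2. host -> remote host one-sided put */
            shmem_putmem(h_sym, h_sym, N * sizeof(float), target_pe);
        }
        shmem_barrier_all();      /* the target must wait for the put */
        if (me == target_pe) {
            /* 3. host -> GPU copy at the target completes the transfer */
            cudaMemcpy(d_dst, h_sym, N * sizeof(float), cudaMemcpyHostToDevice);
        }
        shmem_free(h_sym);
    }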

For wide acceptance of the OpenSHMEM model on GPU systems, these limitations need to be handled at the level of the programming model and its associated runtimes. We proposed a CUDA-Aware OpenSHMEM [6] runtime that hides the complexity of GPU programming. The envisioned model is illustrated in Fig. 1(b). Further, we proposed to support this concept with a GDR-aware OpenSHMEM implementation that efficiently provides high performance and truly one-sided communication along with high productivity.

GPUDirect RDMA has the potential to deliver very low latency compared to transfers staged through the host, without any involvement from the remote process. This is evident from prior research [7]. However, GDR bandwidth is severely limited when compared to the bandwidth an InfiniBand HCA offers, owing to well-known PCIe limitations on certain architectures (Section 2.2). For specific data movements in multi-GPU configurations, this tends to degrade both bandwidth and latency. Table 1 shows the inefficiency of the current OpenSHMEM runtimes for GPU systems with GDR. At the same time, it highlights the potential impact of GDR on data movement from/to GPUs.

In this paper, we tackle the aforementioned limitations and inefficiencies of the current OpenSHMEM model and runtimes on GPU systems. We propose a CUDA-Aware OpenSHMEM design with minor extensions to the original model. We also propose a novel framework to efficiently design an OpenSHMEM runtime for GPU-based systems using GDR. To the best of our knowledge, this is the first paper exploiting GDR features in the design of an efficient OpenSHMEM runtime. This paper makes the following contributions:

  • Simple extensions to the programming model to provide CUDA-Aware OpenSHMEM operations.

  • GDR-based designs to efficiently support OpenSHMEM communication from/to GPUs for all configurations.

  • Novel and efficient truly one-sided communication runtime designs for both intra-node and inter-node configurations.

  • Hybrid and proxy-based designs to overcome current hardware limitations on GPU-based systems.

  • A redesign of the LBM application to use OpenSHMEM directly from/to GPU memories, demonstrating the benefits of such designs on an end application.

  • Detailed evaluation and analysis on two different multi-GPU systems.

We have designed our proposed framework on top of MVAPICH2-X [8] and evaluated its performance on two different GPU cluster systems. We show that the proposed framework achieves up to 2.5× and 7× latency improvement in the small and medium message ranges for intra-node and inter-node communication, respectively. For application-level evaluation, we have used the Stencil2D application kernel from the SHOC suite and a redesigned LBM application for GPUs. The designs yield a 19% improvement in the execution time of the Stencil2D kernel run on 64 GPU nodes of the first system. For the GPULBM application, we achieve 53% and 45% improvement in the execution time of the evolution phase on 32 and 64 GPU nodes of the first system. On our second, CS-Storm-based GPU cluster, we achieve 50% and 23% improvement for the GPULBM application run on 32 and 64 GPUs, respectively.

Section snippets

GPU node architecture and GPUDirect technology

Current generation GPUs from NVIDIA are PCIe devices. Communication between GPU and host, and between two GPUs happens over the PCIe bus. NVIDIA’s GPUDirect technology provides a set of features that enable efficient communication among GPUs used by different processes and between GPUs and other devices like network adapters. With CUDA 4.1, NVIDIA addressed the problem of inter-process GPU-to-GPU communication within a node through CUDA Inter-Process Communication (IPC). A process can map and
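For reference, the CUDA IPC mechanism mentioned above follows the pattern sketched below; the handle-exchange helpers (send_handle/recv_handle) are hypothetical placeholders for whatever transport the processes use to share the handle.

    /* Sketch of CUDA IPC between two processes on the same node: the owner
     * exports a handle to its device allocation, a peer maps the handle and
     * can then copy to/from the remote GPU buffer directly. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    /* Exporting process */
    void export_buffer(float *d_buf) {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_buf);          /* create IPC handle */
        /* send_handle(&handle, sizeof(handle));         hypothetical exchange */
    }

    /* Importing process */
    void import_and_copy(float *d_local, size_t bytes) {
        cudaIpcMemHandle_t handle;
        /* recv_handle(&handle, sizeof(handle));         hypothetical exchange */
        void *d_remote = NULL;
        cudaIpcOpenMemHandle(&d_remote, handle,
                             cudaIpcMemLazyEnablePeerAccess);  /* map peer memory */
        cudaMemcpy(d_local, d_remote, bytes, cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(d_remote);              /* unmap when done */
    }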

Model extensions for CUDA-awareness

OpenSHMEM uses symmetric memory regions to expose shared memory abstraction among processing elements (PEs). These regions can be allocated dynamically using collective routines like shmem_malloc, shmem_align and others. However, as GPU memory space is disjoint from CPU main memory, OpenSHMEM does not provide a way to specify where the symmetric memory regions are allocated on GPU clusters. Hence as shown in Fig. 1(a), the current model forces the developer to move data from GPU memory to the
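Purely as a hypothetical illustration of the direction such extensions could take, the sketch below allocates a GPU-resident symmetric region and communicates from it directly; shmem_malloc_kind and SHMEM_MEM_GPU are placeholder names, not the API proposed in the paper or defined by the OpenSHMEM specification.

    /* Hypothetical placeholder declarations, for illustration only: not part
     * of OpenSHMEM and not the extension API proposed in this paper. */
    #include <shmem.h>
    #include <stddef.h>

    typedef enum { SHMEM_MEM_HOST = 0, SHMEM_MEM_GPU = 1 } shmem_mem_kind_t;
    extern void *shmem_malloc_kind(size_t size, shmem_mem_kind_t kind);

    void gpu_put_example(int target_pe, size_t n) {
        /* allocate a symmetric region backed by GPU device memory (hypothetical) */
        float *d_sym = (float *)shmem_malloc_kind(n * sizeof(float), SHMEM_MEM_GPU);

        /* ... a CUDA kernel fills d_sym on the local GPU ... */

        /* one-sided put directly from local GPU memory to the target PE's
         * GPU-resident symmetric buffer; no host staging in the application */
        shmem_putmem(d_sym, d_sym, n * sizeof(float), target_pe);
        shmem_quiet();            /* local completion of the put */

        shmem_free(d_sym);
    }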

GDR-aware OpenSHMEM runtime designs

In this section, we discuss and propose different alternatives for designing an efficient and truly one-sided CUDA-aware OpenSHMEM runtime that exploits GDR and other CUDA features for both intra-node and inter-node data movement from/to GPU memory.

Experimental setup

We have used two different systems to evaluate the performance of the proposed GDR-aware OpenSHMEM framework. The first system is the Wilkes cluster at Cambridge University, which has 128 nodes; each node is a dual-socket, 6-core Intel Ivy Bridge system equipped with 2 NVIDIA Tesla K20 GPUs and 2 FDR IB HCAs. The second test-bed is the new CS-Storm-based cluster at CSCS, which has 12 nodes, each with 16 K40-class GPUs (8 dual-GPU K80 cards) and two IB HCAs. For the evaluation we compare our designs labeled

Related work

There have been several efforts in making programming models accelerator- and co-processor-aware. The work in [1] proposed efficient GPU-to-GPU communication by overlapping RDMA data transfers with CUDA memory copies inside the MPI library. The authors in [7] extended this work with GDR support for MPI libraries. CUDA support for X10 was introduced as part of the Asynchronous PGAS (APGAS) model, which enables writing single-source, efficient code for heterogeneous and multi-core architectures [11]. Several

Conclusion and future work

In this paper, we presented the CUDA-aware OpenSHMEM concept to provide performance and productivity on NVIDIA GPUs, along with associated runtimes that take advantage of GDR technology for intra-node and inter-node communication. Our GDR-aware framework uses alternative and hybrid designs that ensure the truly one-sided property of the OpenSHMEM programming model, enable asynchronous progress, and at the same time work around hardware limitations. The experimental results show 7× improvement in the

Acknowledgment

This research is supported in part by National Science Foundation grants #OCI-1148371 and #CCF-1213084.

References

  • H. Wang et al., MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, Int’l Supercomputing Conference (ISC), 2011.

  • G. Cong et al., Fast PGAS Implementation of Distributed Graph Algorithms, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.

  • S. Olivier et al., Scalable Dynamic Load Balancing Using UPC, Proceedings of the 2008 37th International Conference on Parallel Processing, 2008.

  • UPC Consortium, UPC Language Specifications, v1.2, Tech Report LBNL-59208, 2005.

  • OpenSHMEM, OpenSHMEM Application Programming Interface, ...

  • S. Potluri et al., Extending OpenSHMEM for GPU Computing, Parallel Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, 2013.

  • S. Potluri et al., Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Parallel Processing (ICPP), 2013 42nd International Conference on, 2013.
