
1 Introduction

Over the last decade, the popularity of GPU-based supercomputers has increased due to their promising performance per watt ratio. Thus, nowadays, HPC centers often include GPU-based systems into their considerations for new hardware acquisitions. However, in tendering and procurement processes, HPC centers face the challenge to make an informed decision across available operational concepts of compute nodes with attached GPUs (here called GPU nodes). Operational concepts can vary in system configuration, i. e., number of CPU sockets and GPUs within a compute node, and the kind of GPU resource allocation.

The different operational concepts for GPU nodes are also apparent in the Top500 [21]: Titan (#3 system) deploys per GPU node one NVIDIA Kepler GPU attached to an AMD Opteron CPU that consists of two NUMA nodes [16]. Tsubame 2.5 (#40 system) employs three NVIDIA Kepler GPUs per (up to) two-socket Intel Westmere CPU in their GPU nodes. At RWTH Aachen University, the IT Center provides GPU nodes with either two NVIDIA Kepler or two Pascal GPUs attached to two-socket Intel CPUs. On all these HPC clusters, batch jobs are (currently) scheduled exclusively per GPU node [15, 20]. However, from our experiences, users often run applications that are only capable of using a single GPU per node or do not efficiently run on more than one GPU per node. Other users only exploit the node’s GPUs and leave the CPUs idling. One main reason for that is that they cannot or do not want to invest additional effort to leverage all GPUs and CPU cores within one node. Thus, from an HPC center perspective, operational concepts that consider single-GPU and multi-GPU nodes must be compared with respect to their total costs and obtained productivity. In the multi-GPU node configuration, the capabilities of GPU management in the job scheduler or virtualization possibilities can further play an important role.

In this paper, we compare different operational setups of GPU nodes with various program execution models in the context of our university HPC center. For instance, a GPU node comprising one GPU and one CPU socket executes either GPU-only or GPU-CPU hybrid programs; a GPU node with two GPUs and two CPU sockets may additionally run two independent program instances. We run a full productivity study including the system’s total cost of ownership (TCO) with hardware costs, energy costs, and development costs for the parallelization of the applications and for further tuning to enable runs on multiple GPUs within a node. In detail, we investigate the productivity of a Conjugate Gradient (CG) solver and of a bio-medical real-world application on Intel Sandy Bridge and Broadwell servers combined with NVIDIA Kepler or Pascal GPUs.

The rest of the paper is structured as follows: Sect. 2 covers related work. In Sect. 3, we give an overview of the TCO model, the derived productivity measure and the corresponding assumptions and quantifications. In Sect. 4, we introduce the configurations representing our operational concepts of the GPU nodes. The parallelization of the CG solver and bio-medical application is described in Sect. 5. We present our results answering typical questions for GPU node operation in Sect. 6. Finally, we conclude with a recommendation for procurement in Sect. 7.

2 Related Work

While performance and power consumption of GPUs have been widely investigated in research, operational concepts of GPUs with respect to total costs and productivity have not been studied so far (to the best of our knowledge). Several works cover GPU resource management on the level of the operating system or job scheduler. For example, in [12], a CUDA wrapper library has been manually implemented to override CUDA device management calls enabling more than one user per GPU node with the given resource management constraints of the batch scheduler. Another solution for resource management is based on virtualization of GPUs that has been examined in numerous works. GViM [9] is based on Xen virtual machines. For decoupling GPUs and CPUs in resource allocation, the SLURM batch scheduler was extended with a new GPU device type in [11]. Basis for this remote GPU virtualization is rCUDA [17] that is also used in a runtime evaluation of different scenarios sharing single GPUs or accessing them remotely in another compute node [19]. While we focus on simple resource management strategies, more complex ones could be added later to our model.

TCO and productivity have been mainly studied in the context of the DARPA HPCS program [5] where most works have been published in a special issue journal [6] and cover (mathematical) models of productivity. These works only scarcely present quantifications of TCO parameters and do not apply their models to operational concepts of GPU nodes. In our previous works [24, 25], we showed applicability of our TCO and productivity models to real-world HPC setups and compared costs per program run of real-world applications for (a single) GPU setup, CPU setup and Xeon Phi setup including development efforts [24].

The CG method [10] has been widely studied. A first multi-GPU implementation is given in [3], which still involved a workaround for double-precision calculations. Later multi-GPU implementations focus, e. g., on preconditioning [1], on automatic selection of the fastest of several kernels for the matrix-vector multiplication [4], or on improving the performance by reordering the matrix blocks [22]. A performance study of several kernels including CG with hybrid MPI-CUDA and MPI-OpenMP/CUDA computations is given in [14]. In [13], a heterogeneous implementation of a finite element method involving a CG algorithm on CPUs and GPU is analyzed aiming at a workload distribution that gives optimal performance and energy efficiency. However, the authors only use a single GPU and measure power using internal hardware counters instead of an external power meter. The CG algorithm newly developed in this work supports heterogeneous computations involving several CPU sockets and up to two GPUs. This implementation is highly tuned for our test systems and the structure of the used matrix, especially with respect to data transfers. Additionally, a reimplementation allowed us to track the development effort over time.

An algorithm for the bio-medical application and a shared-memory parallelization using OpenMP was developed in [2]. It was further tuned and ported with OpenCL and OpenACC to NVIDIA Fermi GPUs [26] and with OpenMP to the Intel Xeon Phi [18] in our previous works. In [24], the application’s OpenMP and OpenACC implementations were compared with respect to TCO. However, the analyzed OpenACC implementation only utilized a single GPU and, thus, did not cover different operational GPU concepts. For our purposes, we developed a CUDA implementation while tracking development efforts. We tuned the code for the (newer) hardware supporting multi-GPU and heterogeneous computations using both the CPU as well as GPU architectures.

3 TCO and Productivity

For the comparison of different operational concepts of GPU nodes, we follow an integral approach from an HPC center perspective that is based on total ownership costs and productivity. These models are straightforward and are intended to cover the main cost and benefit factors of a real-world procurement.

3.1 Model

Total costs of ownership represent the costs to acquire, operate and maintain HPC systems. Here, we follow the TCO model that we have created in [24, 25]. Basically, we distinguish between one-time costs \(C_{ot}\) and annual costs \(C_{pa}\) that depend on the number of compute nodes n and the system lifetime \(\tau \) (e.g., 5 years) as shown in (1). One-time costs comprise costs for hardware acquisition, building, infrastructure, operating system (OS) and environment installation, and development effort needed to parallelize an application for the targeted HPC system and configuration. Annual costs cover maintenance costs for hardware, OS, environment and the application, as well as energy costs and compiler/software costs. To pay for these costs, HPC centers and institutes usually rely on federal, state and university funding that provides a fixed investment I so that an upper bound for total costs is given (see (2)). Solving (2) for n, we can compute the number of nodes that can be purchased for a given fixed investment I and given system lifetime \(\tau \).

$$\begin{aligned} \text {TCO}(n,\tau )= & {} C_{ot}(n) + C_{pa}(n)\cdot \tau \end{aligned}$$
(1)
$$\begin{aligned} \text {TCO}(n,\tau )\le & {} \,I \end{aligned}$$
(2)
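As a minimal sketch of how (1) and (2) determine the node count, assume (this is not stated above) that both cost terms grow linearly with n, i.e., \(C_{ot}(n) = c_{fixed} + c_{node} \cdot n\) and \(C_{pa}(n) = a_{node} \cdot n\). The function and parameter names are hypothetical, and any numbers passed to them would be placeholders rather than values from this study.

```python
import math


def tco(n, lifetime_years, one_time_fixed, one_time_per_node, annual_per_node):
    """Eq. (1) under the ASSUMED linear cost model (hypothetical parameters)."""
    one_time = one_time_fixed + one_time_per_node * n   # C_ot(n)
    annual = annual_per_node * n                        # C_pa(n)
    return one_time + annual * lifetime_years


def max_nodes(investment, lifetime_years, one_time_fixed, one_time_per_node, annual_per_node):
    """Largest n that satisfies Eq. (2), TCO(n, tau) <= I, for the linear model."""
    cost_per_node = one_time_per_node + annual_per_node * lifetime_years
    if cost_per_node <= 0:
        raise ValueError("per-node cost must be positive")
    remaining = investment - one_time_fixed
    return max(0, math.floor(remaining / cost_per_node))
```

With a non-linear cost model (e.g., volume discounts), the closed-form division would have to be replaced by a search over n.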

To make an informed decision in a procurement, we do not only have to consider TCO but further need to account for the benefit that is gained by employing the HPC system. This can be done using a productivity metric that is, in economic terms, the ratio of units of output to units of input. We use the productivity metric that we defined in [25], i. e., we take as value of an HPC system the number of application runs \(r(n,\tau )\) that can be executed over the system’s lifetime. Overall, productivity \(\varPsi \) can then be expressed as:

$$\begin{aligned} \varPsi (n,\tau ) = \frac{ value }{ cost } = \frac{r(n,\tau )}{\text {TCO}(n,\tau )} \quad \text {with} \quad r(n,\tau ) = \frac{\alpha \cdot \tau }{t(n)} \end{aligned}$$
(3)

where t(n) represents the application’s runtime and \(\alpha \) the system availability that accounts for downtimes or maintenance periods. While, formally, the runs of all applications executed on the HPC system should be summed up, we take a simplified approach here: We assume that only a single application is running for the whole system lifetime. Furthermore, we ignore any benefits gained through distributed large-scale runs, since we focus on the differences of operational concepts of GPU nodes. In this context, we investigate applications that run on a single node, but can be executed simultaneously similar to a parameter study.
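The metric in (3) can be sketched in a few lines. The unit conversion is an assumption of this sketch: \(\tau \) is taken in years and t(n) in seconds, so the lifetime is converted to seconds before dividing. All names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: tau in years, t(n) in seconds


def runs_over_lifetime(alpha, lifetime_years, runtime_seconds):
    """r(n, tau) = alpha * tau / t(n), with tau converted to seconds."""
    return alpha * lifetime_years * SECONDS_PER_YEAR / runtime_seconds


def productivity(alpha, lifetime_years, runtime_seconds, total_cost):
    """Psi(n, tau) = r(n, tau) / TCO(n, tau): application runs per unit of cost."""
    return runs_over_lifetime(alpha, lifetime_years, runtime_seconds) / total_cost
```

Here `runtime_seconds` stands for t(n) exactly as used in (3); how t(n) is derived from a single-node measurement is left to the caller.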

3.2 Quantifications

For the application of the introduced TCO and productivity model to a real-world HPC setup, we make the following assumptions and quantifications based on our experiences from cluster procurement and operation at the IT Center of RWTH Aachen University which are also described in detail in [24].

Regarding the one-time costs, we take hardware list prices from our HPC vendors in 2013 and 2017. Building costs get amortized over 25 years and are, thus, referenced as annual costs here. Development costs are based on the effort that a single experienced GPU developer spent on parallelizing and tuning the applications under investigation, so that the effect of varying programming skills on effort is reduced. The corresponding salary of a full-time equivalent is derived from the funding guidelines of the German Science Foundation [8] and the European Commission’s CORDIS [7]. Since our system administrators are experienced in running GPU clusters and have established an environment that can be easily rolled out to all nodes, we do not account for any additional environment costs. However, an implementation of flexible resource management into our LSF job scheduler is assumed to cost one administrator two person-days.

For the annual costs, we assume administrative costs of per compute node. We express the annual building costs with respect to the maximum power consumption of the given node configuration since the electrical supply is the limiting factor for housing machinery in the building. For the energy costs, we take with an estimated PUE of 1.5 in 2013. Furthermore, we divide both applications into a serial and a parallel part. The former is not measured explicitly but assumed to have a fixed runtime with a power consumption corresponding to one fully-loaded core and the rest of the system idling. The parallel part corresponds to the actual work of the algorithm, which is parallelized across the devices. Its runtime and power consumption are measured explicitly. As each of our systems has two separate power supplies, the power consumption of both was measured on separate channels and summed up to obtain the final values. If the hardware setup contains fewer than two GPUs or CPU sockets, their idle power consumption is subtracted from the measured values.
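The split into a serial and a parallel part suggests the following sketch for the energy consumed by one application run. It assumes that the PUE simply scales the node's energy to facility level, and it takes the electricity price as a parameter because no specific value is implied here. All names are hypothetical.

```python
JOULES_PER_KWH = 3.6e6


def energy_per_run_kwh(t_serial_s, p_serial_w, t_parallel_s, p_parallel_w, pue):
    """Facility-level energy of one run.

    Serial part: fixed runtime at the power of one loaded core, rest idling.
    Parallel part: measured runtime and measured power.
    ASSUMPTION: PUE multiplies the node energy to include facility overhead.
    """
    node_joules = t_serial_s * p_serial_w + t_parallel_s * p_parallel_w
    return pue * node_joules / JOULES_PER_KWH


def energy_cost_per_run(t_serial_s, p_serial_w, t_parallel_s, p_parallel_w, pue, price_per_kwh):
    """Energy cost of one run; price_per_kwh is a caller-supplied placeholder."""
    kwh = energy_per_run_kwh(t_serial_s, p_serial_w, t_parallel_s, p_parallel_w, pue)
    return kwh * price_per_kwh
```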

Finally, we assume a fixed investment of from which we compute the number of nodes n. We set the system lifetime \(\tau \) to 5 years and the system usage rate to 80 %.

4 GPU-CPU Configurations

For the comparison of TCO and productivity across different operational concepts of GPU nodes, we take two systems from the RWTH’s compute cluster as basis from which we derive various GPU-CPU configurations, i. e., the combinations of different amounts of CPU sockets and GPU devices together with a suitable program execution model:  

Kepler: 2 Intel Xeon E5-2680 CPUs @ 2.7 GHz (Sandy Bridge) with \(2\times 8\) cores, 2 NVIDIA K20Xm Kepler GPUs.

Pascal: 2 Intel Xeon E5-2650 v4 CPUs @ 2.2 GHz (Broadwell) with \(2\times 12\) cores, 2 NVIDIA P100 Pascal GPUs.

As notation for the different GPU-CPU configurations, we use tuples of the form \((n_g, n_c) \in \{0,1,2\}^2\) with \(n_g\) denoting the number of involved GPUs and \(n_c\) the number of involved CPU sockets. This kind of tuple indicates that an executed program completely uses the given resources. The tuple (\(\frac{1,1}{1,1}\)) specifies the configuration with two parallel executions of the same application on 1 GPU and 1 CPU each. This configuration represents a job scheduler running two jobs in parallel on a single node, each given one CPU socket and one GPU. The notation indicates that n GPUs or CPU sockets are available but only \(n'\) are used for program execution, i. e., \(n - n'\) are idling. The purpose of these configurations is solely for comparison if GPUs are not used at all. All investigated configurations are summarized in Table 1. In the following, the term device is used as wildcard for either one GPU or all CPU sockets involved in program execution.

Table 1. List of considered configurations

5 Applications

The CG and bio-medical application parallelized with OpenMP and CUDA are used to evaluate the different configurations. While we highly tuned these applications for the Kepler system, we have not yet focused on the Pascal architecture which is left for future work. However, we optimized the ratios for splitting the computations across the different devices. As common ground of both applications, we use a parallel first touch on the host to ensure data locality in the main memory of our cc-NUMA systems and pinned memory to increase the throughput of memory transfers between host and GPU memory. We apply asynchronous memory transfers and computations (where applicable) by using streams and events. Additionally, we hide latency of enqueuing kernels and memory copies on the GPUs by using separate host threads for the enqueuing operations.

5.1 Conjugate Gradient (CG)

First, we implement a double-precision CG algorithm for solving a linear equation system \(A \cdot x = b\) [10]. We use the sparse symmetric positive definite Serena matrix with \(n \approx 1.4 \times 10^{6}\) rows, \(nnz \approx 64.1 \times 10^{6}\) non-zeros, and a maximum of 249 non-zeros per row. To achieve the best data locality and performance on both device types, the matrix is stored in the compressed row storage format on the host with a memory footprint of roughly 775 MB, and in the ELLPACK-R format [23] on the GPUs (yielding 4.19 GB). The vectors have a size of \({\sim }90\,\mathrm{MB}\).
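The two footprints can be approximated from the matrix dimensions alone. The sketch below assumes 8-byte double-precision values and 4-byte integer indices (the index width is an assumption) and uses the usual layouts: compressed row storage keeps one value and one column index per non-zero plus a row-pointer array, while ELLPACK-R pads every row to the maximum row length and adds one row-length entry per row. With the rounded sizes given above, this yields roughly 775 MB and 4.19 GB.

```python
def csr_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Compressed row storage: values + column indices + row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


def ellpack_r_bytes(n_rows, max_nnz_per_row, value_bytes=8, index_bytes=4):
    """ELLPACK-R: every row padded to max_nnz_per_row, plus a row-length array."""
    padded_entries = n_rows * max_nnz_per_row
    return padded_entries * (value_bytes + index_bytes) + n_rows * index_bytes


# Rounded sizes from the text (n ~ 1.4e6, nnz ~ 64.1e6, at most 249 per row).
csr_mb = csr_bytes(1_400_000, 64_100_000) / 1e6
ell_gb = ellpack_r_bytes(1_400_000, 249) / 1e9
```

The padding explains the factor of about five between the two formats: the average row holds far fewer than 249 non-zeros.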

On the host side, we use a task-driven approach for the matrix-vector multiplication with each task computing chunks of equal size. On the GPUs, we store the multiplication vector in texture memory to reduce the latency of the unstructured accesses to this vector. Additionally, we use a Jacobi preconditioner to reduce the number of iterations in the algorithm until convergence. All operations of the algorithm are split row-wise across the available devices into disjoint chunks. Each chunk c contains the row indices \(R_c\) such that \(\bigcup R_c = \{1, \dots , n\}\). We exploit the matrix structure having most non-zeros close to the diagonal by minimizing vector data transfers for the matrix vector multiplication: At the beginning of the algorithm, the minimum and maximum column indices \(t_c^\text {min}, t_c^\text {max}\) of non-zeros for each chunk c of the matrix are computed. Formally,

$$\begin{aligned} t_c^\lambda&= \lambda (\{j \in \{1, \dots , n\} \mid A_{i,j} \ne 0, i \in R_c\}) \text { for }\lambda \in \{\min , \max \}\\ T_c&= \{t_c^\text {min}, t_c^\text {min} + 1, \dots , t_c^\text {max} - 1, t_c^\text {max}\} \setminus R_c \end{aligned}$$

where \(T_c\) defines the set of indices of the vector that needs to be transferred to the device responsible for chunk c.
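The definition of \(T_c\) translates directly into code. The following sketch assumes a matrix in compressed row storage given by the hypothetical arrays `indptr` and `indices`, and it uses 0-based indices instead of the 1-based indices in the formulas.

```python
def transfer_indices(indptr, indices, rows):
    """Return T_c for the chunk whose row indices R_c are given in `rows`.

    indptr, indices: CSR row pointers and column indices (0-based, assumed layout).
    The result lists all vector indices between the smallest and largest column
    index of the chunk's non-zeros that the chunk does not own itself.
    """
    row_set = set(rows)
    cols = [indices[k] for i in row_set for k in range(indptr[i], indptr[i + 1])]
    if not cols:
        return []  # chunk without non-zeros needs no remote vector entries
    t_min, t_max = min(cols), max(cols)
    return [j for j in range(t_min, t_max + 1) if j not in row_set]
```

For a matrix with most non-zeros close to the diagonal, the returned range is short, which is the property the data transfer minimization relies on.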

As our Kepler system does not support direct memory transfers between GPUs, we increase memory throughput by minimizing the transferred vector data between GPU and CPU so that additional main memory overheads are avoided. Thus, for hybrid multi-GPU computations, the first chunk refers to the first GPU, the middle one to the CPU, and the last chunk to the second GPU. Our Pascal system supports NVLink between GPUs and, thus, allows fast inter-GPU memory transfers. Therefore, in future, we will reorder the distribution for that architecture to take advantage of NVLink.

The analytical determination of the chunk sizes is challenging, as they are highly affected by the structure of the matrix and we hide some of the latency for copying the vector by doing it asynchronously to other computations. Thus, to obtain optimal work chunk distribution across devices, we benchmarked different values by running the algorithm with a small number of iterations.
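The benchmarking-based selection can be sketched as a simple sweep over candidate distributions. The callable `measure_runtime` is hypothetical: it stands for one short run of the algorithm with the given work fraction and returns the observed runtime.

```python
def best_split(candidate_fractions, measure_runtime):
    """Pick the work fraction with the lowest measured runtime.

    candidate_fractions: iterable of fractions to try (e.g., the CPU share).
    measure_runtime: hypothetical callable, fraction -> runtime of a short run.
    """
    timings = {f: measure_runtime(f) for f in candidate_fractions}
    best = min(timings, key=timings.get)
    return best, timings[best]
```

As a usage example with a synthetic cost model in which the GPU is four times faster than the CPU, `best_split([0.0, 0.1, 0.2, 0.3, 0.4], lambda f: max(f, (1 - f) / 4))` selects a CPU share of 0.2, where both devices finish at the same time.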

The serial part of this algorithm includes reading the matrix file, conversion of matrix formats, allocation and initialization of vectors, and correctness checking of results. The time for these operations is assumed to have a fixed value of 20 s.

5.2 Neuromagnetic Inverse Problem (NINA)

The second application solves a real-world problem from the field of bio-medicine, namely the neuromagnetic inverse problem (NINA). The algorithm was originally implemented in MATLAB with the three most time-consuming parts computed in C, i. e., an objective function, and its first- and second-order derivatives. For simplicity, we assume a constant runtime of 46 s for the (serial) MATLAB part and imitate the original algorithmic optimization process implemented in MATLAB by executing all kernels one after the other for 1000 times. These three parts involve matrix vector operations and reductions with a mostly dense matrix of dimension \(128 \times 512\,000\). This special matrix form hinders the effective usage of BLAS libraries, so that we had to manually optimize the algorithm.

Our best-effort performance was obtained with one block per row for the dense matrix vector multiplication. Additionally, we avoid delays by immediately starting the reduction kernels (per row) out of the multiplication kernels with dynamic parallelism. To coordinate the GPU computations without interfering with the other CPU computations, we use a dedicated CPU thread.

All operations are split row-wise across the different devices. As the matrix is stored in a dense fashion, the computation of every row takes the same time per device type, resulting in an equal number of rows for each GPU. As for CG, we used benchmarking to determine the number of rows computed by the CPU.
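A minimal sketch of this row distribution: the CPU receives a benchmark-determined number of rows and the remaining rows are divided evenly among the GPUs. The function name is hypothetical, and handing leftover rows to the first GPUs when the division is not exact is an assumption of this sketch.

```python
def split_rows(total_rows, cpu_rows, n_gpus):
    """Return the number of rows per device for a row-wise split."""
    if not 0 <= cpu_rows <= total_rows:
        raise ValueError("cpu_rows must lie between 0 and total_rows")
    gpu_rows = total_rows - cpu_rows
    if n_gpus == 0:
        if gpu_rows != 0:
            raise ValueError("without GPUs the CPU must take all rows")
        return {"cpu": cpu_rows, "gpus": []}
    base, extra = divmod(gpu_rows, n_gpus)
    # ASSUMPTION: leftover rows go to the first `extra` GPUs.
    return {"cpu": cpu_rows, "gpus": [base + (1 if g < extra else 0) for g in range(n_gpus)]}
```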

6 Productivity Results

We interpret our results with respect to typical questions for the operation of GPU nodes. Our runtime and power measurements are shown in Fig. 1, and productivity and programming effort in Fig. 2. While runtimes generally improved when going from the Kepler to the Pascal system (without further tuning), heterogeneous computations involving more than one device do not perform well on Pascal. We assume that not exploiting the memory bandwidth offered by NVLink is one reason for that. Remember that presented results refer to of investment. Here, a potential budget increase does not cause any changes (saturation). If we decrease the budget, the results only change slightly.

Fig. 1. Parallel runtime and power consumption: CG (left), NINA (right)

Fig. 2. Programming effort and productivity: CG (left), NINA (right)

Fig. 3. Detailed comparison of configurations

6.1 Cost of Idling Hardware

An interesting question for HPC centers procuring or operating GPU nodes is the cost or penalty if not all available devices are fully used by developers. For this investigation, we take as reference the hardware setup containing 2 GPUs and 2 CPU sockets – which is the default one at RWTH’s compute cluster – and the execution concept using all of them, i. e., the configuration (2,2).

First, we compare the default configuration to (cf. Fig. 3a). The idling GPUs decrease the performance significantly (up to 500 % with NINA on the Pascal system) but only reduce the power consumption by 10 % to 30 %. Hence, overall productivity is decreased with 2 idling GPUs by \({\sim } 15\,\%\) with CG and \({\sim } 40\,\%\) with NINA. With the same execution model exploiting only CPUs, but without any available GPUs in that node (configuration (0,2)), the productivity obviously increases again compared to by about \(\frac{3}{4}\) on Kepler and even \({\sim } 230\,\%\) on Pascal (which is mainly due to omitting the GPU purchase costs).

On the other hand, if both GPUs are used and the CPU sockets are idling (configuration ) (cf. Fig. 3b), the productivity is hardly affected (changes are below 3 %). This is because the runtime increases by at most one fourth, which is compensated by a lower power consumption by about the same factor.

6.2 Multiple (Heterogeneous) Devices

Next, we examine whether extra effort invested into enabling heterogeneous computing with more than one device pays off by additional productivity.

The sheer benefit of exploiting 2 GPUs per node can be investigated by comparison to the corresponding single-GPU setup – both with idling CPUs, i. e., vs. (cf. Fig. 3c). Surprisingly, we observe a productivity decrease with 2 GPUs by \({\sim } 20\,\%\) on the Kepler system, and even 40 % on Pascal. Detailed examination shows that the (low) improvement in runtime (\({\sim } 35\,\%\) on Kepler and \({\sim } 10\,\%\) on Pascal) cannot compensate for the increase in power consumption (around one fourth), programming effort, and purchase costs. While we expect better runtimes on Pascal when leveraging NVLink, we will not be able to reduce the runtime sufficiently to improve productivity due to the high serial runtime: e. g., if the assumption holds that 2 GPUs could halve the parallel runtime, the productivity decrease is still \({\sim } 15\,\%\) on Kepler and \({\sim } 35\,\%\) on Pascal.
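The limiting effect of the serial runtime can be illustrated with a small calculation. The function below is a sketch with hypothetical names, and the parallel runtime in the usage example is a placeholder, not a measured value.

```python
def overall_speedup(t_serial, t_parallel, parallel_speedup):
    """Speedup of the total runtime when only the parallel part gets faster."""
    before = t_serial + t_parallel
    after = t_serial + t_parallel / parallel_speedup
    return before / after
```

For example, with a serial part of 20 s and an equally long parallel part (placeholder), halving the parallel runtime gives `overall_speedup(20.0, 20.0, 2.0)`, i.e., 40 s / 30 s, so the total runtime improves by only one third while purchase costs and power consumption grow with the second GPU.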

As seen in the previous subsection, the productivity does not change much when adding 2 fully-utilized CPU sockets to 2 GPUs. A similar effect is evident when adding one fully-utilized CPU socket to one GPU, i. e., vs. (1,1) (cf. Fig. 3d): The productivity slightly increases on Kepler (by \({\sim } 5\,\%\)) and remains about the same on Pascal. To evaluate the worth of buying a two-socket (single-GPU) node vs. a one-socket (single-GPU) node, we compare the previous configuration (1,1) to (1,2) where both sockets are utilized (cf. Fig. 3e). Here, we see a productivity decrease by 13 % to 23 %, which is mainly due to the small runtime improvement (1 % to 4 %) compared to the higher power consumption (around 30 % on Kepler and 20 % on Pascal) and higher purchase cost.

6.3 Sharing GPU Nodes

The previous results lead us to the question whether we can increase productivity by sharing a single node containing 2 GPUs and 2 CPU sockets across multiple (simultaneous) program executions using disjoint devices (potentially) by multiple users. One solution for sharing nodes could be implemented based on the job scheduler’s resource management capabilities for GPUs. We imitate this solution by running 2 programs in parallel on one node, each using one CPU socket and one GPU, i. e., configuration (\(\frac{1,1}{1,1}\)), and additionally assume further one-time costs for the administrative adaptation of the batch scheduler.

On Kepler, this configuration delivers the highest productivity, as the runtime is about the same as for configuration (1,1), which is the configuration with the second highest productivity, but with two simultaneous program executions (cf. Fig. 3f). In return, the power consumption increases by only \({\sim } 70\,\%\), so the productivity increases by \({\sim } 20\,\%\). On Pascal, the configuration (0,2) achieves the highest productivity, i. e., buying and utilizing GPUs at all does not seem beneficial, with the reservation that the codes have not been tuned for Pascal GPUs yet. However, the productivity of the sharing approach (\(\frac{1,1}{1,1}\)) is only 10 % lower with NINA, whereas about with CG (cf. Fig. 3g). The reason is the small runtime improvement (\({\sim } 80\,\%\) with NINA, \({\sim } 60\,\%\) with CG) compared to the much higher power consumption (85 % or 55 %, respectively) and purchase costs. With further tuning for Pascal, we can probably reach a higher productivity with the shared approach. Note that the effort needed for the adaptation of the job scheduler is assumed to be low. More complex virtualization approaches will yield much higher one-time costs. Nevertheless, on Kepler, the sharing approach would still pay off if the administrative effort theoretically increased up to 130 person-days.

7 Conclusion

Concluding our productivity results, we give recommendations for hardware procurement choices and GPU system operations for HPC clusters. For this, we assume that at least one GPU per node should be available and that all cluster nodes have the same hardware setup. We base our suggestions on the case studies investigated, i. e., a CG solver and the real-world NINA application.

Since productivity decreases when using heterogeneous hardware setups, we recommend buying only minimal nodes containing one GPU and one CPU socket. Furthermore, productivity is hardly affected by utilizing the CPU instead of letting it idle. Hence, it can be left to the programmers to decide whether they utilize the CPU or not. Another approach is to purchase nodes with two GPUs and two CPU sockets and to allow two programs from different users to exploit distinct devices on the node (e.g., by job scheduler resource management). In this way, even higher productivity can be achieved as long as the additional one-time administrative effort to implement this does not dominate.

In future, we will evaluate the productivity after tuning the applications for the Pascal architecture. Early results show a performance improvement of \({\sim } 19\,\%\) for CG utilizing NVLink in configuration (2, 0). Additionally, we will analyze more applications, e. g., with lower serial fractions to achieve higher total speedups. Furthermore, we plan to lift the analysis to applications running across multiple nodes, i. e., with MPI+OpenMP+CUDA.