1 Introduction

The solution of large sparse linear systems is central to many problems in science, engineering, and optimization. Direct methods, for which the major computational task is factoring the coefficient matrix, are often the solution method of choice due to their generality and robustness. Performance optimization, and particularly parallelization, of sparse factorization has been the subject of intensive research.

A growing portion of the computational capability of supercomputer clusters is now provided by accelerators, particularly GPUs [22]. The characteristics of these distributed heterogeneous systems, most notably the separate CPU and GPU memories and the limited bandwidth of the PCIe bus over which they communicate, make leveraging their computational power a challenge for irregular algorithms such as sparse factorization. As a result, there is a need to adapt the implementation of sparse factorization algorithms to such heterogeneous architectures, and there is active work in this area [6, 14, 19, 21, 23].

Previous work on GPU acceleration of sparse factorization has taken the approach of offloading dense math to the GPU while making accommodations for:

  • only sending appropriately sized work to the GPU,

  • asynchronous operations to overlap CPU computation, GPU computation and PCIe communication,

  • minimizing communication by re-using data on the GPU as much as possible.

Most previous work has been limited to single-node, shared-memory systems, with the notable exception of [20], which showed performance on 8 nodes.

This work seeks GPU acceleration of sparse factorization on large heterogeneous, distributed systems. The current scope is limited to Cholesky factorization (\(A = LL^T\)) [8] as that is the simpler case, but we expect many of the techniques to be applicable to other factorization algorithms as well. While we leverage the techniques itemized immediately above, this implementation is novel in that it:

  • accelerates the factorization in the Watson Sparse Matrix Package (WSMP) [9], which shows leading performance on distributed systems [12, 13, 18],

  • uses a minimally-invasive approach promoting maintainability and portability of the underlying CPU code,

  • demonstrates improved performance that scales to an order of magnitude more nodes and cores than previous work, and

  • identifies techniques for achieving further performance improvements.

We describe the implementation in detail, present experimental results, analyze the results to gain insights about performance bottlenecks, and suggest concrete and feasible avenues for further performance improvement. We show that the use of GPUs can more than double the performance of WSMP’s sparse Cholesky factorization on up to 128 nodes of NCSA’s Blue Waters supercomputer [3] and can strong scale beneficially up to 512 nodes.

2 WSMP Cholesky Factorization

WSMP uses a highly scalable distributed-memory parallel sparse Cholesky factorization algorithm [11] based on the multifrontal method [5, 16]. A multifrontal algorithm expresses the entire sparse matrix factorization in terms of partial factorizations of smaller dense matrices called frontal matrices of the type illustrated in Fig. 1. The rows and columns of the sparse coefficient matrix A are divided into contiguous blocks called supernodes [2] to aid the construction of the dense frontal matrices. The control and data flow in a typical multifrontal algorithm follows a dependency graph known as the elimination tree [15], which can be computed inexpensively from the structure of a symmetric sparse matrix. Each vertex of the elimination tree corresponds to a supernode.

Fig. 1. Computation in the frontal matrix \(F^i\) of a typical supernode in sparse multifrontal Cholesky factorization.

In WSMP’s distributed-memory parallel factorization utilizing p MPI processes, the supernodal tree is binary in the top \(\log _2 p\) levels. The portions of this binary supernodal tree are assigned to the processes using a subtree-to-subcube strategy [17] illustrated in the top two levels of the tree in Fig. 2. The frontal matrix of each supernode is distributed among a logical mesh of a subset of processes using a bit-mask based block-cyclic scheme [11]. The parallel partial factorization operation at each supernode is a pipelined implementation of dense block Cholesky factorization and update operations shown in Fig. 1. Most of the flops corresponding to a supernode are performed by Basic Linear Algebra Subprograms (BLAS) [4]. The three key operations are: (1) dense Cholesky on the top left matrix S, (2) the DTRSM operation \(T = S^{-1}T\), and (3) the DSYRK operation \(U^i = U^i - TT^T\). Hereafter, we will refer to a factorization task like this performed on multiple nodes as cooperative factorization.
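
For concreteness, the three key operations above can be sketched for a single frontal matrix as follows. This is an illustrative shared-memory sketch using the C interfaces to LAPACK and BLAS, not WSMP's distributed, pipelined implementation; the storage layout, the orientation of T, and the routine name are our assumptions.

#include <cblas.h>
#include <lapacke.h>

/* Illustrative partial factorization of one frontal matrix F^i (Fig. 1).
 * Column-major storage is assumed, with S (k x k) the fully summed block,
 * T (m x k) the panel below it, and U (m x m) the update (contribution)
 * block that is later assembled into the parent's frontal matrix. */
static void partial_factor(int k, int m,
                           double *S, int ldS,
                           double *T, int ldT,
                           double *U, int ldU)
{
    /* (1) Dense Cholesky of the fully summed block: S = L * L^T */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', k, S, ldS);

    /* (2) The DTRSM step: solve the panel against the Cholesky factor of S */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, m, k, 1.0, S, ldS, T, ldT);

    /* (3) The DSYRK step: U <- U - T * T^T */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                m, k, -1.0, T, ldT, 1.0, U, ldU);
}

In WSMP, these operations are not issued monolithically as above but are pipelined in blocks across the process mesh that holds the frontal matrix.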

Usually, each MPI process is multithreaded, and the portion of the multifrontal factorization assigned to each process is further parallelized [10]. Just as in the message-passing portion, tasks at each subroot of the elimination tree are assigned to independent groups of threads until each thread ends up with its own subtree. However, the mapping between tasks and threads is more flexible than the mapping between tasks and MPI processes. Furthermore, a strict block-cyclic mapping based on the binary representation of the indices is not used, because all processors can access all rows and columns with relatively small overhead. Hereafter, we will refer to factorization tasks performed on a single MPI rank as individual factorization.

Fig. 2. A mapping of a hypothetical elimination tree on a 4-node cluster with 4-way shared-memory nodes.

Figure 2 shows one hypothetical mapping of an elimination tree among the 16 CPUs of a 4-node cluster with 4-way shared-memory parallel nodes. The symbolic computation preceding the numerical phase ensures that the tree is binary in the top \(\log _2 p\) levels. At each of these levels, multithreaded BLAS calls are used to utilize all the CPUs on each node during cooperative factorization. In the individual factorization region of the tree, which lies below the top \(\log _2 p\) levels, the subtrees are assigned to groups of threads until they are mapped onto single threads.

3 Minimally Invasive Approach

When factorization involves sufficiently large dense blocks, such that the computation can be tiled and the communication costs hidden behind computation, satisfactory acceleration can be achieved by offloading large level-3 BLAS computations to the GPU [21]. However, when dense blocks are fewer or smaller (as might occur in shell models), or at scale (when the dense blocks are distributed across many nodes), communication cannot be hidden and the opportunity for easy GPU acceleration is limited. Further, for best PCIe performance, and to support transfers that are asynchronous or concurrent with GPU computation, the data on the host must reside in ‘pinned’ (page-locked) memory. Copying data into or out of pinned memory buffers adds to the communication overhead. This BLAS-offloading approach was investigated for use with WSMP and, as expected, did not yield significant benefit.

However, as WSMP is inherently multithreaded and performs multiple BLAS operations simultaneously in different threads, the same idea behind tiling a single large BLAS operation can be used to hide communication among many simultaneous small BLAS operations. That is, computation for a BLAS operation in one thread can proceed simultaneously with PCIe communication for a BLAS operation in another thread and with the host memory copies (into or out of pinned buffers) of a third. This parallel offload approach did not involve any changes to the base WSMP code and permitted GPU acceleration of the numerical factorization by a factor of 1.7 on average [7] on a single multicore node. Unfortunately, at scale, to promote load balancing, WSMP decomposes BLAS calls into smaller blocks than can be accelerated in this fashion.
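
The following sketch illustrates this overlap mechanism, assuming each host thread owns its own CUDA stream, cuBLAS handle, pinned staging buffer, and device workspace; the structure and function names are illustrative, not the actual WSMP code.

#include <cstring>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Per-thread offload context: while one thread's DGEMM executes on the GPU,
// another thread's PCIe transfer and a third thread's pinned-buffer packing
// can proceed concurrently.
struct OffloadCtx {
    cudaStream_t   stream;
    cublasHandle_t handle;
    double        *h_pinned;   // page-locked host staging buffer (cudaMallocHost)
    double        *d_buf;      // device workspace (cudaMalloc)
};

// C <- C - A * B^T, offloaded through this thread's stream.
void offload_dgemm(OffloadCtx &ctx, int m, int n, int k,
                   const double *A, const double *B, double *C)
{
    size_t szA = (size_t)m * k, szB = (size_t)k * n, szC = (size_t)m * n;
    double *hA = ctx.h_pinned, *hB = hA + szA, *hC = hB + szB;
    double *dA = ctx.d_buf,    *dB = dA + szA, *dC = dB + szB;
    const double one = 1.0, minus_one = -1.0;

    // Pack operands into the pinned buffer (asynchronous PCIe copies require
    // page-locked memory), then issue the transfer on this thread's stream.
    memcpy(hA, A, szA * sizeof(double));
    memcpy(hB, B, szB * sizeof(double));
    memcpy(hC, C, szC * sizeof(double));
    cudaMemcpyAsync(dA, hA, (szA + szB + szC) * sizeof(double),
                    cudaMemcpyHostToDevice, ctx.stream);

    cublasSetStream(ctx.handle, ctx.stream);
    cublasDgemm(ctx.handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                &minus_one, dA, m, dB, n, &one, dC, m);

    cudaMemcpyAsync(hC, dC, szC * sizeof(double), cudaMemcpyDeviceToHost, ctx.stream);
    cudaStreamSynchronize(ctx.stream);
    memcpy(C, hC, szC * sizeof(double));
}

With one such context per thread, the copies issued by one thread overlap with the kernels already queued on other threads' streams, which is what hides the PCIe traffic.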

3.1 Dual Library Solution

Pursuing further optimization, it is clear from Fig. 1 that DTRSM is typically followed by DSYRK, which takes as input the output of DTRSM. As a result, a considerable amount of unnecessary, but expensive, communication remains in transferring the submatrix T into and out of GPU memory. It is relatively simple to construct a library routine that performs both DTRSM and DSYRK on the GPU and eliminates the intervening communication. However, such an optimization would require high-level code changes to WSMP. Consequently, a scheme was devised that would not only help overcome the performance constraints at scale, but would also prove to be an elegant way of enabling a large, complex existing software package like WSMP to effectively harness the power of accelerators. This was implemented as a library (the ‘dual’ library, which we will call ACCEL_WSMP) containing all of the GPU-specific code, accessed via high-level (BLAS or above) routines. Only small changes would need to be made to WSMP to call these routines, if available, at a high level. In this way, WSMP could be linked with a driver code and, if ACCEL_WSMP were not provided, its CPU-only behavior would be unchanged. However, if ACCEL_WSMP were provided during linking, then GPU acceleration would be obtained. In this sense, we consider the GPU acceleration to be ‘minimally invasive’.
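
One way to realize this kind of optional linking, sketched below for ELF platforms, is to give WSMP a weak default definition of each high-level accelerator routine that simply invokes the existing CPU path; a strong definition of the same symbol in the accelerator library, when linked, overrides it. The symbol name and signature here are hypothetical, not the actual WSMP/ACCEL_WSMP interface.

#include <stdio.h>

/* Pre-existing CPU code path inside WSMP (stubbed for illustration). */
static void cpu_dtrsm_then_dsyrk(int k, int m)
{
    printf("CPU-only path: k=%d, m=%d\n", k, m);
}

/* WSMP side: a weak default keeps the CPU-only behavior unchanged.  If an
 * accelerator library providing a strong definition of accel_dtrsmsyrk()
 * is present at link time, that definition wins and the call is routed
 * to the GPU instead. */
#pragma weak accel_dtrsmsyrk
void accel_dtrsmsyrk(int k, int m)
{
    cpu_dtrsm_then_dsyrk(k, m);
}

int main(void)
{
    /* The call site in WSMP is identical whether or not the accelerator
     * library was linked. */
    accel_dtrsmsyrk(1024, 3072);
    return 0;
}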

In addition to the acceleration techniques described so far, additional optimizations were to:

  1. limit the minimum tile size into which the problem can be decomposed to a size that still permits acceleration on the GPU,

  2. create a high-level API providing parallel, tiled GPU BLAS operations and combined GPU BLAS operations to minimize communication, and

  3. define a protocol for flagging matrices on the GPU for re-use (and later indicating when they can be deleted) to further minimize communication.

ACCEL_WSMP consists of a total of 10 functions for performing computations on dense matrices and/or moving them between host and device memories, and two administrative functions, including one-time initialization that performs the relatively expensive task of allocating pinned buffers. Note that all computational routines in ACCEL_WSMP have a CPU equivalent (pre-existing in WSMP). If the GPU is occupied (by other WSMP threads) or the operation is too small, then the computation remains on the CPU.
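
The individual routines of ACCEL_WSMP are not listed here; the header below is a purely hypothetical illustration of what such an interface could look like, reflecting the three optimizations above. All names, signatures, and argument lists are our assumptions, not the actual API.

/* accel_wsmp_sketch.h -- HYPOTHETICAL interface in the spirit of ACCEL_WSMP;
 * the actual routine names and signatures are not published here and these
 * are purely illustrative. */
#ifndef ACCEL_WSMP_SKETCH_H
#define ACCEL_WSMP_SKETCH_H

#include <stddef.h>

/* Administrative routines: one-time initialization (allocates the pinned
 * host buffers, CUDA streams, and cuBLAS handles) and final cleanup. */
int  accel_init(size_t pinned_bytes_per_thread, int n_threads);
void accel_finalize(void);

/* Combined, tiled GPU BLAS operations.  Each routine falls back to the
 * pre-existing CPU code when the GPU is occupied by another thread or the
 * operation is below the profitability cutoff (about k = 384 in Fig. 3). */
void accel_dtrsmsyrk(int k, int m, const double *S, int ldS,
                     double *T, int ldT, double *U, int ldU);
void accel_dgemm(int m, int n, int k, const double *A, int ldA,
                 const double *B, int ldB, double *C, int ldC);

/* Re-use protocol: WSMP flags a matrix that will be read by many subsequent
 * DGEMM/DSYRK calls; ACCEL_WSMP tries to keep it resident in GPU memory
 * until it is explicitly released. */
int  accel_flag_for_reuse(const double *host_matrix, size_t bytes);
void accel_release(int reuse_handle);

#endif /* ACCEL_WSMP_SKETCH_H */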

3.2 Individual Factorization

In the initial individual factorization phase of WSMP’s sparse Cholesky factorization, the computationally intensive portion consists of a DTRSM followed by a symmetric rank-k update (DSYRK), with the DTRSM output used as input to DSYRK.

Fig. 3. Performance of DTRSMSYRK with and without caching of DTRSM output on the GPU.

In Fig. 3, we show the benefit of caching the output matrix T from DTRSM for subsequent use in DSYRK. The figure shows the performance of the DTRSMSYRK routine on a single node of our target cluster (see Sect. 4) as a function of the dimension k of the triangular matrix S. The number of rows of matrix T is set to an empirically derived value of 3k. For comparison, the performance achieved when the DTRSM output is not cached in GPU memory (and is therefore reloaded for DSYRK) is also shown. Caching improves DTRSMSYRK performance by about 30 % over individual uncached DTRSM and DSYRK. GPU performance increases with k, with a slight decrease at \(k = 2048\) due to a switch to the tiled version of the algorithm. GPU performance overtakes the highest possible CPU performance at about \(k=384\), which is the cutoff that we use to determine whether the DTRSMSYRK operation is performed on the CPU or on the GPU.
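
A minimal cuBLAS sketch of the cached DTRSMSYRK kernel is shown below; the device buffers are assumed to have been populated by asynchronous uploads, the \(k \ge 384\) cutoff and tiling for large fronts are handled outside this fragment, and all names are illustrative.

#include <cublas_v2.h>

/* DTRSM followed by DSYRK with the panel T kept resident on the GPU between
 * the two calls, so that T crosses the PCIe bus once instead of three times.
 * dS is the k x k Cholesky factor of S, dT is the m x k panel, and dU is the
 * m x m update block; all three already reside in device memory. */
void dtrsmsyrk_cached(cublasHandle_t handle, int k, int m,
                      const double *dS, double *dT, double *dU)
{
    const double one = 1.0, minus_one = -1.0;

    /* The DTRSM step of Fig. 1, applied to the device-resident panel. */
    cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                m, k, &one, dS, k, dT, m);

    /* The DSYRK step, U <- U - T * T^T, reusing dT without re-uploading it. */
    cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                m, k, &minus_one, dT, m, &one, dU, m);
}

The uncached variant would instead download T after the DTRSM and upload it again for the DSYRK, which accounts for the roughly 30 % difference visible in Fig. 3.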

3.3 Cooperative Factorization

In the cooperative phase of WSMP’s Cholesky factorization, the dense Cholesky panel factorization illustrated in Fig. 1 is performed with S, T, and U distributed among multiple nodes. The dense level-3 BLAS calls in this phase are DTRSM, DSYRK and matrix-matrix multiply (DGEMM). Generally, the caching strategy described in Sect. 3.2 is used to hide PCIe communication.

There are two additional challenges in this phase. First, unlike in DTRSMSYRK, where the DTRSM output could be reused immediately as the DSYRK input, during cooperative factorization this reuse is spread over multiple DGEMM calls. To expose this data reuse, WSMP signals when a particular matrix is likely to be used across many DGEMM and DSYRK calls. Such flagging of reusable data is a key part of the interface between the WSMP and ACCEL_WSMP libraries. ACCEL_WSMP attempts to retain flagged matrices until WSMP signals that a previously flagged matrix is no longer needed.
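
Building on the hypothetical interface sketched in Sect. 3.1, the reuse pattern during cooperative factorization would look roughly as follows; this is illustrative only, and the blocking and helper arrays are placeholders.

#include "accel_wsmp_sketch.h"

/* Sketch of the re-use protocol: the panel T is flagged once, read by a
 * whole pipeline of DGEMM updates without being re-uploaded, and released
 * when WSMP signals that it is no longer needed. */
void update_pipeline(int m, int k, double *T,
                     int n_blocks, double **B, double **U, const int *nb)
{
    int hT = accel_flag_for_reuse(T, (size_t)m * k * sizeof(double));

    for (int b = 0; b < n_blocks; ++b)
        /* U_b <- U_b - T * B_b^T; ACCEL_WSMP keeps T resident on the GPU. */
        accel_dgemm(m, nb[b], k, T, m, B[b], nb[b], U[b], m);

    accel_release(hT);   /* the device copy of T may now be evicted */
}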

The second challenge is related to load balancing. In the cooperative factorization phase, WSMP distributes frontal matrices in a block-cyclic fashion among the participating nodes in order to facilitate pipelined, load-balanced dense panel factorization. The optimal block size needs to be large enough to take advantage of level-3 BLAS operations, yet small enough not to cause excessive load imbalance in the pipeline. Typical block sizes used on the CPU are 64 or 128. This block size is the inner dimension k of the DSYRK and DGEMM calls issued during cooperative factorization. When GPU acceleration is used, since data transfer must be hidden behind computation, the achieved performance is directly proportional to the inner dimension. As shown in Fig. 3, GPU performance lags CPU performance for inner dimensions smaller than about 384. Therefore, a block size of 512 is used in order to achieve significant GPU acceleration, which increases load imbalance and negatively impacts scalability. The trade-offs involved in the selection of the block size are discussed in more detail in Sects. 5 and 6.

4 Testing Configuration

The matrices used in this study are extracted via a DMAP procedure [1] from the commercial FEA software NX Nastran (2012) from Siemens PLM. They represent industrial CAD geometries and are automatically meshed with 10-noded tetrahedral elements. Figure 4 shows the geometries used for this paper: a symmetric machine-part cutter with asymmetrical loads, and a header part of a Charge Air Cooler (CAC) with complex geometry, whose elements have higher aspect ratios and therefore a higher condition number. For the cutter model, element size control was used to produce different levels of automatic mesh refinement. Table 1 summarizes the test matrices.

Fig. 4. Finite element discretization of the cutter model (left) used for matrices M2, M6, and M20, and the Charge Air Cooler (CAC) model used for M11 (right).

Table 1. Test matrices

The sustained petascale Blue Waters [3] system, hosted at the University of Illinois National Center for Supercomputing Applications (NCSA), was used in our experiments. We used the accelerated XK7 nodes of Blue Waters, each containing a single AMD Interlagos processor with 8 floating-point cores (32 GB of node memory) and a single NVIDIA Kepler K20X GPU (6 GB of GPU memory). The Intel compiler version 2015 was used for compilation, and sequential Intel MKL for BLAS routines on the host. The CUDA runtime and cuBLAS libraries from the NVIDIA CUDA Toolkit version 5.5 were used for operations on the GPU. All runs used a single MPI rank per node and utilized all 8 cores on each node by using 8 threads per MPI rank.

Table 2. Numerical factorization times (seconds) of the default CPU code, CPU code with block size of 512, and GPU-enabled code with block size of 512.

5 Scaling and Acceleration Results

The numerical factorization performance for each of the four matrices has been benchmarked against the number of XK7 nodes used, spanning from the minimum number of nodes necessary to accommodate the problem to where the GPU-accelerated performance drops below the CPU-only performance. Parallel speed-up is defined as the ratio of the wall clock time on one cluster node to the wall clock time on p nodes. However, since most matrices do not fit on one node, the speed-up is computed with respect to the performance on the minimum number of nodes that a problem fits on.

Figure 5 shows the parallel speed-up obtained both on the CPU and the GPU for the numerical factorization of the four test matrices. Solid blue and green lines show the CPU and GPU results, respectively. Plain dashed lines of the corresponding colors show ideal speed-up from the starting point of each curve. The actual Cholesky factorization times on which these speed-up curves are based are shown in Table 2.

Fig. 5. Numerical factorization speed-ups vs. number of XK7 nodes for the four matrices studied. (Color figure online)

In Sect. 3.3, it was noted that a block size of 512 was used in the cooperative factorization phase of the GPU runs. In CPU-only runs, WSMP automatically chooses the block size, which is typically 64 or 128. To estimate the effect of the increased block size on load imbalance, and hence on performance, the CPU runs were also performed using a block size of 512; the results are shown as black lines in Fig. 5. In all cases, the performance penalty due to the larger block size grows as the number of nodes increases. This happens because, in wider runs, more levels of the elimination tree are cooperatively factored, and the number of XK7 nodes among which frontal matrices of a given size are distributed increases. This creates a longer pipeline of dense operations on smaller submatrices for a given frontal matrix size. A large block size results in fewer stages in the pipeline, and therefore greater load imbalance. Figure 5 also shows the projected performance (dotted red line) that could be achieved through the further optimizations discussed in Sect. 6.

In all four cases, at smaller node counts, the GPU-accelerated version shows significant performance gains (2.5–3.5x) over the CPU-only case. These gains become smaller for wider runs.

Figure 6 shows the acceleration achieved by the GPU-enabled library over the CPU-only version for the cutter and CAC matrices at different node counts. This plot directly demonstrates the benefit of GPU acceleration.

Fig. 6. Acceleration achieved by the GPU-enabled library over the CPU-only factorization vs. number of XK7 nodes.

6 Performance Analysis

The results in Sect. 5 show significant GPU acceleration with respect to the CPU-only performance for complex problems on hundreds of nodes. In this section, we study the issues limiting further GPU acceleration and show that one of the key causes can be remedied to a large extent by a simple enhancement to our implementation.

We have already touched upon the need for a large block size of 512 in the GPU implementation and the resulting load imbalance in cooperative factorization. Splitting the submatrix T into column blocks results in the same blocks of U being updated multiple times. Therefore, if ACCEL_WSMP can retain the result matrix of DGEMM in GPU memory, then the result matrix, U, need be transferred back to the CPU only once, after the last DGEMM. In this way, the total amount of PCIe traffic would become independent of the block size; in fact, it would be the same as in the unblocked case, permitting effective use of smaller block sizes and improving both performance and load balancing. The current implementation of ACCEL_WSMP includes the capability to cache the input matrices of DGEMM, but not the output.
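
A sketch of this proposed enhancement is shown below: U stays resident on the device while the column blocks of T stream in, and it is downloaded only once after the last update. This is not the current ACCEL_WSMP behavior; it is shown for the symmetric (DSYRK) case for brevity, the DGEMM case being analogous, and the function and buffer names are illustrative.

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Proposed output caching: U <- U - sum_b T_b * T_b^T, with U held in GPU
 * memory across all column blocks T_b of width kb.  PCIe traffic for U then
 * no longer depends on the number of column blocks, i.e. on the block size. */
void syrk_with_cached_U(cublasHandle_t handle, cudaStream_t stream,
                        int m, int kb, int n_blocks,
                        double *const *hT_blocks, double *hU,
                        double *dT, double *dU)
{
    const double one = 1.0, minus_one = -1.0;
    cublasSetStream(handle, stream);

    /* Upload U once.  (hU and hT_blocks are assumed to be pinned buffers.) */
    cudaMemcpyAsync(dU, hU, sizeof(double) * m * m,
                    cudaMemcpyHostToDevice, stream);

    for (int b = 0; b < n_blocks; ++b) {
        /* Stream in one column block of T and apply its update; double-
         * buffering dT would additionally overlap these uploads with the
         * DSYRK kernels. */
        cudaMemcpyAsync(dT, hT_blocks[b], sizeof(double) * m * kb,
                        cudaMemcpyHostToDevice, stream);
        cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                    m, kb, &minus_one, dT, m, &one, dU, m);
    }

    /* Single download of the fully updated U, after the last update. */
    cudaMemcpyAsync(hU, dU, sizeof(double) * m * m,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
}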

To estimate the effect of caching U, we first calculated the time lost on the CPU due to the added load imbalance when running with a block size of 512. This is the difference between the CPU-512 and CPU-default times (Table 2). We assume that this overhead is also reflected in the GPU-512 time, but scaled by a factor related to the GPU acceleration of the imbalanced computation. We used the ratio of the CPU-512 to GPU-512 time on the smallest number of nodes that a matrix ran on to derive a conservative estimate of this acceleration factor. The time lost to load imbalance in the CPU-512 case is then divided by this factor and subtracted from the GPU-512 time to obtain the GPU-projected time and speed-up. The result is shown by the dotted red lines (GPU-projected) in Fig. 5.
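
Expressed as a formula, with the times taken from the corresponding columns of Table 2 and the acceleration factor \(\alpha\) measured at the smallest node count on which the matrix runs, the projection is
\[
t_{\mathrm{GPU\text{-}projected}} \;=\; t_{\mathrm{GPU\text{-}512}} \;-\; \frac{t_{\mathrm{CPU\text{-}512}} - t_{\mathrm{CPU\text{-}default}}}{\alpha},
\qquad
\alpha \;=\; \left.\frac{t_{\mathrm{CPU\text{-}512}}}{t_{\mathrm{GPU\text{-}512}}}\right|_{\text{smallest node count}} .
\]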

7 Conclusions

Sparse factorization is difficult to accelerate on heterogeneous clusters due to the irregularity of its computation and memory access patterns and the limited bandwidth of communication between the GPU and the CPU. These issues come in addition to the traditional cluster hurdles of problem decomposition and MPI communication.

In this work, it has been shown that with remarkably minor modifications to the original CPU code, the numerical factorization performance of WSMP can be accelerated on a large number of GPU-enabled nodes of a supercomputer. Our approach relied upon identifying computations that could be beneficially accelerated, defining an accelerator API at an appropriate level of abstraction to capture these computations, and designing a separate library to implement the API. Benchmarking on industrially derived matrices shows that on 128 nodes, the GPU-accelerated code is about 2 times faster than the CPU-only code on the same number of XK7 nodes. A higher degree of acceleration, up to 3.5x, is observed at lower node counts. Due to load-balancing issues and the difficulty of maintaining sufficient computational intensity to hide CPU\(\leftrightarrow \)GPU transfers, the observed speed-up is reduced at larger node counts. Caching U on the GPU has been identified as a way to recover a significant portion of the performance lost on large numbers of nodes.

The observed scaling represents a unique demonstration that GPUs can be effectively applied to sparse factorization on large clusters and has the potential to permit application of these GPU-accelerated clusters to new types of analysis.

A distinctive feature of our approach is that performance portability across different accelerator platforms can be achieved by simply tuning or reimplementing the well-contained accelerator library.