Solution to PDEs using radial basis function finite-differences (RBF-FD) on multiple GPUs
Introduction
Numerical methods based on radial basis functions (RBFs) are rapidly gaining popularity for the solution of partial differential equations (PDEs). With a history extending back four decades for RBF interpolation schemes [30], and two decades for RBFs applied to solving PDEs [33], many avenues of research within their realm remain unexplored. Being meshless, RBF methods excel at solving problems that require geometric flexibility with scattered node layouts in n-dimensional space, and they extend naturally into higher dimensions without a significant increase in programming complexity [18], [57]. In addition to accuracy and convergence competitive with other state-of-the-art methods [15], [16], [18], [19], [57], they also boast stability for large time steps.
Examples of infinitely smooth RBFs in 2D space are shown in Fig. 1 (good references on non-smooth and compactly supported RBFs, which are not considered in this paper due to their lower order of convergence, are [6], [55]). RBF methods are based on a superposition of translates of these radially symmetric functions, providing a linearly independent but non-orthogonal basis used to interpolate between nodes in n-dimensional space. An example of RBF interpolation in 2D using 15 Gaussians is shown in Fig. 2, where φ(‖x − x_j‖) is an RBF centered at the node x_j. The radial coordinate is r = ‖x − x_j‖.
Infinitely smooth RBFs depend on a shape (or support) parameter ε that controls the width of the function, so that the functional form becomes φ(εr). Decreasing ε increases the support of the RBF and, in most cases, the accuracy of the interpolation, but worsens the conditioning of the RBF interpolation problem [44]. Fortunately, recent algorithms such as Contour-Padé [26] and RBF-QR [22], [24] allow for numerically stable computation of interpolants in the nearly flat RBF regime (i.e., ε → 0), where high accuracy has been observed [25], [37].
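This accuracy/conditioning trade-off is easy to observe numerically. The sketch below (our illustration, not the paper's code; the node set and test function are invented) builds the Gaussian RBF collocation matrix for a handful of scattered 2D nodes and reports its condition number as ε shrinks:

```python
import numpy as np

# A minimal sketch: Gaussian RBF interpolation in 2D, illustrating how
# decreasing the shape parameter eps (flatter basis functions) worsens
# the conditioning of the collocation system.

def rbf_interp_matrix(nodes, eps):
    """Gaussian RBF collocation matrix A_ij = exp(-(eps*r_ij)^2)."""
    diff = nodes[:, None, :] - nodes[None, :, :]
    r = np.linalg.norm(diff, axis=-1)
    return np.exp(-(eps * r) ** 2)

rng = np.random.default_rng(0)
nodes = rng.random((15, 2))            # 15 scattered nodes, as in Fig. 2
f = np.sin(np.pi * nodes[:, 0]) * np.cos(np.pi * nodes[:, 1])

for eps in (5.0, 1.0, 0.2):            # flatter RBFs as eps decreases
    A = rbf_interp_matrix(nodes, eps)
    lam = np.linalg.solve(A, f)        # expansion coefficients lambda_j
    print(f"eps={eps:4.1f}  cond(A)={np.linalg.cond(A):.2e}")
```

The growth of cond(A) as ε → 0 is exactly the regime that Contour-Padé and RBF-QR are designed to handle stably.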
Historically, the most common way to implement RBFs is in a global sense. That is, the value of a function or any of its derivatives at a node location is a linear combination of all the function values over the entire domain, just as in a pseudospectral method. If infinitely smooth RBFs are used, this leads to exponential convergence of the RBF interpolant for smooth data [21]. As discussed in [19], global RBF differentiation matrices (DMs) are full, requiring O(N³) floating point operations (FLOPs) to assemble for a layout of N nodes and O(N²) FLOPs per time step.
Alternatively, one can use RBF-generated finite differences (RBF-FD) to obtain sparse DMs (note: for pure interpolation, compactly supported RBFs can also produce sparse matrices [54]). RBF-FD was first introduced by Tolstykh in 2000 [49], but it was the simultaneous, yet independent, efforts in [8], [46], [50], [56] that gave the method its real start. The RBF-FD method is similar in concept to classical finite differences (FD) but differs in that the underlying differentiation weights are exact for RBFs rather than polynomials. RBF-FD shares advantages with global RBF methods, such as the ability to function without an underlying mesh, to extend easily to higher dimensions, and to afford large time steps; however, spectral accuracy is lost. Some of the advantages of RBF-FD include high computational speed together with high-order accuracy (6th- to 10th-order accuracy is common) and the opportunity for parallelization. As in FD, increasing the stencil size n increases the accuracy of the approximation. Given N total nodes in the domain (such as on the surface of a sphere), N linear systems, each of size n × n, are solved to calculate the differentiation weights. Since n ≪ N, the RBF-FD preprocessing complexity is dominated by O(n³N) operations, much lower than the O(N³) of the global RBF method, with the cost per time step also reduced to O(nN). RBF-FD has been successfully employed for a variety of problems, including Hamilton–Jacobi equations [8], convection–diffusion problems [9], [48], incompressible Navier–Stokes equations [10], [46], transport on the sphere [23], and the shallow water equations [17].
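To make the n × n local solves concrete, the following sketch (ours, not the authors' code; a Gaussian basis and invented helper names are assumed) computes the RBF-FD weights for d/dx on a single stencil. By construction the weights are exact for the RBFs themselves, which the final lines check:

```python
import numpy as np

# Sketch of RBF-FD weight generation for one stencil.  For each of the
# N nodes one such n x n system is solved, giving the O(n^3 N)
# preprocessing cost quoted above.

def gaussian(r, eps):
    return np.exp(-(eps * r) ** 2)

def d_dx_gaussian(x, xj, eps):
    """d/dx of a Gaussian RBF centered at xj, evaluated at x."""
    r = np.linalg.norm(x - xj, axis=-1)
    return -2.0 * eps**2 * (x[..., 0] - xj[..., 0]) * gaussian(r, eps)

def rbf_fd_weights(nodes, center, eps):
    """Weights w such that sum_j w_j u(x_j) ~ du/dx at the stencil center."""
    r = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    A = gaussian(r, eps)                       # n x n collocation matrix
    b = d_dx_gaussian(center, nodes, eps)      # operator applied to each RBF
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
stencil = rng.random((9, 2))                   # n = 9 stencil nodes
w = rbf_fd_weights(stencil, stencil[0], eps=2.0)

# Exactness check: applying the weights to an RBF centered at any
# stencil node reproduces its analytic x-derivative at the center.
u = gaussian(np.linalg.norm(stencil - stencil[3], axis=-1), 2.0)
print(w @ u, d_dx_gaussian(stencil[0], stencil[3], 2.0))
```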
As N grows larger, it behooves us to work on parallel architectures, be they CPUs or GPUs. With regard to the latter, there is some research on leveraging RBFs on GPUs in the fields of visualization [13], [53], surface reconstruction [7], [12], and neural networks [5]. However, research on the parallelization of RBF algorithms to solve PDEs on multiple CPU/GPU architectures is essentially non-existent. We have found three studies that have addressed this topic, none of which implement RBF-FD; rather, they take the route of domain decomposition for global RBFs (similar to a spectral element approach). In [14], Divo and Kassab introduce subdomains with artificial boundaries that are processed independently. Their implementation was designed for a 36-node cluster, but benchmarks and scalability tests are not provided. Kosec and Šarler [36] parallelize coupled heat transfer and fluid flow models using OpenMP on a single workstation with one dual-core processor, achieving a speedup of 1.85x over serial execution, although no scaling results are reported. Yokota, Barba and Knepley [59] apply a restrictive additive Schwarz domain decomposition to parallelize global RBF interpolation of more than 50 million nodes on 1024 CPU processors. Only Schmidt et al. [45] have accelerated a global RBF method for PDEs on the GPU. Their MATLAB implementation applies global RBFs to solve the linearized shallow water equations, utilizing the AccelerEyes Jacket [2] library to target a single GPU.
To our knowledge, this paper presents the first implementation of RBF-FD to span multiple CPUs, with each CPU paired one-to-one with a GPU. It is thus also the first known GPU-accelerated implementation of RBF-FD. Within the scope of this paper, we detail our method for distributing RBF-FD across multiple CPU/GPU processors and emphasize numerical validation of the implementation rather than optimization strategies, which we will consider in future work. The calculations are performed on Keeneland, a high-performance computing installation supported by the National Science Foundation and located at Oak Ridge National Lab. Keeneland currently has 240 CPUs accompanied by 360 NVIDIA Fermi-class GPUs, with at least double that number expected by the end of 2012 [51].
The remainder of this paper is organized as follows: Section 2 introduces RBF-FD via interpolation. Section 3 details our parallelization strategies for the Keeneland system, including data partitions that are handled concurrently by different CPU processes and the data-parallel explicit time stepping scheme for the GPU. In Section 4, our implementation is verified against well-known hyperbolic PDE test cases on the sphere, advection of a cosine bell and vortex wrap-up. Finally, performance benchmarks and results are presented in Section 5, followed by conclusions and proposals for future optimization strategies in Section 6.
Calculating RBF-FD weights
Given a set of function values u(x_j) on a set of N nodes x_j, the operator L acting on u and evaluated at x_i is approximated by a weighted combination of the function values u(x_j) in a small neighborhood of x_i, where n defines the size of the stencil: Lu(x_i) ≈ Σ_{j=1}^{n} w_j u(x_j). The RBF-FD weights w_j are found by enforcing that they are exact within the space spanned by the RBFs φ(‖x − x_j‖), centered at the nodes x_j, with ‖x − x_j‖ being the distance between x and x_j.
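The procedure can be sketched end-to-end: for each of the N nodes, gather an n-node stencil, solve the local collocation system for the weights w_j, and store (weights, neighbor indices) as a compact sparse operator. The code below is illustrative only (Gaussian basis assumed; all helper names are ours, not the paper's):

```python
import numpy as np

# Sketch: assembling the full RBF-FD operator for d/dx over N scattered
# nodes.  Each "row" of the differentiation matrix holds only n nonzero
# weights, stored together with the stencil's neighbor indices.

def gaussian(r, eps):
    return np.exp(-(eps * r) ** 2)

def assemble_ddx(nodes, n, eps):
    N = len(nodes)
    weights = np.empty((N, n))
    neighbors = np.empty((N, n), dtype=int)
    for i, xc in enumerate(nodes):
        # the n nearest neighbors of node i form its stencil
        idx = np.argsort(np.linalg.norm(nodes - xc, axis=1))[:n]
        st = nodes[idx]
        r = np.linalg.norm(st[:, None, :] - st[None, :, :], axis=-1)
        A = gaussian(r, eps)                       # n x n local system
        rc = np.linalg.norm(xc - st, axis=1)
        b = -2.0 * eps**2 * (xc[0] - st[:, 0]) * gaussian(rc, eps)
        weights[i], neighbors[i] = np.linalg.solve(A, b), idx
    return weights, neighbors

def apply_op(weights, neighbors, u):
    """Sparse mat-vec: (Du)_i = sum_j w_ij * u[neighbors[i, j]]."""
    return np.einsum("ij,ij->i", weights, u[neighbors])

rng = np.random.default_rng(2)
nodes = rng.random((200, 2))
W, nb = assemble_ddx(nodes, n=15, eps=3.0)
u = np.sin(np.pi * nodes[:, 0])
dudx = apply_op(W, nb, u)   # approximates d/dx sin(pi*x) = pi*cos(pi*x)
```

The (weights, neighbors) layout, rather than a general sparse-matrix format, is the kind of regular, fixed-width structure that maps naturally onto data-parallel GPU kernels.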
Targeting multi-CPU/GPU
Parallelization of the RBF-FD method is achieved at two levels. First, the physical domain of the problem—in this case, the unit sphere—is partitioned into overlapping subdomains, each handled by a different CPU process. All CPUs operate independently to compute/load RBF-FD stencil weights, run diagnostic tests and perform other initialization tasks. A CPU computes only weights corresponding to stencils centered in the interior of its partition. After initialization, CPUs continue concurrently
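The first level of this decomposition can be illustrated with a toy example (a unit square split into slabs rather than the paper's sphere; all helper names are hypothetical): each process owns the stencils centered in its partition, and the stencil nodes that fall outside it form the overlap that must be synchronized with neighboring processes:

```python
import numpy as np

# Sketch of the domain partition: each part owns the stencils *centered*
# inside it; stencil nodes owned by other parts form the overlap ("halo")
# set that MPI communication must keep synchronized.

def partition(nodes, neighbors, nparts):
    """Split nodes into x-slabs; return per-part (interior, halo) index sets."""
    part_of = np.minimum((nodes[:, 0] * nparts).astype(int), nparts - 1)
    parts = []
    for p in range(nparts):
        interior = np.where(part_of == p)[0]
        # every node touched by a stencil centered in this part
        touched = np.unique(neighbors[interior].ravel())
        halo = np.setdiff1d(touched, interior)   # owned by other parts
        parts.append((interior, halo))
    return parts

rng = np.random.default_rng(3)
nodes = rng.random((200, 2))
# nearest-neighbor stencils of size n = 10
d = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
neighbors = np.argsort(d, axis=1)[:, :10]

for p, (interior, halo) in enumerate(partition(nodes, neighbors, 4)):
    print(f"part {p}: {len(interior)} interior nodes, {len(halo)} halo nodes")
```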
Numerical validation
Here, we present the first results in the literature for parallelizing RBF-FDs on multi-CPU and multi-GPU architectures for solving PDEs. To verify our multi-CPU, single GPU and multi-GPU implementations, two hyperbolic PDEs on the surface of the sphere are tested: 1) vortex roll-up [38], [39] and 2) solid body rotation [32]. These tests were chosen since they are not only standard in the numerical literature, but also for the development of RBFs in solving PDEs on the sphere [16], [18], [22],
Performance benchmarks
In this section, we present performance results under a variety of conditions: two different architectures, varying stencil sizes, and different grid resolutions. We also present some timings that break down the contributions from computation and communication between GPU and CPU, as well as between CPUs.
All timings correspond to the average run-time of a full RK4 time-step. Most of our results use speedup as a measure of comparison. Here, speedup is the ratio of serial execution time on a
Conclusion
This paper presents the first known parallel implementation of the RBF-FD method for PDEs. Parallelization is achieved at two levels:
1. The problem geometry is partitioned and distributed across multiple CPUs, with common nodes of overlapping partitions synchronized by MPI communication.
2. Derivative approximations and explicit time-stepping are further parallelized on concurrent GPU hardware, with one GPU supplementing each CPU involved in computation.
Acknowledgements
This work is supported by NSF awards DMS-#0934331 (FSU), DMS-#0934317 (NCAR) and ATM-#0602100 (NCAR).
Many thanks to Bengt Fornberg, Grady Wright, Kiran Katta, Ian Johnson, Steve Henke and Joseph Lohmeier for helpful discussion and insight.
References (59)
- et al., Numerical methods for high dimensional Hamilton–Jacobi equations using radial basis functions, Journal of Computational Physics (2004)
- et al., Radial basis functions: developments and applications to planetary scale flows, Computers & Fluids (2011)
- et al., Rotational transport on a sphere: local node refinement with radial basis functions, Journal of Computational Physics (2010)
- et al., RBF-generated finite differences for nonlinear transport on a sphere: shallow water simulations, Journal of Computational Physics (2012)
- et al., Transport schemes on a sphere using radial basis functions, Journal of Computational Physics (2007)
- et al., Observations on the behavior of radial basis function approximations near boundaries, Computers & Mathematics with Applications (2002)
- et al., Stabilization of RBF-generated finite difference methods for convective PDEs, Journal of Computational Physics (2011)
- et al., On choosing a radial basis function and a shape parameter when solving a convective PDE on a sphere, Journal of Computational Physics (2008)
- et al., Stable computation of multiquadric interpolants for all values of the shape parameter, Computers & Mathematics with Applications (2004)
- et al., Spectral transform solutions to the shallow water test set, Journal of Computational Physics (1995)