Solution to PDEs using radial basis function finite-differences (RBF-FD) on multiple GPUs

https://doi.org/10.1016/j.jcp.2012.06.030

Abstract

This paper presents parallelization strategies for the radial basis function-finite difference (RBF-FD) method. As a generalized finite-differencing scheme, the RBF-FD method functions without an underlying mesh to structure the nodes. It offers high-order accurate approximations and scales as O(N) per time step, where N is the total number of nodes. To our knowledge, this is the first implementation of the RBF-FD method to leverage GPU accelerators for the solution of PDEs. Additionally, this implementation is the first to span both multiple CPUs and multiple GPUs. OpenCL kernels target the GPUs, while inter-processor communication and synchronization are managed by the Message Passing Interface (MPI). We verify our implementation of the RBF-FD method with two hyperbolic PDEs on the sphere, and demonstrate up to 9x speedup on a commodity GPU with unoptimized kernel implementations. On a high-performance cluster, the method achieves up to 7x speedup for the maximum problem size of 27,556 nodes.

Introduction

Numerical methods based on radial basis functions (RBFs) are rapidly gaining popularity for the solution of partial differential equations (PDEs). With a history extending back four decades for RBF interpolation schemes [30], and two decades for RBFs applied to solving PDEs [33], many avenues of research remain untouched within their realm. Being a meshless method, RBF methods excel at solving problems that require geometric flexibility with scattered node layouts in n-dimensional space. They naturally extend into higher dimensions without significant increase in programming complexity [18], [57]. In addition to competitive accuracy and convergence compared with other state-of-the-art methods [15], [16], [18], [19], [57], they also boast stability for large time steps.

Examples of infinitely smooth RBFs in 2D space are shown in Fig. 1 (good references on non-smooth and compactly supported RBFs, which are not considered in this paper due to their lower order of convergence, are [6], [55]). RBF methods are based on a superposition of translates of these radially symmetric functions, providing a linearly independent but non-orthogonal basis used to interpolate between nodes in n-dimensional space. An example of RBF interpolation in 2D using 15 Gaussians is shown in Fig. 2, where ϕ_j(r(x)) is an RBF centered at the node x_j, j = 1, …, n. The radial coordinate is r = √((x − x_j)² + (y − y_j)²).

Infinitely smooth RBFs depend on a shape (or support) parameter ϵ that controls the width of the function, so that the functional form becomes ϕ(ϵr). Decreasing ϵ increases the support of the RBF and, in most cases, the accuracy of the interpolation, but worsens the conditioning of the RBF interpolation problem [44]. Fortunately, recent algorithms such as Contour-Padé [26] and RBF-QR [22], [24] allow for numerically stable computation of interpolants in the nearly flat RBF regime (i.e., ϵ → 0), where high accuracy has been observed [25], [37].
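The trade-off between flatness and conditioning can be seen directly by forming the Gaussian interpolation matrix for two values of ϵ. A minimal NumPy sketch (the node count and ϵ values are illustrative choices, not taken from the paper):

```python
import numpy as np

def gaussian_interp_matrix(x, eps):
    """Interpolation matrix A_ij = phi(eps * |x_i - x_j|) for Gaussian RBFs."""
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-(eps * d) ** 2)

# 20 equispaced nodes on [0, 1]
x = np.linspace(0.0, 1.0, 20)
cond_flat = np.linalg.cond(gaussian_interp_matrix(x, eps=1.0))     # nearly flat basis
cond_peaked = np.linalg.cond(gaussian_interp_matrix(x, eps=20.0))  # narrow, peaked basis
# the flatter basis yields a far worse-conditioned interpolation matrix
```

The condition number of the nearly flat case exceeds that of the peaked case by many orders of magnitude, which is precisely the ill-conditioning that Contour-Padé and RBF-QR circumvent.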

Historically, the most common way to implement RBFs is in a global sense. That is, the value of a function or any of its derivatives at a node location is a linear combination of all the function values over the entire domain, just as in a pseudospectral method. If using infinitely smooth RBFs, this leads to exponential convergence of the RBF interpolant for smooth data [21]. As discussed in [19], global RBF differentiation matrices (DM) are full, requiring O(N3) floating point operations (FLOPs) to assemble for a given node layout and O(N2) to time-step.
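For concreteness, a dense global DM can be formed in 1D as D = B A⁻¹, where A samples the RBFs at the nodes and B samples their derivatives; by construction, D differentiates the basis functions themselves exactly. A minimal NumPy sketch with Gaussian RBFs (node count and shape parameter are illustrative assumptions):

```python
import numpy as np

N, eps = 15, 10.0                # illustrative node count and shape parameter
x = np.linspace(0.0, 1.0, N)
dx = x[:, None] - x[None, :]

A = np.exp(-(eps * dx) ** 2)                        # A_ij = phi(eps * |x_i - x_j|)
B = -2.0 * eps**2 * dx * np.exp(-(eps * dx) ** 2)   # B_ij = d/dx phi_j evaluated at x_i
D = np.linalg.solve(A, B.T).T                       # dense global DM: D = B A^{-1} (A is symmetric)

# D applied to samples of a basis function phi_k returns its exact derivative samples
k = N // 2
err = np.max(np.abs(D @ A[:, k] - B[:, k]))
```

Forming D costs the O(N³) solve noted above, and each application of the full matrix costs O(N²), which is what motivates the sparse RBF-FD alternative discussed next.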

Alternatively, one can use RBF-generated finite differences (RBF-FD) to introduce sparse DMs (note: for pure interpolation, compactly supported RBFs can also produce sparse matrices [54]). RBF-FD was first introduced by Tolstykh in 2000 [49], but it was the simultaneous, yet independent, efforts in [8], [46], [50], [56] that gave the method its real start. The RBF-FD method is similar in concept to classical finite differences (FD) but differs in that the underlying differentiation weights are exact for RBFs rather than polynomials. RBF-FD shares advantages with global RBF methods, such as the ability to function without an underlying mesh, to extend easily to higher dimensions, and to afford large time steps; however, spectral accuracy is lost. Some of the advantages of RBF-FD include high computational speed together with high-order accuracy (6th- to 10th-order accuracy is common) and the opportunity for parallelization. As in FD, increasing the stencil size n increases the accuracy of the approximation. Given N total nodes in the domain (such as on the surface of a sphere), N linear systems, each of size n × n, are solved to calculate the differentiation weights. Since n ≪ N, the RBF-FD preprocessing complexity is dominated by O(N) operations, much lower than the O(N³) of the global RBF method, with the cost per time step also being O(N). RBF-FD has been successfully employed for a variety of problems, including Hamilton–Jacobi equations [8], convection–diffusion problems [9], [48], incompressible Navier–Stokes equations [10], [46], transport on the sphere [23], and the shallow water equations [17].

As N grows larger, it behooves us to work on parallel architectures, be it CPUs or GPUs. With regard to the latter, there is some research on leveraging RBFs on GPUs in the fields of visualization [13], [53], surface reconstruction [7], [12], and neural networks [5]. However, research on the parallelization of RBF algorithms to solve PDEs on multiple CPU/GPU architectures is essentially non-existent. We have found three studies that have addressed this topic, none of which implement RBF-FD; instead, they take the route of domain decomposition for global RBFs (similar to a spectral element approach). In [14], Divo and Kassab introduce subdomains with artificial boundaries that are processed independently. Their implementation was designed for a 36-node cluster, but benchmarks and scalability tests are not provided. Kosec and Šarler [36] parallelize coupled heat transfer and fluid flow models using OpenMP on a single workstation with one dual-core processor. They achieved a speedup factor of 1.85x over serial execution, although there were no results from scaling tests. Yokota, Barba and Knepley [59] apply a restricted additive Schwarz domain decomposition to parallelize global RBF interpolation of more than 50 million nodes on 1024 CPU processors. Only Schmidt et al. [45] have accelerated a global RBF method for PDEs on the GPU. Their MATLAB implementation applies global RBFs to solve the linearized shallow water equations, utilizing the AccelerEyes Jacket [2] library to target a single GPU.

To our knowledge, this paper presents the first implementation of RBF-FD to span multiple CPUs, with each CPU paired one-to-one with a GPU. We thus also present the first known GPU-accelerated implementation of RBF-FD. Within the scope of this paper, we detail our method for spanning RBF-FD across multiple CPU/GPU processors and emphasize numerical validation of the implementation rather than optimization strategies; we will consider optimization in future work. The calculations are performed on Keeneland, a high-performance computing installation supported by the National Science Foundation and located at Oak Ridge National Lab. Keeneland currently has 240 CPUs accompanied by 360 NVIDIA Fermi-class GPUs, with at least double that number expected by the end of 2012 [51].

The remainder of this paper is organized as follows: Section 2 introduces RBF-FD via interpolation. Section 3 details our parallelization strategies for the Keeneland system, including data partitions that are handled concurrently by different CPU processes and the data-parallel explicit time stepping scheme for the GPU. In Section 4, our implementation is verified against well-known hyperbolic PDE test cases on the sphere, advection of a cosine bell and vortex wrap-up. Finally, performance benchmarks and results are presented in Section 5, followed by conclusions and proposals for future optimization strategies in Section 6.

Section snippets

Calculating RBF-FD weights

Given a set of function values, {u(x_j)}_{j=1}^{N}, on a set of N nodes {x_j}_{j=1}^{N}, the operator L acting on u(x) evaluated at x_j is approximated by a weighted combination of function values, {u(x_i)}_{i=1}^{n}, in a small neighborhood of x_j, where n ≪ N defines the size of the stencil: Lu(x)|_{x=x_j} ≈ ∑_{i=1}^{n} w_i u(x_i) + w_{n+1} p_0. The RBF-FD weights, w_i, are found by enforcing that they are exact within the space spanned by the RBFs ϕ_i(ϵr) = ϕ(ϵ‖x − x_i‖), centered at the nodes {x_i}_{i=1}^{n}, with r = ‖x − x_i‖ being the distance between
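The small augmented linear system implied above can be sketched in NumPy for 1D first-derivative weights with Gaussian RBFs and a constant term p₀ (the stencil size, spacing, and shape parameter below are illustrative choices, not the paper's):

```python
import numpy as np

def rbf_fd_weights(x, xc, eps):
    """RBF-FD weights for d/dx at xc over the stencil nodes x, using
    Gaussian RBFs phi(eps*r) = exp(-(eps*r)^2) with constant augmentation."""
    n = len(x)
    d = x[:, None] - x[None, :]
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = np.exp(-(eps * d) ** 2)   # A_ij = phi(eps * |x_i - x_j|)
    M[:n, n] = 1.0                        # constant augmentation column
    M[n, :n] = 1.0                        # constraint: sum of weights = L(1) = 0
    rhs = np.zeros(n + 1)
    rhs[:n] = -2.0 * eps**2 * (xc - x) * np.exp(-(eps * (xc - x)) ** 2)  # d/dx phi_i at xc
    return np.linalg.solve(M, rhs)[:n]

h = 0.1
x = np.arange(5) * h                      # n = 5 stencil nodes
xc = x[2]                                 # stencil center
w = rbf_fd_weights(x, xc, eps=2.0)
approx = w @ np.sin(x)                    # approximates d/dx sin at xc
```

The constraint row forces the weights to sum to zero, so the stencil is exact for constants, mirroring the w_{n+1} p_0 augmentation term.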

Targeting multi-CPU/GPU

Parallelization of the RBF-FD method is achieved at two levels. First, the physical domain of the problem—in this case, the unit sphere—is partitioned into overlapping subdomains, each handled by a different CPU process. All CPUs operate independently to compute/load RBF-FD stencil weights, run diagnostic tests and perform other initialization tasks. A CPU computes only weights corresponding to stencils centered in the interior of its partition. After initialization, CPUs continue concurrently
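The partitioning step can be sketched in serial Python (leaving out the MPI layer the paper uses): each node gets an owner process, stencils are built from nearest neighbours, and a partition's halo is the set of stencil nodes it must receive from other partitions. Function and variable names here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def partition_with_halo(x, n_parts, stencil_size):
    """Block-partition 1D nodes and find each partition's halo: the stencil
    nodes owned by other partitions that it needs for its own stencils."""
    N = len(x)
    owner = np.minimum(np.arange(N) * n_parts // N, n_parts - 1)
    # k-nearest-neighbour stencils (brute force, for clarity only)
    dist = np.abs(x[:, None] - x[None, :])
    stencils = np.argsort(dist, axis=1)[:, :stencil_size]
    halos = []
    for p in range(n_parts):
        centers = np.where(owner == p)[0]          # stencils centered in partition p
        needed = np.unique(stencils[centers])      # every node those stencils touch
        halos.append(set(needed[owner[needed] != p].tolist()))
    return owner, stencils, halos

x = np.linspace(0.0, 1.0, 16)
owner, stencils, halos = partition_with_halo(x, n_parts=2, stencil_size=3)
# nodes near the partition boundary appear in the neighbouring partition's halo
```

In the paper's setting, these halo sets correspond to the node values exchanged via MPI after each update; here they are only computed, not communicated.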

Numerical validation

Here, we present the first results in the literature on parallelizing RBF-FD across multi-CPU and multi-GPU architectures for solving PDEs. To verify our multi-CPU, single-GPU and multi-GPU implementations, two hyperbolic PDEs on the surface of the sphere are tested: (1) vortex roll-up [38], [39] and (2) solid body rotation [32]. These tests were chosen because they are standard not only in the numerical literature but also in the development of RBF methods for solving PDEs on the sphere [16], [18], [22],

Performance benchmarks

In this section, we present performance results under a variety of conditions: two different architectures, varying stencil sizes, and different grid resolutions. We also present some timings that break down the contributions from computation and communication between GPU and CPU, as well as between CPUs.

All timings correspond to the average run-time of a full RK4 time-step. Most of our results use speedup as a measure of comparison. Here, speedup is the ratio of serial execution time on a

Conclusion

This paper presents the first known parallel implementation of the RBF-FD method for PDEs. Parallelization is achieved at two levels:

  1. The problem geometry is partitioned and distributed across multiple CPUs, with common nodes of overlapping partitions synchronized by MPI communication.

  2. Derivative approximations and explicit time-stepping are further parallelized on concurrent GPU hardware, with one GPU supplementing each CPU involved in computation.

Our MPI-based multi-CPU and multi-GPU

Acknowledgements

This work is supported by NSF awards DMS-#0934331 (FSU), DMS-#0934317 (NCAR) and ATM-#0602100 (NCAR).

Many thanks to Bengt Fornberg, Grady Wright, Kiran Katta, Ian Johnson, Steve Henke and Joseph Lohmeier for helpful discussion and insight.

References (59)

  • E.J. Kansa

    Multiquadrics–A scattered data approximation scheme with applications to computational fluid-dynamics. I. Surface approximations and partial derivative estimates

    Computers & Mathematics with Applications

    (1990)
  • E. Larsson et al.

    A numerical study of some radial basis function based solution methods for elliptic PDEs

    Computers & Mathematics with Applications

    (2003)
  • C. Shu et al.

    Local radial basis function-based differential quadrature method and its application to solve two-dimensional incompressible Navier–Stokes equations

    Computer Methods in Applied Mechanics and Engineering

    (2003)
  • G.B. Wright et al.

    Scattered node compact finite difference-type formulas generated from radial basis functions

    Journal of Computational Physics

    (2006)
  • R. Yokota et al.

    PetRBF – a parallel O(N) algorithm for radial basis function interpolation with Gaussians

    Computer Methods in Applied Mechanics and Engineering

    (2010)
  • BOOST C++ Libraries....
  • AccelerEyes. Jacket User Guide – The GPU Engine for MATLAB, 1.2.1ed., November...
  • N. Bell M. Garland, Implementing sparse matrix-vector multiplication on throughput-oriented processors, in: Proceedings...
  • E.F. Bollig

    Multi-GPU solutions of geophysical PDEs with radial basis function-generated finite differences

    PhD thesis Florida State University

    (2012)
  • A. Brandstetter et al.

    Radial basis function networks GPU based implementation

    IEEE Transactions on Neural Networks

    (2008)
  • M.D. Buhmann

    Radial Basis Functions: Theory and Implementations

    (2003)
  • J.C. Carr et al.

    Smooth Surface Reconstruction from Noisy Range Data

  • G. Chandhini et al.

    Local RBF-FD solutions for steady convection-diffusion problems

    International Journal for Numerical Methods in Engineering

    (2007)
  • P.P. Chinchapatnam, K. Djidjeli, P.B. Nair, M. Tan, A compact RBF-FD based meshless method for the incompressible...
  • M. Connor et al.

    Fast construction of k-nearest neighbor graphs for point clouds

    IEEE Transactions on Visualization and Computer Graphics

    (2010)
  • A. Corrigan et al.

    Computing and rendering implicit surfaces composed of radial basis functions on the GPU

    International Workshop on Volume Graphics

    (2005)
  • N. Cuntz, M. Leidl, T. Darmstadt, G. Kolb, C. Salama, M. Böttinger, D. Klimarechenzentrum, G. Hamburg, GPU-based...
  • E. Divo et al.

    An efficient localized radial basis function meshless method for fluid flow and conjugate heat transfer

    Journal of Heat Transfer

    (2007)
  • N. Flyer et al.

    A radial basis function method for the shallow water equations on a sphere

    Proceedings of the Royal Society A

    (2009)