Elsevier

Computer Physics Communications

Volume 235, February 2019, Pages 324-335
Computer Physics Communications

BlackNUFFT: Modular customizable black box hybrid parallelization of type 3 NUFFT in 3D

https://doi.org/10.1016/j.cpc.2018.10.005Get rights and content

Abstract

Many applications benefit from an efficient Discrete Fourier Transform (DFT) between arbitrarily spaced points. The Non Uniform Fast Fourier Transform reduces the computational cost of such operation from O(N2) to O(NlogN) exploiting gridding algorithms and a standard Fast Fourier Transform on an equi-spaced grid. The parallelization of the NUFFT of type 3 (between arbitrary points in space and frequency) still poses some challenges: we present a novel and flexible hybrid parallelization in a MPI-multithreaded environment exploiting existing HPC libraries on modern architectures. To ensure the reliability of the developed library, we exploit continuous integration strategies using Travis CI. We present performance analyses to prove the effectiveness of our implementation, possible extensions to the existing library, and an application of NUFFT type 3 to MRI image processing.

Program summary

Program Title: BlackNUFFT

Program Files doi: http://dx.doi.org/10.17632/vxfj6x2p8x.1

Licensing provisions: LGPL

Programming language: C++

External routines/libraries: deal.II , FFTW, PFFT

Nature of problem: Provide a modular and extensible implementation of a parallel Non Uniform Fast Fourier Transform of type 3.

Solution method: Use of hybrid shared distributed memory paradigm to achieve high level of efficiency. We exploit existing HPC library following best practices in scientific computing (as continuous integration via TravisCI) to reach higher complexities and guarantee the accuracy of the solution proposed.

Introduction

The standard Fast Fourier algorithm relies on a distribution of the points on a regular equispaced grid. However, many applications benefit from an accurate and reliable Fourier transform between N arbitrarily spaced points. Many image reconstruction techniques, such as Magnetic Resonance Imaging, are based on a non Cartesian grid in the frequency domain [1], [2], [3]. In [4], [5] the authors propose accelerated convolution techniques based on the discrete Fourier Transform between arbitrary points in both space and frequency domains. The computational cost of the standard discrete transform increases quadratically, quickly becoming unbearable even on modern computational platforms.

A solution to this problem is the Non Uniform Fast Fourier Transform (NUFFT) developed by Dutt and Rokhlin [6]. The authors provide a deep theoretical analysis to approximate the Fourier transform using the classical Fast Fourier Transform algorithm. Another description of such algorithm has been addressed by Leslie Greengard and June-Yub Lee in [7]. In the literature there are 3 different types of NUFFT: type 1 operating between arbitrary points in space and a regular grid in frequency, type 2 correlating a regular grid in space and arbitrary points in frequency, and type 3 dealing with arbitrary points in both spaces.

The key idea of NUFFT is to transfer the non equi-spaced data on a uniform grid to be used with standard FFT algorithms. In [7] the authors use fast convolutions with a Gaussian function to create a uniform grid where standard Fast Fourier Transform algorithms can be used. This is a key step to retrieve the solution in an affordable time, since Gaussian convolutions can be efficiently computed by means of Fast Gaussian Gridding, reducing the overall computational time. Another possible solution is the usage of Kaiser–Bessel window or the min–max interpolator. In recent years the latter has been efficiently implemented in the library PNFFT [8]. In [9] the authors propose a parallelization of type 1 NUFFT based on the P3DFFT [10]. Type 3 NUFFT, developed in [7], has been successfully applied in fast numerical convolutions [4], and has applications in many fields of engineering. Several parallelizations have been performed for such an algorithm: in [11], [12] highly scalable optimization of the library on modern multicore systems are proposed, in [13] the authors use the hardware to accelerate the NUFFT, while in [14] the multicore architecture of GPUs is used to tune and optimize NUFFT.

While the previous implementations have proved to be effective, they are extremely hardware specific and optimized. Moreover, their modification is far from trivial from the user’s perspective. We propose an OpenSOURCE flexible parallelization strategy of the type 3 NUFFT algorithm based on existing High Performance Computing libraries: we call such a library BlackNUFFT [15]. We use Intel Threading Building Blocks (TBB) for shared memory parallelism [16], [17], together with the standard Message Passing Interface (MPI) for distributed parallelism. High modularity is a key aspect of our implementation, allowing the user to plug-in new algorithms for each aspect of NUFFT. We use a black box approach for FFT on the fine grid, making it possible to interchange back-end libraries for the FFT algorithm. We provide a first implementation that uses by default FFTW [18] as back-end FFT. This choice is mainly due to the extensive documentation and high reliability of such a library. We also propose a second implementation using PFFT [19] as back-end FFT library. In this way we improve the parallel efficiency of our library and we show the advantages of using a black-box modular approach that allows the interchanging of its building blocks. We compare the presented BlackNUFFT with the library developed by Lee and Greengard which is freely available under GPL license [20]. We use existing implementations of distributed vectors provided by existing OpenSOURCE libraries. In particular we refer to the implementation of the deal.II library [21], [22], [23], which provides both high performance on modern architectures and the possibility of easily switching between 32 and 64 bits indexing. This capability is a keystone to reach higher computational complexity. We present a performance analysis considering multithread and multiprocessor environments for the current implementation of the library. BlackNUFFT is available under LGPL license on GitHub [15].

The work is organized as follows: in Section 2 we collect some considerations on theoretical aspects of the NUFFT [7], [6] and we describe the actual structure of the algorithm. In Section 3 we exploit shared memory parallelism to achieve a first multicore parallel analysis. In Section 4 we combine shared and distributed memory paradigms to reach higher levels of scalability. Section 5 describes the coupling with the parallel pruned library PFFT [19]. In Section 6 we briefly describe the application of BlackNUFFT to the reconstruction of MRI images. Section 7 describes possible procedures to expand BlackNUFFT capabilities.

Section snippets

Mathematical background

We follow [6], [7] to introduce the main aspects of NUFFT of type 3. Given a set of N complex values in the three dimensional space f(x) we define its discrete Fourier transform (DFT) to a set of N points in the frequency space as F(k)=1Ni=0N1f(x)eixk,with k=(k1,k2,k3) and x=(x1,x2,x3). We also have that N2k1,k2,k3<N21. We can define also the inverse DFT as f(x)=i=N2N21F(k)eixk.DFTs defined in (1) and (2) are characterized by a computational cost O(N2). NUFFT algorithms are based

Shared memory parallelism

All modern CPUs support multicore shared memory parallelism. We use Intel Threading Building Block [16] to exploit this possibility and achieve higher efficiency. This tool has been successfully adopted in many high performing library, as ASPECT [25], or the deal.II library [26]. Moreover it introduces the use of the TaskScheduler concept that allows for higher level of optimization in our library. We follow the shared memory parallelization strategies introduced in [16], [17] and we apply it

Distributed memory parallelism

We use the standard Message Passing Interface between different processors to obtain distributed memory parallelism, and to overcome the two main bottlenecks of the multicore parallelization: the input gridding (through FGG) and the three dimensional execution of the FFT on the fine grid. We start by presenting the coding paradigm we have followed and we explain in detail each function of NUFFT, and finally we analyze the scaling performances of our new implementation.

Coupling with a parallel pruned FFT

We prove the effective modularity of BlackNUFFT interchanging the back-end FFT library. In particular we provide an interface to PFFT [19], which is a generalization of FFTW to pruned FFT. This is straightforward given the modularity of the BlackNUFFT software. Once the back-end library is properly compiled and linked, the modifications required by BlackNUFFT are minimal and mainly regard the peculiarity of using a pruned FFT library instead of a classical FFT. We believe that the usage of such

Image reconstruction from MRI non uniform data

In this section we describe a possible application of the presented library. The raw data of Magnetic Resonance Imaging (MRI) is measured in the domain of spatial frequency (k-space). Such data needs to be transformed into an image in the space domain (r-domain) for diagnosis purposes. While the space domain can be assumed to be a Cartesian grid, the k-space data is usually sampled following non uniform schemes like spiral or radial scans. Such sampling trajectories benefit from non uniform FFT

Introducing new features in the BlackNUFFT framework

This sectiondemonstrates how new features can be incorporated into the BlackNUFFT framework. We focus our attention on the 2 most time demanding algorithms of the library, the griddings and the three dimensional FFT. Both options are selected through the initializing function init_nufft through two different string options provided by the main function.

Conclusions

In this work we presented BlackNUFFT, a parallelization of three-dimensional type 3 NUFFT, from arbitrary points in space to different arbitrary points in frequency and vice versa. We have shown that such an algorithm can be easily applied to MRI imaging reconstruction techniques as the ones proposed in [1], [2]. We have implemented the library in C++ and then parallelized it using a hybrid paradigm combining MPI and Intel Threading Building Blocks. We have followed [16], [17] for the shared

Acknowledgments

The author would like to acknowledge Professor Frano̧is Alouges and Doctor Matthieu Aussal from CMAP — Centre de Mathématiques Appliquées — Ecole polytechnique for the support and the continuous insights. The author would also like to thank Doctor Michael Pippig for the suggestions regarding the usage of PFFT.

References (32)

  • LeeJ.Y. et al.

    J. Comput. Phys.

    (2005)
  • GreengardL. et al.

    J. Comput. Phys.

    (1987)
  • GiulianiN. et al.

    Eng. Anal. Bound. Elem.

    (2015)
  • GiulianiN. et al.

    Adv. Eng. Softw.

    (2018)
  • MolaA. et al.

    Eng. Anal. Bound. Elem.

    (2013)
  • KnoppT. et al.

    Int. J. Biomed. Imaging

    (2007)
  • GreengardL. et al.

    Commun. Appl. Math. Comput. Sci.

    (2006)
  • DesplanquesB. et al.

    IEEE Trans. Nucl. Sci,

    (2002)
  • AlougesF. et al.

    Numer. Algorithms

    (2015)
  • F. Alouges, M. Aussal, A. Lefebvre-lepot, F. Pigeonneau, A. Sellier, 3, 9th International Conference on Multiphase...
  • DuttA. et al.

    SIAM J. Sci. Comput.

    (1993)
  • GreengardL. et al.

    SIAM Rev.

    (2004)
  • PippigM. et al.

    SIAM J. Sci. Comput.

    (2013)
  • Y.B. Bao, A Parallel Implementation of the 3D NUFFT on Distributed-Memory Systems, 2015, pp....
  • PekurovskyD.

    SIAM J. Sci. Comput.

    (2012)
  • KalamkarD.D. et al.
  • Cited by (0)

    This paper and its associated computer program are available via the Computer Physics Communication homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).

    View full text