Memory-accelerated parallel method for multidimensional fast Fourier implementation on GPU

Abstract

The fast Fourier transform (FFT) is a well-known algorithm that computes the discrete Fourier transform (DFT) of discrete data and is an essential tool in scientific and engineering computation. Because the data volumes involved are large, executing the FFT in parallel on a graphics processing unit (GPU) can effectively improve performance. Following this approach, FFTW and other FFT packages were designed, but their fixed computation patterns make it hard to exploit the computing power of the GPU, and their memory access patterns are not optimized to alleviate the data-exchange bottleneck. Motivated by these challenges, in this paper we propose an efficient GPU-accelerated multidimensional FFT library that achieves better performance. We present a detailed and clear implementation strategy and optimize the FFT by performing as few memory transfers as possible. The data are reshuffled on the CPU, and the access mode is optimized to match the GPU memory access pattern. We also describe several optimizations that enhance the performance of our approach for varying FFT sizes, and our evaluation shows that it consistently outperforms rocFFT, with speedups of about 25% to 250% on average on an AMD Instinct MI100 GPU.
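
The paper's library itself is not reproduced here, but the row-column decomposition that multidimensional FFT implementations build on (1D FFTs along one axis, a transpose-style reshuffle, then 1D FFTs along the other axis) can be sketched in a few lines. The C++ sketch below is illustrative only: it runs on the CPU, assumes power-of-two sizes, and the helper names fft1d, transpose, and fft2d are hypothetical, corresponding neither to the authors' implementation nor to rocFFT's API.

// Minimal CPU-only sketch of a 2D FFT via the row-column method:
// FFT each row, transpose, FFT each row again, transpose back.
#include <cmath>
#include <complex>
#include <cstdio>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// Iterative radix-2 Cooley-Tukey FFT of one contiguous row (n must be a power of two).
void fft1d(cd* a, std::size_t n) {
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // Butterfly stages.
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const cd wlen = std::polar(1.0, -2.0 * pi / static_cast<double>(len));
        for (std::size_t i = 0; i < n; i += len) {
            cd w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k, w *= wlen) {
                const cd u = a[i + k];
                const cd v = a[i + k + len / 2] * w;
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
            }
        }
    }
}

// Out-of-place transpose of an n x n row-major array (the "reshuffle" step).
void transpose(const std::vector<cd>& in, std::vector<cd>& out, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            out[c * n + r] = in[r * n + c];
}

// 2D FFT of an n x n array: row FFTs, transpose, row FFTs along the former
// columns, transpose back, so every 1D pass touches contiguous memory.
void fft2d(std::vector<cd>& data, std::size_t n) {
    std::vector<cd> tmp(n * n);
    for (std::size_t r = 0; r < n; ++r) fft1d(&data[r * n], n);
    transpose(data, tmp, n);
    for (std::size_t r = 0; r < n; ++r) fft1d(&tmp[r * n], n);
    transpose(tmp, data, n);
}

int main() {
    const std::size_t n = 8;
    std::vector<cd> x(n * n, cd(1.0, 0.0));  // constant input
    fft2d(x, n);
    // For a constant input the spectrum concentrates at (0,0) with value n*n.
    std::printf("X[0][0] = %.1f (expected %.1f)\n", x[0].real(), double(n * n));
    return 0;
}

The explicit transposes are what keep every 1D pass contiguous in memory; on a GPU the same idea maps to coalesced global-memory accesses, which is the access-pattern concern the abstract highlights.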

Data availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study; source data are provided with the paper in Figs. 6, 7, 8, and 9.

References

  1. Cheng S, Yu H-R, Inman D, Liao Q, Wu Q, Lin J (2020) CUBE: towards an optimal scaling of cosmological N-body simulations. In: 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp 685–690. https://doi.org/10.1109/CCGrid49817.2020.00-22

  2. Watson W, Spedding TA (1982) The time series modelling of non-Gaussian engineering processes. Wear 83(2):215–231. https://doi.org/10.1016/0043-1648(82)90178-8

  3. Biwer CM, Capano CD, De S, Cabero M, Brown DA, Nitz AH, Raymond V (2019) PyCBC Inference: a Python-based parameter estimation toolkit for compact binary coalescence signals. Publ Astron Soc Pac 131(996):024503. https://doi.org/10.1088/1538-3873/aaef0b

  4. Haynes PD, Côté M (2000) Parallel fast Fourier transforms for electronic structure calculations. Comput Phys Commun 130(1):130–136. https://doi.org/10.1016/S0010-4655(00)00049-7

  5. Després P, Jia X (2017) A review of GPU-based medical image reconstruction. Physica Med 42:76–92. https://doi.org/10.1016/j.ejmp.2017.07.024

  6. Cipra BA (2000) The best of the 20th century: editors name top 10 algorithms. SIAM News 33(4):1–2

  7. Frigo M, Johnson SG (1998) FFTW: an adaptive software architecture for the FFT. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181), vol 3, pp 1381–1384. https://doi.org/10.1109/ICASSP.1998.681704

  8. Frigo M, Johnson SG (1997) The fastest Fourier transform in the West. Technical Report MIT-LCS-TR-728, MIT Laboratory for Computer Science, Cambridge, MA

  9. Nukada A, Sato K, Matsuoka S (2012) Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In: SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp 1–10. https://doi.org/10.1109/SC.2012.100

  10. Gu L, Li X, Siegel J (2010) An empirically tuned 2D and 3D FFT library on CUDA GPU. In: Proceedings of the 24th ACM International Conference on Supercomputing. ICS ’10, pp 305–314. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1810085.1810127.

  11. Schwaller B, Ramesh B, George AD (2017) Investigating TI Keystone II and quad-core ARM Cortex-A53 architectures for on-board space processing. In: 2017 IEEE High Performance Extreme Computing Conference (HPEC), pp 1–7. https://doi.org/10.1109/HPEC.2017.8091094

  12. Cooley JW, Tukey JW (1965) An algorithm for the machine calculation of complex Fourier series. Math Comput 19(90):297–301

  13. Gentleman WM, Sande G (1966) Fast Fourier transforms: for fun and profit. In: Proceedings of the November 7–10, 1966, Fall Joint Computer Conference. AFIPS ’66 (Fall), pp 563–578. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1464291.1464352.

  14. Swarztrauber PN (1984) FFT algorithms for vector computers. Parallel Comput 1(1):45–63. https://doi.org/10.1016/S0167-8191(84)90413-7

  15. Luo Y, Li Y, Yang J, Ma L, Huang W, Xu B (2021) Optimization of the randomness extraction based on Toeplitz matrix for high-speed QRNG post-processing on GPU. In: 2021 13th International Conference on Communication Software and Networks (ICCSN), pp 261–264. https://doi.org/10.1109/ICCSN52437.2021.9463613

  16. Zhao Z, Zhao Y (2018) The optimization of FFT algorithm based with parallel computing on GPU. In: 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), pp 2003–2007. https://doi.org/10.1109/IAEAC.2018.8577843

  17. Nejedly P, Plesinger F, Halamek J, Jurak P (2018) CudaFilters: a SignalPlant library for GPU-accelerated FFT and FIR filtering. Softw Pract Exp 48(1):3–9. https://doi.org/10.1002/spe.2507

  18. Ogata Y, Endo T, Maruyama N, Matsuoka S (2008) An efficient, model-based CPU-GPU heterogeneous FFT library. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp 1–10. https://doi.org/10.1109/IPDPS.2008.4536163

  19. Cılasun H, Resch S, Chowdhury ZI, Olson E, Zabihi M, Zhao Z, Peterson T, Wang J-P, Sapatnekar SS, Karpuzcu U (2020) CRAFFT: high resolution FFT accelerator in spintronic computational RAM. In: 2020 57th ACM/IEEE Design Automation Conference (DAC), pp 1–6. https://doi.org/10.1109/DAC18072.2020.9218673

  20. Chen X, Lei Y, Lu Z, Chen S (2018) A variable-size FFT hardware accelerator based on matrix transposition. IEEE Trans Very Large Scale Integr Syst 26(10):1953–1966. https://doi.org/10.1109/TVLSI.2018.2846688

  21. Li Z, Jia H, Zhang Y, Chen T, Yuan L, Cao L, Wang X (2019) AutoFFT: a template-based FFT codes auto-generation framework for ARM and X86 CPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’19. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3295500.3356138.

  22. Ayala A, Tomov S, Luo X, Shaiek H, Haidar A, Bosilca G, Dongarra J (2019) Impacts of multi-GPU MPI collective communications on large FFT computation. In: 2019 IEEE/ACM Workshop on Exascale MPI (ExaMPI), pp 12–18. https://doi.org/10.1109/ExaMPI49596.2019.00007

  23. Chen S, Li X (2013) A hybrid GPU/CPU FFT library for large FFT problems. In: 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC), pp 1–10. https://doi.org/10.1109/PCCC.2013.6742796

  24. Gholami A, Hill J, Malhotra D, Biros G (2015) AccFFT: a library for distributed-memory FFT on CPU and GPU architectures. arXiv preprint arXiv:1506.07933

  25. Cecka C (2017) Low communication FMM-accelerated FFT on GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’17. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3126908.3126919.

  26. Markidis S, Chien SWD, Laure E, Peng IB, Vetter JS (2018) NVIDIA tensor core programmability, performance & precision. In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp 522–531. https://doi.org/10.1109/IPDPSW.2018.00091

  27. Sorna A, Cheng X, D’Azevedo E, Wong K, Tomov S (2018) Optimizing the fast Fourier transform using mixed precision on tensor core hardware. In: 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW), pp 3–7. https://doi.org/10.1109/HiPCW.2018.8634417

  28. Cheng X, Sorna A, D’Azevedo E, Wong K, Tomov S (2018) Accelerating 2D FFT: exploit GPU tensor cores through mixed-precision. In: The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’18), ACM Student Research Poster, Dallas, TX

  29. Durrani S, Chughtai MS, Dakkak A, Hwu W-m, Rauchwerger L (2021) FFT Blitz: The Tensor Cores Strike Back. In: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’21, pp 488–489. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3437801.3441623.

  30. Abtahi T, Shea C, Kulkarni A, Mohsenin T (2018) Accelerating convolutional neural network with FFT on embedded hardware. IEEE Trans Very Large Scale Integr Syst 26(9):1737–1749. https://doi.org/10.1109/TVLSI.2018.2825145

  31. Lee J, Kang H, Yeom H-J, Cheon S, Park J, Kim D (2021) Out-of-core GPU 2D-shift-FFT algorithm for ultra-high-resolution hologram generation. Opt Express 29(12):19094–19112

  32. Kang H, Lee J, Kim D (2021) HI-FFT: heterogeneous parallel in-place algorithm for large-scale 2D-FFT. IEEE Access 9:120261–120273. https://doi.org/10.1109/ACCESS.2021.3108404

Acknowledgements

This work was supported in part by the Major Project on the Integration of Industry, Education and Research of Zhongshan under Grant 210602103890051.

Author information

Corresponding author

Correspondence to Lu Lu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hu, Y., Lu, L. & Li, C. Memory-accelerated parallel method for multidimensional fast Fourier implementation on GPU. J Supercomput 78, 18189–18208 (2022). https://doi.org/10.1007/s11227-022-04570-9
