Abstract
The fast Fourier transform (FFT) is a well-known algorithm for computing the discrete Fourier transform (DFT) of discrete data and is an essential tool in scientific and engineering computation. Because the data sets involved are large, executing the FFT in parallel on a graphics processing unit (GPU) can substantially improve performance. Following this approach, FFTW and other FFT packages were designed, but their fixed computation pattern makes it hard to fully utilize the computing power of a GPU. In addition, their memory access pattern is not optimized to alleviate the data-exchange bottleneck. Motivated by these challenges, we propose an efficient GPU-accelerated multidimensional FFT library. We present a detailed implementation strategy and optimize the FFT by performing as few memory transfers as possible. The data are reshuffled on the GPU, and the access mode is optimized to match the GPU memory access pattern. We also describe several optimizations that improve performance for varying FFT sizes. Our evaluation shows that the approach consistently outperforms rocFFT, with an average speedup of about 25% to 250% on an AMD Instinct MI100 GPU.
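To make the role of data reshuffling concrete, the following is a minimal CPU-side sketch in plain Python, not the library described in this paper. It assumes the common row-column decomposition of a 2D FFT, in which a transpose between the two 1D passes lets each pass read contiguous data. The function names `fft1d`, `transpose`, and `fft2d` are hypothetical and chosen only for illustration.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft1d(x[0::2])  # transform of even-indexed samples
    odd = fft1d(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    """Reshuffle a row-major matrix so that its columns become rows."""
    return [list(col) for col in zip(*m)]


def fft2d(m):
    """2D FFT as two passes of 1D FFTs separated by a reshuffle (transpose)."""
    rows = [fft1d(r) for r in m]     # pass 1: along contiguous rows
    cols = transpose(rows)           # reshuffle: columns become contiguous
    cols = [fft1d(c) for c in cols]  # pass 2: along the other dimension
    return transpose(cols)           # restore the original layout


if __name__ == "__main__":
    # A unit impulse at the origin transforms to a constant spectrum.
    print(fft2d([[1, 0], [0, 0]]))
```

On a GPU the same idea is usually motivated by coalesced memory access: after the reshuffle, neighbouring threads read neighbouring addresses. How the proposed library arranges its passes and transfers is not specified in this abstract, so the sketch should be read only as an illustration of the general technique.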









Acknowledgements
This work was supported in part by the Major Project on the Integration of Industry, Education and Research of Zhongshan under Grant 210602103890051.
About this article
Cite this article
Hu, Y., Lu, L. & Li, C. Memory-accelerated parallel method for multidimensional fast fourier implementation on GPU. J Supercomput 78, 18189–18208 (2022). https://doi.org/10.1007/s11227-022-04570-9
DOI: https://doi.org/10.1007/s11227-022-04570-9