Memory-accelerated parallel method for multidimensional fast fourier implementation on GPU

Hu, Yichang; Lu, Lu; Li, Cuixu

doi:10.1007/s11227-022-04570-9

Published: 02 June 2022

Volume 78, pages 18189–18208, (2022)

The Journal of Supercomputing

Yichang Hu¹,
Lu Lu¹ &
Cuixu Li²

Abstract

The fast Fourier transform (FFT) is a well-known algorithm that computes the discrete Fourier transform (DFT) of discrete data and is an essential tool in scientific and engineering computation. Because the data volumes involved are large, executing the FFT in parallel on a graphics processing unit (GPU) can substantially improve performance. FFTW and several other FFT packages were designed following this approach, but their fixed computation pattern makes it hard to exploit the computing power of the GPU. Additionally, their memory access pattern is not optimized to alleviate the data-exchange bottleneck. Motivated by these challenges, we propose an efficient GPU-accelerated multidimensional FFT library that achieves better performance. We present a detailed implementation strategy and optimize the FFT by using as few memory transfers as possible. The data are reshuffled on the CPU, and the access mode is optimized to match the GPU memory access pattern. We also demonstrate several optimizations that improve performance across varying FFT sizes. The evaluation shows that our approach consistently outperforms rocFFT, with a speedup of about 25% to 250% on average on the AMD Instinct MI100 GPU.
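
The preview does not show the library's kernels, so the following is only a minimal CPU sketch of the general idea behind a multidimensional FFT, not the authors' implementation. It assumes power-of-two sizes, a radix-2 Cooley-Tukey transform with a bit-reversal reshuffle, and a row-column decomposition for the 2D case. All function names (`bit_reverse_permute`, `fft_1d`, `fft_2d`, `dft_2d`) are hypothetical and chosen for illustration.

```python
import cmath


def bit_reverse_permute(values):
    """Return a copy of `values` reordered by bit-reversed index.

    Assumes len(values) is a power of two. This is the data reshuffle
    that lets the butterfly passes below run in place.
    """
    n = len(values)
    bits = n.bit_length() - 1
    out = list(values)
    for i in range(n):
        j = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        if i < j:  # swap each pair once
            out[i], out[j] = out[j], out[i]
    return out


def fft_1d(values):
    """Iterative radix-2 decimation-in-time FFT (power-of-two length)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute([complex(v) for v in values])
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)  # twiddle increment
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b          # butterfly, upper output
                data[start + k + half] = a - b   # butterfly, lower output
                w *= step
        size *= 2
    return data


def fft_2d(matrix):
    """2D FFT by row-column decomposition.

    Transform every row, then every column of the result. The two
    transposes (zip) stand in for the strided accesses that a GPU
    implementation would have to arrange carefully.
    """
    rows = [fft_1d(row) for row in matrix]
    cols = [fft_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def dft_2d(matrix):
    """Direct 2D DFT from the definition, used only as a slow reference."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [
        [
            sum(
                matrix[r][c]
                * cmath.exp(-2j * cmath.pi * (u * r / n_rows + v * c / n_cols))
                for r in range(n_rows)
                for c in range(n_cols)
            )
            for v in range(n_cols)
        ]
        for u in range(n_rows)
    ]
```

Calling `fft_2d(sample)` on a small power-of-two matrix and comparing it element by element against `dft_2d(sample)` is a quick way to check the decomposition. The column pass is where memory layout matters most: in row-major storage it reads elements one full row apart, which is the kind of access pattern a GPU library needs to reorganize.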

Data availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study; source data are provided with the paper in Figs. 6, 7, 8 and 9.

Acknowledgements

This work was supported in part by the Major Project on the Integration of Industry, Education and Research of Zhongshan under Grant 210602103890051.

Author information

Authors and Affiliations

School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510000, Guangdong, China
Yichang Hu & Lu Lu
Guangdong MeiWeiXian Flavoring Food Co., LTD., Zhongshan, 528437, Guangdong, China
Cuixu Li

Corresponding author

Correspondence to Lu Lu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hu, Y., Lu, L. & Li, C. Memory-accelerated parallel method for multidimensional fast fourier implementation on GPU. J Supercomput 78, 18189–18208 (2022). https://doi.org/10.1007/s11227-022-04570-9

Accepted: 27 April 2022
Published: 02 June 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11227-022-04570-9
