DOI: 10.1145/3295500.3356138
research-article

AutoFFT: a template-based FFT codes auto-generation framework for ARM and X86 CPUs

Published: 17 November 2019

Abstract

The discrete Fourier transform (DFT) is widely used in scientific and engineering computation. This paper proposes a template-based code generation framework named AutoFFT that can automatically generate high-performance fast Fourier transform (FFT) codes. As its outer parallelization framework, AutoFFT employs the Cooley-Tukey FFT algorithm, which exploits the symmetric and periodic properties of the DFT matrix. To further reduce the number of floating-point operations in the butterflies, we explore additional symmetric and periodic properties of the DFT matrix and formulate two optimized calculation templates for prime and power-of-two radices. To fully exploit hardware resources, we encapsulate a series of optimizations in an assembly template optimizer. Given any DFT problem, AutoFFT automatically generates C FFT kernels using these two templates and translates them into efficient assembly code using the template optimizer. Experiments show that, averaged across all FFT types, AutoFFT outperforms FFTW, ARMPL, and Intel MKL on ARMv8 and Intel x86-64 processors.
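To make the role of the symmetric and periodic properties concrete, here is a minimal sketch of a textbook radix-2 Cooley-Tukey FFT in Python. It is not AutoFFT's generated code or one of its templates: the function names `naive_dft` and `fft_radix2` are illustrative assumptions, and the sketch handles only power-of-two lengths with a single radix-2 butterfly.

```python
import cmath


def naive_dft(x):
    """Direct O(n^2) evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time Cooley-Tukey FFT.

    Illustrative only: the input length must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    # Split into two half-size DFTs over the even- and odd-indexed samples.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])

    half = n // 2
    out = [0j] * n
    for k in range(half):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor w_n^k
        t = w * odd[k]
        # Periodicity: the half-size DFTs repeat with period n/2, so even[k]
        # and odd[k] serve both output k and output k + n/2.
        out[k] = even[k] + t
        # Symmetry: w_n^(k + n/2) = -w_n^k, so the second output reuses t.
        out[k + half] = even[k] - t
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8]
    fast = fft_radix2(samples)
    slow = naive_dft(samples)
    print(max(abs(a - b) for a, b in zip(fast, slow)))  # tiny rounding error
```

Each loop iteration is one radix-2 butterfly: a single complex multiplication yields two outputs, which is where the savings over the direct O(n^2) sum come from. The abstract's prime and power-of-two radix templates target larger butterflies than this one.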

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. AutoFFT
  2. DFT
  3. FFT
  4. code generation
  5. template

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program of China
  • National Natural Science Foundation of China

Conference

SC '19
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 78
  • Downloads (Last 6 weeks): 8

Reflects downloads up to 21 Jan 2025.

Cited By

  • (2024) Pimacolaba: Collaborative Acceleration for FFT on Commercial Processing-In-Memory Architectures. Proceedings of the International Symposium on Memory Systems, 13-25. DOI: 10.1145/3695794.3695796. Online publication date: 30-Sep-2024.
  • (2024) Optimizing depthwise separable convolution on DCU. CCF Transactions on High Performance Computing. DOI: 10.1007/s42514-024-00200-3. Online publication date: 13-Dec-2024.
  • (2023) GFFT: a Task Graph Based Fast Fourier Transform Optimization Framework. Proceedings of the 52nd International Conference on Parallel Processing, 513-523. DOI: 10.1145/3605573.3605587. Online publication date: 7-Aug-2023.
  • (2023) Generating Fast FFT Kernels on CPUs via FFT-Specific Intrinsics. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 427-428. DOI: 10.1145/3572848.3577477. Online publication date: 25-Feb-2023.
  • (2023) Optimization of the FFT Algorithm on RISC-V CPUs. High Performance Computing, 515-525. DOI: 10.1007/978-3-031-40843-4_38. Online publication date: 25-Aug-2023.
  • (2023) Optimizing Yinyang K-Means Algorithm on ARMv8 Many-Core CPUs. Algorithms and Architectures for Parallel Processing, 676-690. DOI: 10.1007/978-3-031-22677-9_36. Online publication date: 11-Jan-2023.
  • (2022) EasyView: Enabling and Scheduling Tensor Views in Deep Learning Compilers. Proceedings of the 51st International Conference on Parallel Processing, 1-11. DOI: 10.1145/3545008.3545037. Online publication date: 29-Aug-2022.
  • (2022) Optimizing Depthwise Separable Convolution Operations on GPUs. IEEE Transactions on Parallel and Distributed Systems 33:1, 70-87. DOI: 10.1109/TPDS.2021.3084813. Online publication date: 1-Jan-2022.
  • (2022) A Low-Rank CNN Architecture for Real-Time Semantic Segmentation in Visual SLAM Applications. IEEE Open Journal of Circuits and Systems 3, 115-133. DOI: 10.1109/OJCAS.2022.3174632. Online publication date: 2022.
  • (2021) A Transpose-free Three-dimensional FFT Algorithm on ARM CPUs. 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 1-8. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys53884.2021.00028. Online publication date: Dec-2021.
