DOI: 10.1145/3368474.3368488

FFTE on SVE: SPIRAL-Generated Kernels

Published: 15 January 2020

Abstract

In this paper, we propose an implementation of the fast Fourier transform (FFT) targeting the ARM Scalable Vector Extension (SVE). We vectorized FFT kernels in two ways, automatically via a compiler and explicitly through code generation by SPIRAL, and compared their performance. We show that explicit vectorization of SPIRAL-generated code improves performance significantly. We report performance results for FFTs on RIKEN's Fugaku processor simulator. With the ARM compiler, SPIRAL-generated FFT kernels written with SVE intrinsics are up to 3.16 times faster than the FFT kernels of FFTE written in Fortran and up to 5.62 times faster than SPIRAL-generated FFT kernels written in C.
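
The abstract contrasts compiler auto-vectorization with explicit vectorization through SVE intrinsics but does not show any kernel code. The following is a minimal illustrative sketch of that contrast, not the paper's SPIRAL-generated code: a twiddle-free radix-2 butterfly (element-wise sum and difference) written once as a plain C loop and once with SVE ACLE intrinsics. The function names `butterfly2_scalar` and `butterfly2_explicit` are hypothetical, and the data layout (separate arrays of doubles) is an assumption made for brevity. On targets without SVE the explicit version falls back to the plain loop so that the sketch still compiles.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Plain C loop: vectorization is left entirely to the compiler. */
void butterfly2_scalar(size_t n, const double *a, const double *b,
                       double *sum, double *diff)
{
    for (size_t i = 0; i < n; i++) {
        sum[i]  = a[i] + b[i];
        diff[i] = a[i] - b[i];
    }
}

/* Explicit vectorization with SVE intrinsics (hypothetical sketch).
 * The loop is vector-length agnostic: svcntd() gives the number of
 * 64-bit lanes at run time, and the predicate from svwhilelt_b64
 * masks off lanes past the end of the arrays. */
void butterfly2_explicit(size_t n, const double *a, const double *b,
                         double *sum, double *diff)
{
#if defined(__ARM_FEATURE_SVE)
    for (uint64_t i = 0; i < (uint64_t)n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, (uint64_t)n);
        svfloat64_t va = svld1_f64(pg, a + i);
        svfloat64_t vb = svld1_f64(pg, b + i);
        svst1_f64(pg, sum + i,  svadd_f64_x(pg, va, vb));
        svst1_f64(pg, diff + i, svsub_f64_x(pg, va, vb));
    }
#else
    /* No SVE available: fall back to the plain loop. */
    butterfly2_scalar(n, a, b, sum, diff);
#endif
}
```

A complex butterfly would apply the same operation to the real and imaginary parts, and a full FFT kernel would add twiddle-factor multiplications. Whether the explicit version outperforms the compiler-vectorized loop depends on the compiler and the kernel; the speedups quoted in the abstract refer to the paper's kernels, not to this sketch.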

Published In

HPCAsia '20: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
January 2020
247 pages
ISBN:9781450372367
DOI:10.1145/3368474
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ARM SVE
  2. FFT
  3. SPIRAL
  4. vectorization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

HPCAsia2020

Acceptance Rates

Overall Acceptance Rate 69 of 143 submissions, 48%

Cited By

  • (2024) Xphase3d: Memory-Distributed Phase Retrieval for Reconstructing Large-Scale 3D Density Maps of Biological Macromolecules. 2024 IEEE International Conference on Cluster Computing (CLUSTER), 394-402. DOI: 10.1109/CLUSTER59578.2024.00041. Online publication date: 24-Sep-2024
  • (2024) High-Performance FFT Code Generation via MLIR Linalg Dialect and SIMD Micro-Kernels. 2024 IEEE International Conference on Cluster Computing (CLUSTER), 155-165. DOI: 10.1109/CLUSTER59578.2024.00021. Online publication date: 24-Sep-2024
  • (2023) CPU Architecture Modelling and Co-design. High Performance Computing, 3-21. DOI: 10.1007/978-3-031-32041-5_1. Online publication date: 21-May-2023
  • (2022) Assessing the State of Autovectorization Support based on SVE. 2022 IEEE International Conference on Cluster Computing (CLUSTER), 556-562. DOI: 10.1109/CLUSTER51413.2022.00073. Online publication date: Sep-2022
  • (2021) An Auto-tuning with Adaptation of A64 Scalable Vector Extension for SPIRAL. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 789-797. DOI: 10.1109/IPDPSW52791.2021.00117. Online publication date: Jun-2021
  • (2021) A memory bandwidth improvement with memory space partitioning for single-precision floating-point FFT on Stratix 10 FPGA. 2021 IEEE International Conference on Cluster Computing (CLUSTER), 787-790. DOI: 10.1109/Cluster48925.2021.00117. Online publication date: Sep-2021
  • (2021) Accelerating Level 2 BLAS Based on ARM SVE. 2021 4th International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), 1018-1022. DOI: 10.1109/AEMCSE51986.2021.00208. Online publication date: Mar-2021
  • (2020) Porting Applications to Arm-based Processors. 2020 IEEE International Conference on Cluster Computing (CLUSTER), 559-566. DOI: 10.1109/CLUSTER49012.2020.00079. Online publication date: Sep-2020
